To all BigQuery lovers out there (and everyone else too!): Google Cloud Next '21 has just wrapped up, and it came with a big batch of announcements about data!
Let's go through the latest BigQuery news and features, announced at Next or over the past few weeks. (Careful: some of them are still in Preview.)
Want to migrate your existing data warehouse to BigQuery? (Congratulations, it's probably a good idea 🎉) Check out the BigQuery Migration Service, a set of free tools to help you migrate. One of them is particularly interesting: the SQL translator. Accessible from the API or the Console, this tool helps you translate your current SQL queries into BigQuery Standard SQL.
Only Teradata SQL is supported for now, but let's bet more dialects are on the way!
There is now support for transactions in BigQuery! Yes, you read that right: transactions!
The feature is called multi-statement transactions and lets you perform data modifications across one or more tables with ACID guarantees. During the transaction, all reads return a consistent version of the tables it references, and the modifications are either all committed or all rolled back.
Multi-statement transactions spanning multiple queries are started within a session. This new session abstraction lets you isolate users and applications from each other.
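As a minimal sketch (the dataset and table names are made up for illustration), a transaction that atomically moves rows from a staging table into a final table could look like this:

```sql
BEGIN TRANSACTION;

-- both statements succeed together or not at all
INSERT INTO mydataset.sales
SELECT * FROM mydataset.sales_staging
WHERE sale_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

DELETE FROM mydataset.sales_staging
WHERE sale_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

COMMIT TRANSACTION;
```

If anything fails in between, you can issue ROLLBACK TRANSACTION instead and no rows are moved.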
Heard of time travel in BigQuery? It's pretty useful, but it only lets you go back 7 days. To keep the state of a table for longer than that, table snapshots are here to help: they let you preserve the contents of a table at a particular point in time and keep that image for as long as you want. BigQuery minimizes the storage cost by only storing the bytes that differ between a snapshot and its base table.
Tip: think about creating periodic snapshots with the query scheduler.
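Creating a snapshot is a single DDL statement. A sketch, with hypothetical table names and an optional 30-day expiration (leave out the OPTIONS clause to keep it forever):

```sql
CREATE SNAPSHOT TABLE mydataset.my_table_snapshot
CLONE mydataset.my_table
OPTIONS (
  -- drop the snapshot automatically after 30 days
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```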
A similar upcoming feature is the clone. While table snapshots are immutable (you can restore them but not edit them directly), a clone is a mutable copy of the base table. Clones let you duplicate a table and perform read/write/schema-evolution operations on the copy, which is pretty useful for testing production changes. Another nice thing: just like snapshots, BigQuery only bills you for the new data, because it stores only the difference between the base table and the clone.
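The clone syntax mirrors the snapshot one (again with hypothetical names; the feature is still rolling out, so check availability in your project):

```sql
-- a writable copy, billed only for data that diverges from the base table
CREATE TABLE mydataset.my_table_clone
CLONE mydataset.my_table;
```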
You have a machine learning model to train with BigQuery ML, but you only want to use a subset of a table as the training set? Try the table sampling feature: sampling returns a variety of records while avoiding the costs of scanning and processing the entire table (unlike the LIMIT clause!).
```sql
-- this will give you a random 10% of a table's data
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
```
Have you ever wanted parameters in views? Now there are table functions (TVFs) for that! Table-valued functions let you create a SQL function that returns a table. Think of it as creating a view with parameters; you can then query the function's result in standard queries:
```sql
CREATE OR REPLACE TABLE FUNCTION mydataset.names_by_year(y INT64) AS
SELECT year, name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = y
GROUP BY year, name;

-- use your table function in another query, just like a view
SELECT *
FROM mydataset.names_by_year(1950)
ORDER BY total DESC
LIMIT 5;
```
You can use authorized functions to give specific users access to your TVF without giving them access to the underlying tables (just like authorized views work for standard views).
There is now a new write API that unifies the old batch and streaming APIs: the Storage Write API! It gives you more control over the loading process and is more performant than its predecessors. On top of that, it is cheaper than the legacy streaming insert feature and comes with a free usage tier!
The multi-cloud analytics engine BigQuery Omni is going GA for AWS and Azure! With Omni you can query large amounts of data in AWS S3 or Azure ADLS without maintaining complex cross-cloud extract-transform-load pipelines. The feature lets multi-cloud organisations save on egress costs and join data across cloud providers and locations. The BigQuery console on GCP becomes the central access point for analytics, and you can define governance and access control in a single place!
After being in Preview for a while with Data Studio, BigQuery BI Engine is also going GA! BI Engine is a reserved in-memory analysis service used to get sub-second query results with any BI tool (Looker, Tableau, Power BI, etc.), even over very large amounts of data. It removes the need for OLAP cubes and complex ETL pipelines: Google automatically handles moving the data between standard BigQuery storage and BI Engine and keeping it fresh. And it also works with streaming!
Historically, BigQuery did not let you constrain the size of certain data types, but this is about to change: there is now a parameterized data types syntax for STRING, BYTES, NUMERIC and BIGNUMERIC. Want to raise an error if the text value in a column is longer than 10 characters? Type your column as STRING(10).
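A quick sketch with a hypothetical table: inserting a value that exceeds the declared constraint fails with an OUT_OF_RANGE error instead of being silently accepted.

```sql
CREATE TABLE mydataset.users (
  username STRING(10),    -- at most 10 characters
  balance  NUMERIC(5, 2)  -- at most 5 digits, 2 of them after the decimal point
);
```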
Among many new geography functions (not detailed here), two interesting new Standard SQL syntax additions caught my attention:
UNPIVOT, to rotate columns into rows (and its PIVOT counterpart for the reverse).
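For instance, UNPIVOT can turn one column per quarter into one row per quarter (the table and column names here are made up for illustration):

```sql
WITH Produce AS (
  SELECT 'kale' AS product, 51 AS Q1, 23 AS Q2, 45 AS Q3, 3 AS Q4
)
-- yields one row per (product, quarter) pair
SELECT product, quarter, sales
FROM Produce
UNPIVOT (sales FOR quarter IN (Q1, Q2, Q3, Q4))
```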
QUALIFY, a new SQL clause used to filter on the result of an analytic function without needing a subquery! It's kind of like HAVING, which filters on the result of a standard aggregation function:
```sql
SELECT item
FROM Produce
WHERE Produce.category = 'vegetable'
QUALIFY RANK() OVER (PARTITION BY category ORDER BY purchases DESC) <= 3
```