TLDR;
Recently I had a situation where we had 400+ wildcard tables that were not named correctly. I used a single SQL statement to programatically fetch all the table names and rename them based on a regex.
What are wildcard tables and what can you do with them
Wildcard tables are individual tables that have the same preffix e.g.:
sales_20220101
,sales_20220102
, etc.
For more details see the official documentation.
BigQuery has a filter function _TABLE_SUFFIX
that allows you to do some clever stuff with wildcard tables:
SELECT
count(*)
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE _TABLE_SUFFIX BETWEEN '29' and '35'
This will select all the tables 1929, 1930 ... 1935
.
I personally prefer partitioned tables, but wildcard have their uses. Namely when you load data from a 3rd party and you expect that they might change their schema at some point in the future.
More on that in another article.
The problem
Tables were annoyingly named
ABC_20220101_0700_BLA
which made using_TABLE_SUFFIX
impossible.Luckily with a simple string subset I can easily get to the correct naming schema
ABC_20220101
.However the biggest issue was the volume of tables (over 400), so anything manual was out of the question.
The solution
My first thought was to use the BigQuery Python API, but I decided to challenge myself and use just SQL.
There are several features in BQ SQL that I used:
Data definition language (DDL) - you know this from
CREATE TABLE
orDROP TABLE
, But did you know there is
ALTER TABLE as wellProcedural language - Most SQL users will be familiar with
DECLARE
andSET
keywords. But BigQuery SQL allows you add complex logic likeIF
andWHILE
statements. In this problem I used FOR ININFORMATION_SCHEMA
is a special "dataset" that has several views which give you a lot of metadata about your warehouse (datastes, tables, views, etc.).
Read more about it here.
The complete code:
FOR t IN
(
SELECT
table_id as old_name,
regexp_extract(table_id, r"ABC_\d{8}") as new_name
FROM my_dataset.__TABLES__
)
DO
EXECUTE IMMEDIATE (
"ALTER TABLE my_dataset." || t.old_name || " RENAME TO " || t.new_name
);
END FOR;
Explanation:
FOR t IN
(
-- some query that we loop over
)
DO
-- run some other query
END FOR;
- The
FOR
loop allows us to itterate over each row of any query
SELECT
table_id as old_name,
FROM my_dataset.__TABLES__
-
my_dataset.__TABLES__
wil return a table that has a lot of information about each table in this dataset. I only cared about the name of the tables:table_id
regexp_extract(table_id, r"ABC_\d{8}") as new_name
- The regular expression only takes the literal string
ABC_
and then 8 digits
EXECUTE IMMEDIATE (
"ALTER TABLE my_dataset." || t.old_name || " RENAME TO " || t.new_name
);
-
ALTER TABLE my_dataset.x RENAME TO y
does exactly what it says on the tin - renames a table. (Don't forget to add the dataset) -
EXECUTE IMMEDIATE
allows us to run a string as a query. Think of it aseval
in Python. - BigQuery SQL syntax will not let us to
SELECT ... FROM my_dataset.some_var
because it will treatsome_var
as literal rather than replacing it with its value. - That is why we need to first build a string of the query. And that's what the
||
chars are for. They just concatenate strings. - Here is what the evaluated string looks like:
"ALTER TABLE my_dataset.ABC_123_BLA RENAME TO ABC_123_BLA "
Final thoughts:
I was getting some weird error:
Not found: Dataset my-project:my_dataset was not found in location US at [3:7]
It turns out that using the bacticks "`" around my table and dataset names was causing this - so I removed them and it all worked 🤷
Top comments (0)