DEV Community

Cover image for Automatically Update BigQuery View Schema Changes
matthieucham for Stack Labs

Posted on

Automatically Update BigQuery View Schema Changes

SQL views are virtual tables simplifying data access and security. They offer tailored data perspectives, protecting sensitive information. Data analysts widely use them to streamline modeling.

As such, views are a crucial feature of Google Cloud's fully managed data warehouse, BigQuery. However, they have certain limitations. One of these limitations can be particularly troublesome for data analysts and end-users:

The schemas of the underlying tables are stored with the view when the view is created. If columns are added, deleted, or modified after the view is created, the view isn't automatically updated and the reported schema will remain inaccurate until the view SQL definition is changed or the view is recreated. Even though the reported schema may be inaccurate, all submitted queries produce accurate results.

To see this limitation into action, create a source table with two columns

CREATE OR REPLACE TABLE `demo_devto.source_table` (
  A STRING,
  B STRING
) AS (
  SELECT "a", "b" 
)
Enter fullscreen mode Exit fullscreen mode

Then create a view above it

CREATE OR REPLACE VIEW `demo_devto.expo_view` AS (
  SELECT * from `demo_devto.source_table`
)
Enter fullscreen mode Exit fullscreen mode

As expected, the schema of the view presents 2 columns A and B

Image description

Now add a column to the source table

ALTER TABLE `demo_devto.source_table`
  ADD COLUMN C STRING
Enter fullscreen mode Exit fullscreen mode

The new column is reflected by the source table's schema

Image description

But not by the view's schema

Image description

Still, the result of a query is correct with 3 columns

Image description

This article outlines a method to circumvent this limitation and maintain the view's schema in alignment with the underlying table's schema as closely as possible.

A fully serverless event-driven architecture to synchronize schemas

This solution make use of a log sink to capture audit logs from BigQuery, a PubSub topic where relevant log entries are directed, a PubSub subscription and a Cloud Run service to process them

Image description

Let's review each step and dive into details

1. Bigquery audit logs

All Google Cloud services generate logs which are viewable in Cloud Logging. BigQuery is no exception and audit logs offer all information we need. See their structure here

2. Cloud Logging log sink

A log sink is a location where the logs are collected and stored. Google Cloud Logging log sinks collect within a scope - project, folder, organization. So to capture update logs from tables for a whole organization, a log sink at organization level is needed. To monitor a project only, a sink at project level is enough.

A log sink must declare a filter. This is very important to limit costs - which depend of the volume of captured logs - and to process relevant events only. Here we are using the following filter to capture events about schema changes:

resource.type="bigquery_resource"
AND protoPayload.serviceName="bigquery.googleapis.com"
AND protoPayload.methodName="tableservice.update"
AND protoPayload.authenticationInfo.principalEmail !~ <regex identifying the service account used by the cloud run service who process logs>
Enter fullscreen mode Exit fullscreen mode

The filter on principalEmail serves as a mechanism to identify updates to exposition views made by the Cloud Run service, which we wish to exclude from processing as our focus lies solely on source table update events.

Finally, we need to give the sink a destination, where received logs who pass the filter are routed. Several kinds of destination are possible. Because our architecture is event-driven, the selected destination is a PubSub topic. The log entry is then encoded as JSON

Here is how to provision such a sink with Terraform, at project level:

resource "google_logging_project_sink" "demo" {
  provider               = google-beta
  project                = "my-project"
  name                   = "logsink-demo"
  destination            = "pubsub.googleapis.com/${google_pubsub_topic.demo.id}"
  filter                 = <<EOT
    resource.type="bigquery_resource"
    AND protoPayload.serviceName="bigquery.googleapis.com"
    AND protoPayload.methodName="tableservice.update"
    AND protoPayload.authenticationInfo.principalEmail !~ "^sa-demo@myproject.iam.gserviceaccount.com$"
  EOT
  unique_writer_identity = true
}

resource "google_pubsub_topic_iam_member" "demo" {
  provider = google-beta
  topic    = google_pubsub_topic.demo.id
  role     = "roles/pubsub.publisher"
  member   = google_logging_project_sink.demo.writer_identity
}
Enter fullscreen mode Exit fullscreen mode

3. PubSub topic and subscription

The PubSub topic is the destination of log events who pass the log sink filter.

To consume these events, a subscription in PUSH mode send these events to a HTTPS endpoint.

Here is an example of how these resources can be provisioned with Terraform:

resource "google_pubsub_topic" "demo" {
  provider = google-beta
  name     = "topic-demo"
}

resource "google_pubsub_subscription" "demo" {
  provider             = google-beta
  name                 = "sub-demo"
  topic                = google_pubsub_topic.demo.id
  ack_deadline_seconds = 600

  push_config {
    push_endpoint = <URL of the cloud run endpoint>
    oidc_token {
      service_account_email = google_service_account.default.email
    }
  }

  expiration_policy {
    ttl = ""
  }
}
Enter fullscreen mode Exit fullscreen mode

4. 5. and 6. Events processing

The processing of log events is performed by a Cloud Run service in this system, but could be done by a Cloud Function for example.

In Python, the decoding of incoming events can be done like this:

import base64
import json

bq_log = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
Enter fullscreen mode Exit fullscreen mode

By parsing the bq_log object, we can retrieve the updated table id:

import re

from google.cloud.bigquery import TableReference

RESOURCENAME_PATTERN = re.compile(
    "^projects/(?P<project>[^/]+)/datasets/(?P<dataset>[^/]+)/tables/(?P<table>[^/]+)$"
)

resource_name = bq_log.get("protoPayload", {}).get("resourceName", "")
if (match := RESOURCENAME_PATTERN.match(resource_name)) is not None:
    return TableReference.from_api_repr(
        {k + "Id": match.group(k) for k in ["project", "dataset", "table"]}
    )
Enter fullscreen mode Exit fullscreen mode

The next step is to identify the views which relies on this source table. Here, associations between source tables and exposition views are registered in a Firestore database, but other designs are possible. For example, you could query INFORMATION_SCHEMA.VIEWS metadata views and identify the affected views by parsing the content of the VIEW_DEFINITION column

SELECT VIEW_DEFINITION FROM `demo_devto.INFORMATION_SCHEMA.VIEWS`
Enter fullscreen mode Exit fullscreen mode

Finally, synchronize all affected views. BigQuery views seem to not support the updating of the "schema" field by the update_table() method when columns are added. The recommended way is then to re-create views with SQL DDL statements:

CREATE OR REPLACE VIEW AS ...
Enter fullscreen mode Exit fullscreen mode

With all steps pieced together, any schema update from source tables automatically triggers the re-creation of exposition views, keeping the schema synchronized after a short delay !


Thanks for reading! Iā€™m Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthousiast Data Engineering team, please contact us.


Cover picture by Miguel Delmar on Unsplash

Top comments (0)