Nelis Goeminne for Lighthouse

Posted on Jan 24, 2023

Why we use Terraform for BigQuery

#docker #cli #software #tooling

At OTA Insight we have been using BigQuery for many years as our main data warehouse for the massive amount of pricing data that we gather on a daily basis. Besides the hotel's pricing data and its derived datasets, BigQuery also contains other data sets such as travel search data we get from external parties and data generated by internal processes.

In addition to BigQuery we have a PostgreSQL database. This is another data warehouse hosted on Cloud SQL, where data sources such as Hubspot, Intercom, Zuora, Salesforce, static hotel data and subscription configuration data are centralised.

The difference between our BigQuery data warehouse and our PostgreSQL data warehouse is that the former contains the big volume datasets that are used in heavy batch processes that benefit significantly from the BigQuery scalable query engine.

The data in the PostgreSQL data warehouse contains smaller datasets coming from transactional databases. These are extensively modelled in our data warehouse by the data analytics team using DBT so they can easily be used for internal reporting.

Although we had been using BigQuery for a long time, in the last few years it became apparent that we needed dedicated tools to manage our BigQuery resources. This is due to several factors, two main ones being:

Scaling the team. We have scaled our engineering department considerably over the past few years (from 20 engineers in 2018 to 87 engineers in 2022). As a data company where many of the engineering teams use BigQuery in one way or another, the scaling caused ownership of specific BigQuery resources to become opaque.
Looker as a BI tool. We migrated from Chartio to Looker as our main BI tool and our implementation of Looker's access control to BigQuery data (explained below) required us to have version control on the many BigQuery views that were created. This version control is not provided by BigQuery, but can easily be achieved by having view queries in a version controlled repository in GitLab for example, which then can be established on BigQuery using continuous deployment.

What is Terraform?

Terraform is an Infrastructure as Code (IaC) tool that lets users define their cloud and on-prem resources in human-readable configuration files. These configuration files are used together with a state file keeping track of your infrastructure to create, modify or delete resources.

A big advantage of infrastructure as code is that you can version, reuse and share these configuration files. This makes infrastructure deployment and maintenance easier compared to manually deploying resources from a cloud Web UI.

Although Terraform is most often associated with deployments of low-level infrastructure such as compute and networking resources, it is also able to manage higher level infrastructure and fully managed tools, such as BigQuery. For BigQuery, projects, datasets, tables, views, routines, jobs and permissions all can be specified in the Terraform configuration files.

For datasets, the attributes "description" and "labels" can be very useful to track the ownership of datasets. For tables and views, the schema and queries can also be specified in the configuration files. Tracking these configuration files in git repositories allows users to use the common code review and versioning tools on modifications of the schemas and queries.

Terraform is not the only infrastructure as code tool that exists. Every major cloud provider has its own specific tool for its platform such as AWS Cloudformation, Azure Resource Manager or Google Cloud Deployment Manager.

Terraform however supports all prominent cloud platforms and providers in just one tool. As we were already using Terraform for managing other parts of our infrastructure and as Terraform has excellent BigQuery support, we decided to also use Terraform to manage our Looker-related BigQuery views.

Looker and the need for BQ views

When we migrated to Looker as our BI platform, we needed certain datasets from BigQuery to be incorporated in it. The problem was that we couldn’t reuse user permissions defined in BigQuery and the only way to have the data exposed was through a service account.

We made a decision to not give the Looker service account access to all raw data in BigQuery, but to work with a selection of preprocessed views. This was done for a few reasons.

Security. It is security best practice to give an application (and its users) only the least privilege access as to what is needed for its functionality.
BigQuery cost optimization. The intermediate view approach allows for optimization in BigQuery of how the data is accessed and cached by using e.g. materialized views.
Looker cost optimization. Developer accounts for LookML (Looker's "programming language") are expensive and LookML has a steep learning curve. We decided to only give the Data Analytics team LookML developer roles and all the other engineers the dashboard creator roles. This created a clear separation of responsibility and ownership concerning the data modelling process.
Avoiding bottlenecks. All engineers are able to model the data they are knowledgeable about and responsible for in a BigQuery view. This view is then simply linked with LookML by the Data Analytics team in an auto generated view and corresponding explores ready for visualisation.

In practice, we decided to organise all of our BigQuery views for Looker under one, new, central BigQuery project, to which the Looker service account has access.

For most databases, access to a view requires access to the underlying data source. In BigQuery however, one can make use of authorized views and authorised datasets to give a user or group access to just the query result. This way we are certain that the Looker service account (and the users in Looker) will not be able to access data other than what was selected for the view.

The below diagram is an adaptation of the LookML projects diagram from the LookML documentation showing the scope of data access and responsibility for our use case.

Technical details

Our repository describing the BigQuery infrastructure is hosted on GitLab and has one .tf file per dataset. This file contains the dataset definition and the definition of all views in that dataset.

An example of such a configuration file is shown below. The query of every view is stored in a separate .sql file in a 'queries' folder which allows for easier development in an IDE (use of SQL-specific tooling) and the possibility to use SQL linting (we use SQLFluff) in CI.

The Terraform state is stored within GitLab's managed Terraform HTTP backend which can be used both by developers on their local machine as well as our CI/CD pipelines on GitLab, which use the terraform plan and terraform apply commands against our infrastructure configuration.

resource "google_bigquery_dataset" "example_dataset_alias" {
 project     = "project_name"
 dataset_id  = "example_dataset"
 description = "This dataset is created as an example to demonstrate Terraform"
 location    = "EU"
}


resource "google_bigquery_table" "example_table_alias" {
 project             = "project_name"
 dataset_id          = google_bigquery_dataset.example_dataset_alias.dataset_id
 table_id            = "example_table"
 deletion_protection = false
 labels              = { user = "Name of the engineer or the engineering team responsible for the view" }
 description         = "This materialized view is created as an example to demonstrate Terraform"


 materialized_view {
   query = file("queries/example_dataset/example_query.sql")
 }


 time_partitioning {
   type  = "DAY"
   field = "partitioning_column_name"
 }


 clustering = ["clustering_column_name_0", "clustering_column_name_1"]

}

Evaluation

The use of Terraform for our Looker-related BigQuery views has made our development and deployment of those views particularly robust. Having the Terraform configuration files in GitLab allows us to use our standard code review process for the queries and makes changes easy to track and revert if necessary. Furthermore, ownership and discoverability of resources has improved by the inclusion of owner labels in the view definition.

We are happy with the benefits the Terraform setup for BigQuery brings us and have continued to add more of our BigQuery resources (such as common functions we once defined manually without version control) to be managed by Terraform.

DEV Community

Why we use Terraform for BigQuery

What is Terraform?

Looker and the need for BQ views

Technical details

Evaluation

Top comments (0)