DEV Community

loading...
Cover image for Using BigQuery Public Datasets to research the impact of COVID-19 🦠
Cube.js

Using BigQuery Public Datasets to research the impact of COVID-19 🦠

igorlukanin profile image Igor Lukanin ・10 min read

TL;DR: In this tutorial, we'll explore how to build an analytical application on top of Google BigQuery, a serverless data warehouse, and use a few public datasets to visualize the impact of the COVID-19 pandemic on people's lives. You'll also learn more about habits in your country and other countries.

Long story short, here's the application we're going to build:

Alt Text

And not only for the United States but for every country.

What is Google BigQuery?

BigQuery is a serverless big data warehouse available as a part of Google Cloud Platform. It's highly scalable, meaning that it can process tiny datasets as well as petabytes of data in seconds, using more cloud capacity as needed. (However, due to BigQuery's distributed architecture, you can't possibly expect it to have a sub-second query response time.)

BigQuery has a gentle learning curve, in part due to its excellent support for SQL, although (big surprise!) we won't be writing SQL queries in this tutorial.

BigQuery also has a free usage tier: you get up to 1 TB of processed data per month and some free credits to spend on Google Cloud during the first 90 days. You can probably guess that BigQuery is billed by the amount of processed data.

Alt Text

BigQuery web console in Google Cloud, with the most important information being: "Query complete (2.3 sec elapsed, 2.1 GB processed)."

So, let's see what datasets are waiting to be explored! 🦠

What are BigQuery Public Datasets?

BigQuery public datasets are made available without any restrictions to all Google Cloud users. Google pays for the storage of these datasets. You can use them to learn how to work with BigQuery or even build your application on top of them, exactly as we're going to do.

You could query them just if they were your own. However, always pay attention to the license and other relevant info, like update frequency and last update date. Unfortunately, some datasets are really outdated:

Alt Text

So, what to expect? Some of these 212 public datasets are quite interesting:

COVID-19 Government Response Tracker. This dataset is maintained by the University of Oxford Blavatnik School of Government. It tracks policy responses to COVID-19 from governments around the world. Basically, all lockdowns, curfews, and workplace closures worldwide are registered in this dataset.

Google Community Mobility Reports. This dataset is maintained by Google. It provides insights into what has changed in people's habits and behavior in response to policies aimed at combating COVID-19. It reports movement trends over time by geography, across different retail and recreation categories, groceries and pharmacies, parks, transit stations, workplaces, and residential.

We can use both datasets to visualize and correlate the time measures against COVID-19 with changes in social mobility. Here's how it might look like:

Alt Text

For that, we need to create an analytical API over BigQuery and a web application talking to that API. So, let's get hacking! 🦠

How to Create an Analytical API

Why do we need an API in the first place?

The most obvious reason is that BigQuery can't provide a sub-second query response time, meaning that an application that talks directly to BigQuery will have a suboptimal user experience. Also, BigQuery bills you by the amount of transferred data, so if you have a popular app, you might suddenly know about that from a billing alert.

Also, direct interaction with BigQuery means that you'll need to write SQL queries. There's nothing wrong with SQL; it's a great domain-specific language, but having SQL queries all over your codebase smells like a leaky abstraction — your application layers will know about column names and data types in your database.

So, what are we going to do? In this tutorial, we'll use Cube.js:

GitHub logo cube-js / cube.js

📊 Cube.js — Open-Source Analytical API Platform

Cube.js is an open-source analytical API platform, and it allows you to create an API over any database, BigQuery included.

Cube.js provides an abstraction called a "semantic layer," or a "data schema," which encapsulates database-specific things, generates SQL queries for you, and lets you use high-level, domain-specific identifiers to work with data.

Also, Cube.js has a built-in caching layer that provides predictable, low-latency response query times. It means that an API built with Cube.js is a perfect middleware between your database and your analytical app.

Let's try it in action.

The first step is to create a new Cube.js project. Here, I assume that you already have Node.js installed on your machine. Note that you can also use Docker to run Cube.js. Run in your console:

npx cubejs-cli create bigquery-public-datasets -d bigquery
Enter fullscreen mode Exit fullscreen mode

Now you have your new Cube.js project in the bigquery-public-datasets folder containing a few files. Let's navigate to this folder.

The second step is to add BigQuery and Google Cloud credentials to the .env file. Cube.js will pick up its configuration options from this file. Make sure your .env file looks like this:

# Cube.js environment variables: https://cube.dev/docs/reference/environment-variables

CUBEJS_DB_TYPE=bigquery
CUBEJS_DB_BQ_PROJECT_ID=your-project-id
CUBEJS_DB_BQ_KEY_FILE=./your-key-file-name.json

CUBEJS_DEV_MODE=true
CUBEJS_API_SECRET=SECRET
Enter fullscreen mode Exit fullscreen mode

Here's what all these options mean and how to fill them:

  • Obviously, CUBEJS_DB_TYPE says we'll be connecting to BigQuery.
  • CUBEJS_DB_BQ_PROJECT_ID should be set to the identifier of your project in Google Cloud. First, go to the web console of Google Cloud. Create an account if you don't have one. Then go to the new project creation page and create one. Your project identifier is just below the name text field:

Alt Text

  • CUBEJS_DB_BQ_KEY_FILE should be set to the key file name for your Google Cloud user that will connect to BigQuery. It's better to use a service account, a special kind of Google Cloud account with restricted access. Go to the service account creation page and create one. On the second step, you'll be asked to specify the roles for this service account. The only roles needed for read-only access to public datasets are BigQuery Data Viewer and BigQuery Job User. After the user is created, you need to add a new authentication key — use the ... button on the right to manage the keys for this account and add a new one of JSON type. The key file will be automatically downloaded to your machine. Please put it in the bigquery-public-datasets folder and update your .env file with its name.

Alt Text

  • The rest of the options configure Cube.js and have nothing to do with BigQuery. Save your .env file.

The third step is to start Cube.js. Run in your console:

npm run dev
Enter fullscreen mode Exit fullscreen mode

And that's it! Here's what you should see:

Alt Text

Great, the API is up and running. Let's describe our data! 🦠

How to Define a Data Schema

Before we can explore the data, we need to describe it with a data schema. The data schema is a high-level domain-specific description of your data. It allows you to skip writing SQL queries and rely on Cube.js query generation engine.

Create two schema files with the following contents: take schema/Measures.js from this file, and schema/Mobility.js from that file. Here is a redacted version of the first file with a few interesting things:

cube(`Measures`, {
  sql: '
    SELECT *
    FROM `bigquery-public-data.covid19_govt_response.oxford_policy_tracker`
  ',

  measures: {
    confirmed_cases: {
      sql: `confirmed_cases`,
      type: `max`
    },

    cancelPublicEvents: {
      sql: `CAST(cancel_public_events AS NUMERIC)`,
      type: `max`
    },
  },

  dimensions: {
    country: {
      sql: `country_name`,
      type: `string`
    },

    date: {
      sql: `TIMESTAMP(${Measures}.date)`,
      type: `time`
    },
  },
});
Enter fullscreen mode Exit fullscreen mode

Note that:

  • in this data schema, you describe an analytical cube
  • it contains the data retrieved via a straightforward sql query
  • you can define measures, i.e., numerical values to be calculated
  • measures are calculated using various functions, such as max
  • you can define dimensions, i.e., attributes for which the measures are calculated
  • dimensions can have different data types, such as string or time
  • in measure and dimension definitions, you can use BigQuery functions, e.g., CAST(... AS NUMERIC) or TIMESTAMP

And here's a part of another file:

cube(`Mobility`, {
  sql: '
    SELECT *
    FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
  ',

  measures: {

  },

  dimensions: {

  },

  joins: {
    Measures: {
      sql: `${Measures}.country_name = ${Mobility}.country_region AND
            ${Measures}.date = ${Mobility}.date`,
      relationship: `hasOne`,
    }
  }
});
Enter fullscreen mode Exit fullscreen mode

Here you can see that our two cubes, based on different tables from different BigQuery datasets, are joined together with join, where a join condition is provided as an SQL statement. Cube.js takes care of the rest.

Now we have the data schema in place, and we can explore the data! 🦠

How to Explore the Data

As the console output suggests, let's navigate to localhost:4000 and behold Cube.js Developer Playground. It provides a lot of features, but we have a clear path to follow. Let's build a query.

To do so, go to the "Build" tab, click "+ Measure," and select a measure. For example, select Measures Confirmed Cases. As you can see, the Measures Date time dimension has been automatically selected, and the chart below displays the count of confirmed COVID-19 cases over time. You can even "+ Filter" by Measures Country, use the "equals" condition, and put your own country's name into the filter field. Looks interesting, right?

Alt Text

Definitely feel free to experiment and try your own queries, measures, dimensions, time dimensions, granularities, and filters.

After that, let's move on and build an analytical app! 🦠

How to Build an Analytical App

It's worth noting that Cube.js Developer Playground has one more feature to explore.

If you go to the "Dashboard App" tab, you'll be able to generate the code for a front-end application with a dashboard. There're various templates for different frameworks (React and Angular included) and charting libraries there. Still, you can always choose to "create your own," and if you choose a "dynamic" template, you'll be able to compose queries and add charts just like you did.

Alt Text

However, we'll choose a much simpler way to go from zero to a full-fledged analytical app — we'll grab the code from GitHub:

  • first, download this dashboard-app.zip file
  • unzip it to your bigquery-public-datasets folder
  • run yarn and yarn start (or npm install and npm start)

You should be all set! Navigate to localhost:3000 and have a look at this app:

Alt Text

Choose your country and take your time to explore the impact of COVID-19 and how mitigation measures correlate with social mobility.

Let's take Israel. You can clearly see three waves and the positive effect of "stay at home" requirements — after they are introduced, every wave spreads with lesser speed.

Alt Text

Let's take Germany. You can see how Germans interact with the rules: after the first "stay at home" requirements are lifted, park activity grows, and after the second "stay at home" requirements are introduced, parks instantly become deserted.

Alt Text

Let's take Singapore. Obviously enough, you can see Singapore doing a great job containing the virus. The third wave is nearly unexistent.

Alt Text

What are your own insights? Please share them in the comments!

And now, let's explore a few crucial parts of this app to understand better how it works and, more specifically, how it retrieves data from Cube.js API.

First, as you can see from package.json, it's obviously a React app created with the create-react-app utility. It has an index.js as an entry point and the App root component.

Second, it references @cubejs-client/core and @cubejs-client/react packages as dependencies. Here's what you can see in the api.js file:

// Let's use Cube.js client library to talk to Cube.js API
import cubejs from '@cubejs-client/core'

// API URL and authentication token are stored in .env file 
const cubejsApi = cubejs(
    process.env.REACT_APP_CUBEJS_TOKEN,
    { apiUrl: `${process.env.REACT_APP_API_URL}/cubejs-api/v1` }
);

// The simplest Cube.js query possible:
// "Hey, Cube.js, give us a list of all countries."
const countriesQuery = {
    dimensions: [ 'Mobility.country' ]
}

export function loadCountries(callback) {
    // cubejsApi.load returns a promise.
    // Once it's resolved, we can get the result.
    // We can even transform it with tablePivot() or chartPivot()
    cubejsApi
        .load(countriesQuery)
        .then(result => {
            const countries = result
                .tablePivot()
                .map(row => row['Mobility.country'])

            callback(countries)
        })
}
Enter fullscreen mode Exit fullscreen mode

Believe it or not, that's the bare minimum we should know about working with Cube.js REST API in the front-end apps. You import a client library, you compose your query as a JSON object, you load the result asynchronously, and you do whatever you want with the data.

In this application, the data is visualized with Chart.js, a great data visualization library. However, you can choose any library you're familiar with. And maybe your app will look even better than this one:

Alt Text

And that's all, folks! 🦠 I hope you liked this tutorial 🤗

Here's just a few things you can do in the end:

  • go to the Cube.js repo on GitHub and give it a star ⭐️
  • share a link to this tutorial on social media or with a friend 🙋‍♀️
  • share your insights about the impact of COVID-19 in the comments ↓

Discussion (0)

pic
Editor guide