This is week 3 of the data engineering zoomcamp. We started with an overview of two types of data processing, OLAP and OLTP. OLAP stands for Online Analytical Processing and OLTP stands for Online Transaction Processing. OLTP is used to run essential business operations in real time. It has a steady feed of data, initiated by users. The data is normalized, meaning it is organized to avoid redundancy. The tables are smaller and more quickly searchable.
OLAP is what we're looking at with the New York taxi data. There is a huge amount of data, not normalized. It's updated over longer periods with batch updates. The data is of interest to data analysts and data scientists.
A data warehouse is an OLAP solution. The way it works is you gather data from many sources - csv files, other databases, etc. You have a staging area where you transform the data, and then you put it in a data warehouse. The data warehouse then outputs various "data marts" which have different uses: analysis, reporting, and data mining, for example.
BigQuery is a serverless data warehouse. BigQuery being "serverless" means that users don't need to manage the underlying infrastructure or provision servers to run queries or manage data storage. BigQuery automatically scales resources to handle varying workloads. It can handle small queries from individual users to large-scale analytical workloads from enterprise applications without the need for manual intervention. This scalability ensures that users can run queries efficiently regardless of the size of the dataset.
BigQuery also stores public data. For example, it stores the GDELT database. GDELT stands for "Global Database of Events, Language, and Tone." It is a project that monitors the world's news media from nearly every corner of every country in multiple languages. GDELT analyzes the text of news articles, broadcasts, and other sources to identify events, people, organizations, locations, and themes, providing a comprehensive database for understanding global events, trends, and narratives over time. There's no simple way to access this data. I managed to find the GDELT data, and many others, but it's not straightforward. I suggest searching for "Google Public Datasets" and going from there.
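If you do track down a public dataset, querying it works the same as querying your own tables. Here's a minimal sketch, assuming the GDELT events table is published under the gdelt-bq project as gdeltv2.events (that path is my assumption, so verify it in the public datasets listing before relying on it):
-- Hypothetical table path; confirm it in the Google Public Datasets listing.
-- Note: SELECT * scans every column, which can be costly on large tables.
SELECT *
FROM `gdelt-bq.gdeltv2.events`
LIMIT 10;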
BigQuery has two kinds of pricing. The first is On Demand pricing, and the second is Flat Rate pricing. We will use On Demand pricing, which means we're charged only for the data our queries actually process.
If we have data in a GCP storage bucket, we can read it into BigQuery using a SQL command. In that case, we don't have to specify the schema; BigQuery figures out what the fields are. I tried this myself. I have a project already set up on Google Cloud, called data-engineering-2024-411821. First I created a bucket using Terraform. I had already done this once, but couldn't remember how to authenticate. I went back to the course and found that I had to authenticate using the terminal command
gcloud auth application-default login
After I did this, I ran
terraform apply
to create a new bucket. I suppose I could have done this by hand, but it was good to review. I uploaded one .csv file into the bucket, yellow_tripdata_2021-01.csv. Then in the BigQuery query editor I ran the command from week 3:
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
format = 'CSV',
uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);
I had to use my own project ID, dataset name, and bucket URI:
CREATE OR REPLACE EXTERNAL TABLE `data-engineering-2024-411821.ny_taxi.external_yellow_tripdata`
OPTIONS (
format = 'CSV',
uris = ['gs://data-engineering-2024-411821-bucket/yellow_tripdata_2021-*.csv']
);
And it worked the first time! Very exciting. Just like the instructor said in the video, BigQuery takes care of the schema. The inferred fields did not look excessive: for example, the integer fields were INTEGER rather than BIGINT, the floating point fields were just FLOAT, and the pickup and dropoff columns were TIMESTAMP rather than strings, which is how pandas had interpreted them.
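If you want to double-check what BigQuery inferred, you can list the column types from the dataset's INFORMATION_SCHEMA. A quick sketch using my project and dataset names (swap in your own):
-- List the inferred column names and types for the external table
SELECT column_name, data_type
FROM `data-engineering-2024-411821.ny_taxi.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'external_yellow_tripdata';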
This table is an external table that doesn't actually live on BigQuery. The data is stored in a Google Cloud Storage bucket. Next I'll make some tables that are stored in BigQuery.
Partitioning can really improve BigQuery performance. In most cases, we will want to partition by date, in order to speed up queries that filter on a particular date. This also reduces our cost, because we are processing less data.
To test this, I copied the external table to a "materialized" table in BigQuery, first as a non-partitioned table, then as a partitioned table.
CREATE OR REPLACE TABLE `data-engineering-2024-411821.ny_taxi.yellow_tripdata_non_partitoned` AS
SELECT * FROM `data-engineering-2024-411821.ny_taxi.external_yellow_tripdata`;

CREATE OR REPLACE TABLE `data-engineering-2024-411821.ny_taxi.yellow_tripdata_partitoned`
PARTITION BY
DATE(tpep_pickup_datetime) AS
SELECT * FROM `data-engineering-2024-411821.ny_taxi.external_yellow_tripdata`;
Since I only had one month's worth of data, I tested these tables using a query restricted to a narrow date range.
SELECT DISTINCT(VendorID)
FROM `data-engineering-2024-411821.ny_taxi.yellow_tripdata_non_partitoned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2021-01-02' AND '2021-01-03';
This took 806 msec and processed 20.15 MB of data. When I ran the same query on the partitioned table, it took 96 msec and processed less than 1 MB. So there's definitely a savings here.
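For reference, the partitioned-table version of the query is identical except for the table name:
SELECT DISTINCT(VendorID)
FROM `data-engineering-2024-411821.ny_taxi.yellow_tripdata_partitoned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2021-01-02' AND '2021-01-03';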
We can also get information about the partitions. Enter the query:
SELECT table_name, partition_id, total_rows
FROM `ny_taxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitoned'
ORDER BY total_rows DESC;
This will show the number of rows in each partition, starting with the one with the most trips. In this way we can also check for bias, where one partition has many more rows than the others.
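If you want a quick number to eyeball, one way (my own sketch, not from the course) is to compare the largest and smallest partitions from the same INFORMATION_SCHEMA view:
-- Compare the biggest and smallest partitions to spot skew
SELECT MIN(total_rows) AS smallest_partition, MAX(total_rows) AS largest_partition
FROM `ny_taxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitoned';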
The rest of the video talks about clustering, which is another way of subdividing a table. However I will wait to talk about that until my next post, which will be about partitioning and clustering in BigQuery.