In this module, I'm going to take the data I wrote to Google Cloud Storage in the last post, "API to GCP with Mage", transform it, and write it to Google BigQuery. I'm writing up these tasks as the "learning in public" part of the DataTalksClub data engineering zoomcamp.
To review: I set up a VM instance in Google Cloud, downloaded Mage from GitHub, and I'm running it in Docker. On my computer, I can access it at localhost:6789, which I have defined in VSCode. GCP credentials were set up with a service account on Google Cloud and are accessed through a variable in the Mage config file, io_config.yaml. I describe these steps in the post "Configuring Google cloud storage for Mage".
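For context, the credentials hookup is just one line in io_config.yaml. Here's a rough sketch of the relevant part of the default profile; the key file path is a placeholder for wherever the service account key sits inside the container:

```yaml
# io_config.yaml -- 'default' profile; the key path below is a placeholder
version: 0.1.1
default:
  GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/keys/my-service-account.json"
```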
First, I created a new batch pipeline and named it gcs_to_bigquery. Then I created a Data loader block called load_taxi_gcs. In that file, I added my bucket name and set the object key to 'nyc_taxi_data.parquet'.
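The block body is close to the template Mage generates for a GCS loader (the import paths can differ slightly between Mage versions, and the bucket name below is a placeholder):

```python
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_taxi_gcs(*args, **kwargs):
    # Credentials come from the 'default' profile in io_config.yaml
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'my-taxi-bucket'        # placeholder -- your own bucket name
    object_key = 'nyc_taxi_data.parquet'  # the parquet file written in the last post

    return GoogleCloudStorage.with_config(
        ConfigFileLoader(config_path, config_profile)
    ).load(bucket_name, object_key)
```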
Then I added a Transformer block called transform_staged_data. Here, the instructor substituted underscores for spaces in the column names and made the column names lowercase. However, there weren't any column names with spaces, and the homework explicitly states that we're supposed to transform VendorID to vendor_id, so I did that step differently.
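I won't reproduce the instructor's exact code, but a minimal sketch of how I'd handle it: rename VendorID explicitly (it's camelCase, so the space-to-underscore rule never touches it), then apply the lowercase/underscore cleanup to everything else:

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform_staged_data(data: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # VendorID has no spaces, so rename it explicitly as the homework asks
    data = data.rename(columns={'VendorID': 'vendor_id'})

    # Then the instructor's cleanup: spaces to underscores, everything lowercase
    data.columns = data.columns.str.replace(' ', '_').str.lower()
    return data
```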
Finally, I created a Data exporter block in SQL called "write_taxi_to_bigquery". I set the connection to BigQuery, used the profile "default", named the schema "ny_taxi", and named the table "yellow_taxi_data". These settings are all part of the Data exporter block menu at the top of the file. I ran all three blocks, which created a table in BigQuery that I could see in my Google Cloud console.
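The SQL body itself is trivial, since the destination schema and table come from the block settings rather than the query. A sketch, assuming Mage exposes the upstream transformer's output as {{ df_1 }}:

```sql
-- Writes the transformer's output to ny_taxi.yellow_taxi_data,
-- as configured in the block settings above the editor.
SELECT *
FROM {{ df_1 }};
```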
If this sounds simple, that's because it is. Once the environment is set up and the Docker containers are running, creating blocks and running them is intuitive.