Edwin Kinyao

Mastering Big Data with GCP: My Capstone Journey in Cloud Data Analysis

Introduction

As a data enthusiast, I’ve always been fascinated by the power of cloud platforms to transform raw data into actionable insights. Recently, I completed a capstone project using Google Cloud Platform (GCP) that put my skills to the test. My task? Help a fictional fintech startup, TheLook Fintech, leverage BigQuery and Looker to tackle critical business questions about loan performance and borrower behavior. In this blog, I’ll walk you through my journey—from collecting and processing data to building a sleek dashboard—and share the lessons I learned along the way.

Whether you’re a data analyst, a cloud newbie, or just curious about BigQuery and Looker, this post will give you a front-row seat to a real-world data project.


The Scenario: A Fintech Startup’s Data Challenge

Imagine you’re a cloud data analyst hired by TheLook Fintech, a growth-stage startup revolutionizing loans for online store owners. The Treasury department, led by Trevor, needs your help to monitor cash flow, understand why customers borrow, and track loan distribution across regions. Later, they want a dashboard to keep tabs on loan health. My mission was clear: use GCP tools to collect, process, and analyze data, then visualize the results.

The project unfolded in two parts:

  1. BigQuery Workflow: Collecting, processing, and storing loan data to answer three key questions.
  2. Looker Dashboard: Building visualizations to monitor loan health metrics.

Here’s how I tackled it.


Part 1: Collecting, Processing, and Storing Data in BigQuery

The first leg of the project was all about getting hands-on with BigQuery, GCP’s serverless data warehouse. My goal was to answer three business questions:

  • How can we monitor cash flow to ensure loan funding doesn’t exceed incoming payments?
  • What are the top reasons customers take out loans?
  • Where are borrowers located geographically?

Step 1: Setting Up the BigQuery Environment

I started by creating a BigQuery dataset to house the loan data. This involved setting up tables and ensuring the schema aligned with the fintech’s needs—think columns for loan amounts, purposes, dates, and borrower locations.
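
To make that concrete, here's a minimal sketch of the setup in BigQuery DDL. The dataset, table, and column names are my own illustrative choices, not the exact capstone schema:

```sql
-- Hypothetical project/dataset names; adjust to your own environment.
CREATE SCHEMA IF NOT EXISTS `my-project.thelook_fintech`;

CREATE TABLE IF NOT EXISTS `my-project.thelook_fintech.loans` (
  loan_id        STRING,
  loan_amount    NUMERIC,
  purpose        STRING,
  issue_date     DATE,
  borrower_state STRING
);
```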


Step 2: Exploring the Loan Data

With the data loaded, I ran exploratory SQL queries to get a feel for it.
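
The first passes were simple sanity checks, along these lines (table name hypothetical):

```sql
-- Peek at a handful of rows.
SELECT *
FROM `my-project.thelook_fintech.loans`
LIMIT 10;

-- Confirm the row count and the date range of the data.
SELECT
  COUNT(*)        AS total_loans,
  MIN(issue_date) AS earliest_issue,
  MAX(issue_date) AS latest_issue
FROM `my-project.thelook_fintech.loans`;
```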


For cash flow, I calculated money in (loan repayments) versus money out (loan issuances). For loan purposes, I dug into a nested field in the application data to extract reasons like “inventory purchase” or “business expansion.”
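
In spirit, those queries looked something like this; every table and column name here stands in for the real schema:

```sql
-- Money in (repayments) vs. money out (issuances), assuming separate
-- payments and loans tables.
SELECT
  (SELECT SUM(payment_amount) FROM `my-project.thelook_fintech.payments`) AS money_in,
  (SELECT SUM(loan_amount)    FROM `my-project.thelook_fintech.loans`)    AS money_out;

-- Flattening a nested, repeated field to extract loan reasons.
SELECT a.application_id, reason
FROM `my-project.thelook_fintech.applications` AS a,
     UNNEST(a.loan_reasons) AS reason;
```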


For locations, I aggregated loans by state.
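
That aggregation is a straightforward GROUP BY (again with hypothetical names):

```sql
SELECT
  borrower_state,
  COUNT(*)         AS loan_count,
  SUM(loan_amount) AS total_amount
FROM `my-project.thelook_fintech.loans`
GROUP BY borrower_state
ORDER BY total_amount DESC;
```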

Step 3: Importing Additional Data

Trevor needed a deeper geographic breakdown, so I imported a CSV file with state classifications into BigQuery and converted it into a standard table with a CREATE TABLE AS SELECT statement, a simple but effective approach.
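
The statement looked roughly like this; the raw table name is whatever you assign at load time, so treat these names as placeholders:

```sql
-- Materialize the uploaded CSV as a standard BigQuery table.
CREATE TABLE `my-project.thelook_fintech.state_region` AS
SELECT state, region
FROM `my-project.thelook_fintech.state_region_raw`;
```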


Step 4: Joining Tables

Next, I joined the loan data with the state classification table using a JOIN clause in SQL. This enriched the dataset, letting me map loans to regions and spot geographic trends.
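
A sketch of that join, using the same hypothetical names as above:

```sql
SELECT
  l.loan_id,
  l.loan_amount,
  l.borrower_state,
  r.region
FROM `my-project.thelook_fintech.loans` AS l
JOIN `my-project.thelook_fintech.state_region` AS r
  ON l.borrower_state = r.state;
```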


Step 5: Cleaning Up with Deduplication

The loan purpose data had duplicates, so I used a DISTINCT query to clean it up. This ensured accurate reporting on why borrowers were seeking funds.
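
Something like this, assuming the purposes live in their own table:

```sql
-- Keep one row per loan/purpose pair.
SELECT DISTINCT loan_id, purpose
FROM `my-project.thelook_fintech.loan_purposes`;
```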

Step 6: Aggregating Loan Amounts by Year

Finally, I created a table with a GROUP BY query to sum loan amounts by issuance date and year. This gave Trevor a clear view of lending trends over time—crucial for cash flow monitoring.
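
The shape of that query, with hypothetical names:

```sql
-- Total amount lent per issuance year, saved as its own table.
CREATE TABLE `my-project.thelook_fintech.loans_by_year` AS
SELECT
  EXTRACT(YEAR FROM issue_date) AS issue_year,
  SUM(loan_amount)              AS total_issued
FROM `my-project.thelook_fintech.loans`
GROUP BY issue_year;
```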


By the end, I had a polished dataset ready for analysis, stored efficiently in BigQuery.


Part 2: Visualizing Insights with Looker Enterprise

With the data prepped, Trevor threw a new challenge my way: create a dashboard in Looker to track loan health. He wanted answers to four questions:

  • What’s the total outstanding loan amount?
  • What percentage of loans fall into each status (e.g., current, late, default)?
  • Which states have the most outstanding loans?
  • Which customers own their homes outright and have current loans?

Task 1: Getting Started with Looker

I kicked things off by connecting Looker to my BigQuery dataset. Looker’s intuitive interface made it easy to define a data model that mapped to my tables.

Task 2: Total Outstanding Loan Amount

For the first visualization, I built a single-value card showing the sum of all outstanding balances. A quick SUM measure in LookML, paired with a filter for unpaid loans, did the trick.
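
Under the hood, Looker generates SQL against BigQuery, and the equivalent query is roughly this (status values and column names assumed):

```sql
SELECT SUM(outstanding_balance) AS total_outstanding
FROM `my-project.thelook_fintech.loans`
WHERE loan_status != 'Fully Paid';
```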


Task 3: Loan Status Breakdown

Next, I created a pie chart to display the percentage of loans by status. I grouped the data by status categories such as Current, Late, and Default, then used Looker's percentage calculation to show the distribution. This was a game-changer for spotting risk areas.
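
The SQL behind a chart like this boils down to a windowed percentage (names assumed):

```sql
SELECT
  loan_status,
  COUNT(*) AS loans,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_loans
FROM `my-project.thelook_fintech.loans`
GROUP BY loan_status;
```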


Task 4: Top States with Outstanding Loans

I crafted a bar chart highlighting the top 10 states by loan count. A COUNT measure, sorted in descending order, and a limit of 10 gave Trevor a clear view of geographic concentration.
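
In SQL terms, the chart is equivalent to this (hypothetical names):

```sql
SELECT
  borrower_state,
  COUNT(*) AS outstanding_loans
FROM `my-project.thelook_fintech.loans`
WHERE loan_status != 'Fully Paid'
GROUP BY borrower_state
ORDER BY outstanding_loans DESC
LIMIT 10;
```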


Task 5: Homeowners with Current Loans

For the final visualization, I built a table listing customers who own their homes outright and have “Current” loans. I filtered by homeownership status and loan status, then sorted by income to spotlight high earners.
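
The equivalent filter-and-sort in SQL, with assumed column names and values:

```sql
SELECT
  customer_id,
  annual_income,
  home_ownership,
  loan_status
FROM `my-project.thelook_fintech.loans`
WHERE home_ownership = 'OWN'
  AND loan_status = 'Current'
ORDER BY annual_income DESC;
```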


Task 6: Polishing the Dashboard

To make the dashboard interactive, I enabled cross-filtering—clicking a state in the bar chart filters the other visuals. I also set a daily refresh rate to keep the data fresh. The result? A sleek, user-friendly tool Trevor’s team could rely on.



The Final Dashboard

Here’s what the dashboard looked like:

  • Card: Total outstanding amount ($3.08B).
  • Pie Chart: Loan status percentages.
  • Bar Chart: Top 10 states by loan count.
  • Table: Homeowning customers with current loans.


It was a proud moment seeing it all come together—a testament to the power of combining BigQuery’s data crunching with Looker’s visualization prowess.


Conclusion

My journey with TheLook Fintech’s data was a crash course in using GCP to tackle real-world challenges. BigQuery made it easy to handle large datasets, while Looker brought the insights to life. If you’re looking to break into cloud data analysis, I can’t recommend this kind of hands-on project enough—it’s the perfect way to build skills and confidence.

I’m excited to explore more advanced GCP features like Dataflow or AI Platform. For now, I’d love to hear your thoughts—have you worked with BigQuery or Looker? Drop a comment below!

