Thomas Brittain

Posted on • Updated on • Originally published at ladvien.com

Let's Build a Data Warehouse

The Reason

I am the lead data engineer at Bitfocus, the best SaaS provider for homeless management information systems (HMIS).

For many reasons, I've advocated with our executives to switch our analytics flow to use a data warehouse. Strangely, it worked. I'm now positioned to design the beast. This series will be my humble attempt to journal everything I learn along the way.

The Plan

I've decided it would be better if I learned how to build an analytics warehouse on my own, before committing our entire business to an untested technology stack. Of course, I needed a project similar enough it had the same challenges. I've decided to expose my Apple HealthKit data in a business intelligence platform like Looker.

[Diagram: data warehouse flow for exposing Apple Health data]

The Stack

I've attempted similar health data projects before.

But I'm going to be a bit more ambitious this time. I'd like to get all of the data generated by my Apple Watch and iPhone out of HealthKit and send them to a SQL database inside AWS.

The ingestion of these data will be lightly structured. That is, no transformations will be done until the data are in the database. I'll use the Auto Health Export iOS app to send data to a web API endpoint.

The web API endpoint will be created using the Serverless Framework. This should let me build a REST API on top of AWS Lambda functions, enabling the Auto Health Export app to synchronize data to the database periodically (as often as every 5 minutes, though the exact cadence is dictated by Apple).

Once the data are stored in the database, I'll use Dbt (Data Build Tool) to transform them for exposure in the business intelligence tool. This pattern of processing data inside the database is referred to as "extract, load, transform," or "ELT," which is fast becoming the standard over the classical "extract, transform, load," or "ETL."
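To make the ELT idea concrete, here's a minimal sketch in Python. I'm using sqlite3 as a stand-in for Postgres, and the table and field names (`raw_health`, `step_count`) are illustrative, not the actual schema:

```python
# ELT in miniature: load raw JSON untouched, then transform *inside* the
# database with SQL. sqlite3 stands in for Postgres here.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_health (payload TEXT)")  # raw landing table

# Extract + Load: store the payload exactly as received, no transformation.
payload = {"name": "step_count", "date": "2021-11-01", "qty": 9500}
conn.execute("INSERT INTO raw_health VALUES (?)", (json.dumps(payload),))

# Transform: shape the raw data in-database, using SQL's JSON functions.
conn.execute("""
    CREATE TABLE steps AS
    SELECT json_extract(payload, '$.date') AS date,
           json_extract(payload, '$.qty')  AS steps
    FROM raw_health
    WHERE json_extract(payload, '$.name') = 'step_count'
""")
print(conn.execute("SELECT date, steps FROM steps").fetchall())
# [('2021-11-01', 9500)]
```

The point is the ordering: the load step knows nothing about the data's shape, and all modeling decisions live in SQL that can be version-controlled and re-run.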

After transformation, I'll use Dbt to shove the data back into a database built for analytics query processing. Aka, a "data warehouse." The most popular analytics database technologies right now are Snowflake and Redshift. I'll use neither. I can't afford them and I despise how pushy their salespeople are. Instead, I'll use Postgres. (It's probably what Snowflake and Redshift are built on anyway.)

Lastly, I'll stand up Lightdash as the business intelligence tool.

Tools

Auto Health Export

As mentioned, I've attempted to pull data from the Apple ecosystem several times in the past. One challenge was writing an iOS app to pull HealthKit data from an Apple device and send it to a REST API. It's not hard; I'm just not an iOS developer, which makes it time-consuming. And I'm at the point in my tech career where I'm comfortable purchasing others' solutions to my problems.

Anyway, Auto Health Export is not a perfect app, but it can pull your recent HealthKit data and send it to a REST API endpoint.

Serverless

The team creating our SaaS product has talked about going "serverless" for a while. In an attempt to keep up, I decided to focus on creating Lambda and API Gateway services to receive the data from the Auto Health Export app.

While researching serverless architecture, I ran into the Serverless Framework, and I quite like it. It lets you focus on the Lambda code itself, building out the needed infrastructure for you on deployment.

AWS Lambda

The heart of the ingestion. This function will receive the health data as JSON and store it in the Postgres database for transformation.
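Here's a rough sketch of what that function might look like. The payload shape (`{"data": {"metrics": [...]}}`) is my assumption about Auto Health Export's JSON export, and I've swapped Lambda's `context` argument for an injected database connection so the sketch is testable; the real function would hold a psycopg2 connection to Postgres, with `%s` placeholders instead of sqlite's `?`.

```python
# Sketch of the ingestion Lambda: receive health metrics as JSON and store
# them untransformed (the "L" of ELT). Payload shape is assumed, not
# confirmed against Auto Health Export's docs.
import json
import sqlite3

def handler(event, conn):
    """Flatten the posted metrics into rows and insert them as-is."""
    body = json.loads(event["body"])
    rows = [
        (metric["name"], point.get("date"), point.get("qty"))
        for metric in body.get("data", {}).get("metrics", [])
        for point in metric.get("data", [])
    ]
    conn.executemany("INSERT INTO raw_metrics VALUES (?, ?, ?)", rows)
    conn.commit()
    return {"statusCode": 200, "body": json.dumps({"rows_inserted": len(rows)})}

# Exercise the handler against an in-memory stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_metrics (name TEXT, date TEXT, qty REAL)")
event = {"body": json.dumps({"data": {"metrics": [
    {"name": "heart_rate", "units": "bpm",
     "data": [{"date": "2021-11-01 08:00", "qty": 62}]}]}})}
response = handler(event, conn)
print(response)  # {'statusCode': 200, 'body': '{"rows_inserted": 1}'}
```

Note there's deliberately no validation or reshaping beyond flattening: anything smarter belongs downstream in the transformation layer.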

Postgres

I selected Postgres after a fair bit of research; I'll detail my reasons later in the series.

Regardless of tech choice, this database will hold untransformed data, or "raw data," as well as act as the processing engine for transforming the data before moving it into the analytics warehouse.

Dbt

Data Build Tool, or Dbt, is quickly becoming the de facto standard for transforming raw data into an analytics data warehouse. At its core, it uses SQL and Jinja to enable consistent and powerful transformation "models." These models rely on a scalable SQL database, known as a "processing engine," to transform the data before sending it on to its final destination: the analytics warehouse.

MariaDB with ColumnStore

I'd like to use MariaDB as the actual analytics data warehouse. It is an unconventional choice compared to Snowflake or Redshift. But, I'm attempting to use open-source software (OSS) as much as possible, as our company is having a horrific experience with the downfall of Looker.

One of the better-kept secrets about MariaDB is its ColumnStore engine. Four of its features convinced me to try it out:

  • It stores data by column, making analytics queries faster
  • The data store can be distributed across S3 buckets
  • Data are run-length encoded, greatly reducing storage size
  • ColumnStore tables can be joined with InnoDB tables
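To see why run-length encoding pays off for analytics data, here's a toy illustration (this is the general technique, not ColumnStore's actual on-disk format): a low-cardinality column, like a categorical health metric, collapses into a handful of (value, count) pairs.

```python
# Toy run-length encoding over a single column of values, to illustrate why
# column stores compress so well: long runs of repeated values collapse
# into a few (value, count) pairs.
from itertools import groupby

def rle_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

# A hypothetical heart-rate-zone column: 1,000 values, mostly "rest".
heart_rate_zone = ["rest"] * 500 + ["cardio"] * 40 + ["rest"] * 460
encoded = rle_encode(heart_rate_zone)
print(encoded)  # [('rest', 500), ('cardio', 40), ('rest', 460)]
print(len(heart_rate_zone), "values ->", len(encoded), "pairs")
assert rle_decode(encoded) == heart_rate_zone  # lossless round-trip
```

Row-oriented storage can't exploit runs like this, because adjacent values on disk belong to different columns.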

I'm also curious whether MariaDB could possibly be used as a processing engine, as I'd prefer to reduce the cognitive complexity of the stack by having only one SQL dialect.

Lightdash

At this point, you've probably gathered I'm not happy with what Google has done to Looker. That stated, Looker is still my favorite business intelligence tool. Luckily, there is a budding Looker alternative that meets my OSS requirement: Lightdash. I've never used it, let alone deployed it to production. I guess we'll see how it goes.

Terraform

You may have noticed that a data warehouse requires a lot of infrastructure. I hate spinning up infrastructure through a UI: it isn't repeatable, it isn't in version control, and I like writing code. Luckily, HashiCorp's got my back with Terraform, which lets you define and deploy infrastructure as code (IaC).

Terragrunt

Terragrunt is a wonderful addition to Terraform. It keeps Terraform configurations DRY, making it practical to maintain an entire company's infrastructure from a small set of reusable modules.

The Beginning

First up, let's take a look at the Auto Health Export app and our Serverless architecture. A couple of personal notes before jumping in.

I'm writing this series as a journal of what I learn. That stated, be aware I may switch solutions anywhere in our stack, at any time.

Also, I know a lot of stuff. But there's more stuff I don't know. If I make a mistake, please let me know in the comments. All feedback is appreciated, but respectful feedback is adored.
