
Guillaume Legoy

Originally published at boringdatascience.com

Taking Ownership of Our Data Pipelines: Analytics Engineering Using dbt

Analytics engineering: a new paradigm for data extraction and manipulation for analysts and data scientists


One of the main components of our job as analysts and data scientists is writing SQL. To find insights and generate value we query databases several times a day. And because data is messy, poorly documented, and sometimes even wrong or missing, writing SQL becomes a time-consuming endeavor and an ongoing iterative process. We also often have to repeat the same analysis and data extraction over time to answer similar business needs. We always swear we are someday going to automate these processes with the help of some beautiful Python script, but never end up doing it for lack of time (I know, I've been there). Or sometimes we don't have the data in the format we need and have to ask the database engineers to prepare specific tables for us, meaning more delays and unhappy business stakeholders. Before I present a solution and concrete examples, let's highlight the main pitfalls of our SQL activities:

  • Lack of proper version control: scripts sleep on our computers with beautiful names like analysis_finalv3.2_product.sql, or at best in a remote repository that's only kind of maintained.
  • No automation: recreating an analysis is manual and time-consuming.
  • Poor collaboration: because our SQL is suboptimal and partly hidden, other analysts can't access it and have to copy-paste or reinvent the wheel.
  • No testing: we don't really know if the data we queried 3 months ago is still valid or behaves the same way. Even worse, hidden bugs like duplicated rows could make our aggregate reporting inaccurate, hurting the business.
  • No documentation: a data bug we have worked around in a personal script, without sharing the fix, means the same report built from the same data can return different numbers.

All of the above can create one of the worst situations you can encounter as a data scientist: a lack of trust in the data from the business stakeholders who depend on your analysis to make informed decisions.

But what's the solution? I'm glad you asked, because it lets me introduce the work of the analytics engineer. Behind this somewhat fancy title lies one of the core competencies of any good data analyst/scientist: autonomy. I believe that as data professionals we need to take ownership of our own data processing, from the raw data to the end result (a report, a visualization, or even machine learning). We must not rely solely on software engineers to do the dirty work for us. Managing databases is hard, and we still need engineers to make sure the data is actually available and the infrastructure robust, but there is a lot we can and should do by ourselves, even with messy, unorganized data. This is the T in ETL: we transform existing data into something actionable. Analytics engineering is therefore not a completely different discipline; it's simply an extension of our current skill set that makes us more independent. And the beauty of it is that we only need a working knowledge of SQL to get started. If you want to read more about the analytics engineer, head over here.

Introducing dbt


dbt (data build tool) is an open source Python library that leverages the power of SQL to perform data transformation. It requires an existing data warehouse and from there allows you to write code (in SQL) that gets compiled and then run against your database. It therefore assumes more of an ELT structure, where the Extraction and Loading of the data happen beforehand and leave you with the Transform step. For more information read this. Let's now see how it works in practice. dbt is extremely rich in features, so I'm going to cover the basics in this post and expand on the more in-depth functionalities in later articles. Before we dive in, let's highlight some of the benefits dbt provides:

  • Simplicity. As we will see it hides most of the complicated steps needed to create tables and allows us to focus on delivering value through data transformations.

  • Version control. No more copy pasting, business logic is centralized and shared across the organization.

  • Environment management. Managing different environments and accounts (dev, testing, staging, production) is a breeze through the use of profiles to access the database and schemas.

  • Automation. By using what dbt calls macros we can abstract functionalities and bits of SQL instead of repeating them in different places of our codebase. This makes us respect the DRY (Don't Repeat Yourself) principle and saves a lot of time and headaches.

  • Testing capabilities. Last but probably one of the most important. As data scientists and analysts we don't usually test our SQL code because it's complicated to do in an automated way: we check that a subset of our SQL result looks okay and call it a day. Small issues (like duplicated rows or a missing value here and there) are hard to find but can have dire consequences. dbt provides testing tools out of the box, as the short sketch after this list shows. It's so important and useful I will dedicate a whole article to it later on, but if you already want to read about it, check here.
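
To give you a quick taste of what those tests look like, here is a minimal, hypothetical schema.yml for the model we build later in this post (the columns you choose to test are up to you):

  version: 2

  models:
    - name: region_keyword_distributions
      columns:
        - name: keyword
          tests:
            - not_null
        - name: keyword_perc_of_total_cards
          tests:
            - not_null

You would typically save this next to the model (e.g. models/schema.yml) and run pipenv run dbt test --profiles-dir=. to check these assertions against the data in your warehouse.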

Requirements


  • Python 3, follow these instructions if you don't have it.
  • Pipenv to manage Python dependencies. I wrote an introductory post about pipenv and the installation procedure is available in the Code Vault.
  • dbt. Simply run the following in your project folder:
  pipenv install dbt
  • Finally, it's difficult to explain a new concept without a concrete example and visuals of what's happening. In this post I'll use a pet project of mine that you can find and clone at this repo link. The idea is to create a small database of cards from the newly released game Legends of Runeterra by Riot Games, and to build a new transformed table out of it for analysis purposes. You don't have to understand or even be interested in any of it to follow the steps I present below. In parallel I created a database on AWS to host the data. You can't access it, but I'll show you what's happening with screenshots.

Breaking down a dbt command & main concepts


In order to understand how dbt works, we are going to break down the main command used to execute code and transform data inside your warehouse. For each concept we will review how it works and when it can be used, and in the end we will see it in action (through screenshots of code executed against my personal database). While dbt offers many features we won't cover for now, I hope that by the end of this article you'll understand how all of this fits together and will be able to get started by yourself (with the help of dbt's great documentation).

Here is a typical dbt command. Don't worry if you don't understand any of it for now; simply remember that dbt takes raw data, applies your transformations, and materializes them into new tables or views:

  pipenv run dbt run --profiles-dir=. --target=profile_dev --full-refresh --models region_keyword_distributions

And the screenshot below shows the usual dbt project structure, which can be found in my pet project under explorer/dbt/ (you can generate the same structure by running dbt init [project-name] in a new folder):

[Screenshot: dbt project folder structure]
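
If you can't see the screenshot, the layout looks roughly like this (simplified, based on the files mentioned in this post):

  explorer/dbt/
  ├── dbt_project.yml
  ├── profiles.yml
  ├── packages.yml
  ├── models/
  │   ├── keywords.sql
  │   └── region_keyword_distributions.sql
  ├── data/
  │   └── cards.csv
  ├── macros/
  └── dbt_modules/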

Note: for better understanding and clarity, every file path I mention below is relative to the dbt folder (in short, I omit the explorer/dbt part and assume we are already in this folder).

Targets and profiles

dbt uses targets, defined inside profiles, to determine where our generated SQL will be stored. Simply put, we ask dbt to send the result of our transformations to a specific database/schema. Targets and profiles are handled by profiles.yml. What is this weird .yml extension, you ask? dbt uses YAML to handle configuration files like this one; YAML is a human-readable configuration format (a superset of JSON). The entire project is therefore a mix of .sql and .yml files. By the way, even though you can see this profiles.yml file in the above screenshot of the project structure, it's not available in the GitHub repository, because this file contains usernames and passwords. So unless you are using environment variable placeholders, never ever push this file to a remote GitHub repository. Seriously. Your IT department is going to be mad and we don't want that. Simply add it to .gitignore so it's ignored by git. This is what the file looks like:

  explorer:
    target: profile_dev
    outputs:
      profile_dev:
        dbname: XXXXX
        host: XXXXXXX.us-east-1.rds.amazonaws.com
        pass: 'XXXXXX'
        port: 5432
        schema: 'XXXXXX'
        threads: 4
        type: redshift
        user: 'XXXXXX'

This YAML file is what --profiles-dir=. --target=profile_dev refers to in the dbt command above. It contains the database credentials for the profile_dev target, which is part of the explorer profile and is also the default one, as specified by the target key. Under outputs you can have as many targets as you want, which is often the case in a "real" setting: one for each of dev, unit tests, integration tests, and production, for instance. The first part, --profiles-dir=., when executed in the dbt folder tells dbt to look for the profiles file in ., which means the current directory. You can change this depending on where your profiles.yml file is located; if not specified, dbt will look by default in ~/.dbt/ on your machine. However, I prefer to have it as part of my project structure, as it makes it easier to manage in production (but remember, do not push passwords remotely!). Last step: you need to specify which profile dbt should be using in dbt_project.yml under the profile key.
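
For reference, that part of dbt_project.yml is a single line. A minimal sketch (the project name here is just an illustration):

  name: explorer
  version: '1.0'
  profile: explorer

The value of profile must match the top-level key in profiles.yml (explorer in our case).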

This may look a bit cumbersome to set up, and it's definitely overkill in my simple example project, but it really shines in a real production environment or collaborative project. The ability to have multiple targets and profiles is really helpful for managing environments, as you can execute the same SQL against two different schemas (test and production, for instance) by simply changing the target on the command line.
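To make that concrete, here is a hedged sketch of a profiles.yml with two targets that reads credentials from environment variables via dbt's env_var() function (the variable names, schemas, and the profile_prod target are illustrative):

  explorer:
    target: profile_dev
    outputs:
      profile_dev:
        type: redshift
        host: "{{ env_var('DBT_HOST') }}"
        user: "{{ env_var('DBT_USER') }}"
        pass: "{{ env_var('DBT_PASSWORD') }}"
        dbname: explorer
        schema: dev
        port: 5432
        threads: 4
      profile_prod:
        type: redshift
        host: "{{ env_var('DBT_HOST') }}"
        user: "{{ env_var('DBT_USER') }}"
        pass: "{{ env_var('DBT_PASSWORD') }}"
        dbname: explorer
        schema: analytics
        port: 5432
        threads: 4

Switching environments is then just a matter of passing --target=profile_dev or --target=profile_prod to the same dbt command.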

Models and Macros

Models are the bread and butter of your dbt project. This is where dbt truly shines, because everything is a SELECT. You read that right: no need to write complex CREATE statements, delete records, or drop tables; dbt handles it all for you. You can therefore focus on the parts that generate value: transforming data and answering business needs. Models live in the models folder, where dbt automatically finds them upon execution if none is specified. Thus the --models region_keyword_distributions part of our original command tells dbt to execute the region_keyword_distributions model only; leaving this argument out executes every model found in the models folder. In our example project, region_keyword_distributions is located at models/region_keyword_distributions.sql. Models are .sql files, and this is what it looks like:

  {{ config(materialized='table') }}

  SELECT
     k.keyword,
     c.region,
     COUNT(*) as region_cards_count,
     SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END) AS keyword_occurence,
     (SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END)::decimal / COUNT(*)::decimal)::decimal(4, 2) AS keyword_perc_of_total_cards
  FROM
      {{ ref('keywords') }} k
          CROSS JOIN {{ ref('cards') }} c
  WHERE
      keyword <> ''
  GROUP BY
      k.keyword,
      c.region

The first thing to notice is the config macro between {{ }} in the header. A macro is a SQL snippet that behaves like a function. This is super useful because it lets us reuse the same lines of SQL across files without copy-pasting them: by concentrating these SQL snippets into one file, we only have one place to modify when the logic changes, which makes mistakes easier to avoid. This post is an introduction, so I won't cover macros in depth, but feel free to read about them here. The macro on our first line is part of the set of macros dbt offers out of the box, but more can be found here. These macros are awesome and allow for a lot of customization and automation. In order to install the dbt_utils package, head over to the packages.yml file:

  packages:
    - package: fishtown-analytics/dbt_utils
      version: 0.2.3

In this .yml configuration file you specify the package location and version. Then simply run pipenv run dbt deps to install it. Afterwards the package contents can be found in dbt_modules.
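
You can also write your own macros; by default dbt looks for them in the macros folder. As a hypothetical sketch (this macro is not in the example repo), the repeated keyword-matching expression from our model could be factored out like this:

  -- macros/keyword_flag.sql (hypothetical)
  {% macro keyword_flag(keywords_column, keyword_column) %}
      CASE WHEN {{ keywords_column }} LIKE '%' || {{ keyword_column }} || '%' THEN 1 ELSE 0 END
  {% endmacro %}

A model would then call it with SUM({{ keyword_flag('c.keywords', 'k.keyword') }}) AS keyword_occurence, keeping the matching logic in a single place.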

Let's circle back to the config macro found on the first line of our model. It simply tells dbt we want the output of our SQL to be a table. Other options are view or ephemeral (ephemeral models are used as a step to build another table and are then discarded, like a CTE if you are familiar with the concept).
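
Switching materializations is just a matter of changing that header; for example (a sketch, you would use one per model):

  {{ config(materialized='view') }}       -- build a view instead of a table
  {{ config(materialized='ephemeral') }}  -- interpolated as a CTE into downstream models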

The second macro appears in the FROM clause. {{ ref('keywords') }} tells dbt to query the keywords table. Why not write the table name directly, then? Because doing it this way lets dbt understand that the current model should only be executed once the keywords table, which our query depends on, has been created. The same goes for the cards table with {{ ref('cards') }}. Dependencies between tables are automatically handled this way: our other models at models/keywords.sql and data/cards.csv (more on this type of model below) will therefore be built first.
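
This dependency graph also powers dbt's model selection syntax: prefixing a model with + selects it together with its upstream models. For instance, something like the following should rebuild keywords and then region_keyword_distributions in one go (seeds are still loaded separately with dbt seed, covered below):

  pipenv run dbt run --profiles-dir=. --models +region_keyword_distributions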

Incremental models

Remember that above we mentioned different materializations can be chosen in the model header. In addition to the table, view and ephemeral options we already covered, one of the most useful features of dbt is the ability to create incremental tables. Instead of recreating a table from scratch, an incremental table simply appends the latest data to the existing table. For instance, if you process your revenue table once a day, you can append only yesterday's data instead of rebuilding everything, by using the incremental config argument. This is how models/region_keyword_distributions.sql would look as an incremental model if it ran daily and used an event_time column:

  {{ config(materialized='incremental') }}

  SELECT
     k.keyword,
     c.region,
     COUNT(*) as region_cards_count,
     SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END) AS keyword_occurence,
     (SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END)::decimal / COUNT(*)::decimal)::decimal(4, 2) AS keyword_perc_of_total_cards
  FROM
      {{ ref('keywords') }} k
          CROSS JOIN {{ ref('cards') }} c
  WHERE
      keyword <> ''

      {% if is_incremental() %}

      AND event_time = DATEADD('day', -1, CURRENT_DATE)

      {% endif %}

  GROUP BY
      k.keyword,
      c.region

dbt lets us use if...else logic like in this example to specify that if the model is incremental, only yesterday's data should be processed. Pretty neat! Another thing you can try is having different if...else logic depending on the target you are using, which is incredibly useful when running a script in a unit test environment. For instance, I typically use it this way at work (reusing the above example):

  {{ config(materialized='incremental') }}

  SELECT
     k.keyword,
     c.region,
     COUNT(*) as region_cards_count,
     SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END) AS keyword_occurence,
     (SUM(CASE WHEN c.keywords LIKE '%' || k.keyword || '%' THEN 1 ELSE 0 END)::decimal / COUNT(*)::decimal)::decimal(4, 2) AS keyword_perc_of_total_cards
  FROM
      {{ ref('keywords') }} k
          CROSS JOIN {{ ref('cards') }} c
  WHERE
      keyword <> ''

      {% if target.name == 'unit_tests' %}

      AND (event_time = DATEADD('day', -3, CURRENT_DATE)
           OR event_time = DATEADD('day', -63, CURRENT_DATE))

      {% endif %}

  GROUP BY
      k.keyword,
      c.region

This logic simply tells dbt that if we are in a unit_tests context, it should only run the SQL for two specific days (3 days ago and 63 days ago), because we usually want tests to be fast and not run on the entire dataset. You can add a lot of customization this way and automatically handle environment constraints.

Finally, the --full-refresh argument is another useful dbt trick when running incremental models: by specifying --full-refresh we tell dbt to ignore the fact that the model is incremental and rebuild it from scratch. It's often used when the data has changed or we modified the table, and again it's all controlled from the command line without manually changing the code.

Seeds

Before we take our code for a test ride there is one last feature we need to see: seeds. dbt seeds are simply static .csv datasets that we can upload to our database as tables. This is great for files that don't change very often but that you still need as part of your ETL. An example for a game would be a design document specifying some immutable characteristics (like a character's name and attributes). This is exactly what can be found in our example project's data folder, in the data/cards.csv file. This file contains every card's characteristics; it doesn't change very often (only when the dev team releases an update) and is a good use case for a seed. To run this type of model, use the following command, with --show to display a preview of the table:

pipenv run dbt seed --profiles-dir=. --show

[Screenshot: dbt seed --show output previewing the cards table]
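
Seeds can also be configured in dbt_project.yml, for example to force column types instead of letting dbt infer them. A hedged sketch (the project key and column name below are purely illustrative):

  seeds:
    explorer:
      cards:
        column_types:
          cost: integer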

dbt in action!


Now is the fun part! Let's use our newly acquired knowledge and watch how dbt works within our pet project. We are going to execute each command in order until all our transformed data is in. Remember, everything is executed in the dbt folder (adapt your path if needed):

Install dbt dependencies

  pipenv run dbt deps --profiles-dir=.

[Screenshot: dbt deps installation output]

We successfully installed the dbt_utils package. If you navigate to dbt_modules, you'll discover that a dbt_utils folder has been created, containing everything we might need, like macros.

Run our seed model

Below you can see what our database looks like at this point. It's pretty empty, so we need to feed it some data, in our case the .csv seed containing the card information we mentioned above. At work this raw data might already exist and you could directly create models, but for a personal project you will need to import data first, either as a seed or, if you want to take it to the next level, as a daily/hourly import (like financial data for instance).

[Screenshot: the database before any data is loaded]

Let's run our seed model:

  pipenv run dbt seed --profiles-dir=.

The result is successful:

[Screenshot: successful dbt seed run output]

Here you can see the following info displayed: Concurrency: 4 threads (target='profile_dev'). You can set the number of threads in your profiles.yml file depending on your warehouse's capacity (more info here). dbt also informs us that it's running with the profile_dev target. This is the default one we set up as part of our profile, if you remember; specify it in the command with the --target= argument if you have multiple environments and targets.

Finally, if we check the database we can see the table has been created, awesome! You can also observe that the legendofruneterra-explorer schema was automatically created. Make sure you have the rights to do that in your project (especially at work, where permissions between user accounts can differ and can be pretty complicated to set up).

[Screenshot: the cards table created in the database]

Transform our data

The last step is where the magic happens! We now transform our data to suit our business needs and analysis constraints, in our case a simple table built from our collection of cards found in the cards table:

  pipenv run dbt run --profiles-dir=.

[Screenshot: dbt run output]

As we've seen earlier, dbt automatically figures out that keywords should be executed before region_keyword_distributions. The 2 tables are now available in the database:

[Screenshot: the keywords and region_keyword_distributions tables in the database]

That's it, the data is now in an actionable format and is ready to be queried for reports or visualizations. I hope that with this introduction you are now able to transform your own data and improve your autonomy at work as a data scientist/analyst. Stay tuned and subscribe to my newsletter for more articles on the subject; I aim to post about advanced usage of dbt later on.
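
As a quick illustration of that last point, a simple query against the new table could look like this (adjust the schema to wherever your target writes):

  SELECT
      region,
      keyword,
      keyword_perc_of_total_cards
  FROM
      region_keyword_distributions
  ORDER BY
      keyword_perc_of_total_cards DESC
  LIMIT 10;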

Happy coding!

If you want to keep up to date with my latest posts, don't hesitate to follow me on Twitter or subscribe to my free newsletter.
