You've probably been hearing a lot about DBT (Data Build Tool) lately. Let's explore the power of DBT with Amazon Redshift: we will develop data pipelines using DBT, with Redshift as our data warehouse and Power BI for visualization.
What is DBT?
Let's first understand what exactly DBT is and its use case.
Data Build Tool, aka DBT, is an open-source tool that helps you apply transformations to your data using the best practices of analytics engineering.
I'm not going to explain the terms Extract Transform Load (ETL) and Extract Load Transform (ELT); I assume you're familiar with them. DBT is where the Transformation step is applied.
There are two ways to use DBT:
- DBT Core
- DBT Cloud
If you're a GUI kind of person, go with DBT Cloud; if you love working with terminals, go with DBT Core. The commands aren't difficult: familiarity with basics like ls, cd, pwd and a handful of dbt commands is enough. For this project I'll go with DBT Core.
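If you do go with DBT Core, a few commands cover most of the day-to-day work. As a quick reference (these are standard dbt CLI commands; output and flags can vary by version):

dbt debug          # test the connection to your warehouse
dbt run            # build your models
dbt test           # run the tests defined in your schema files
dbt docs generate  # generate project documentation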
Redshift
Redshift is a cloud-based data warehouse service provided by Amazon. It uses a Massively Parallel Processing (MPP) architecture, which distributes the data and processing across multiple nodes to improve query performance.
A Redshift cluster is composed of a leader node and compute nodes; you can read about its architecture in detail here.
Power BI
Power BI is a business intelligence tool by Microsoft. You can build highly interactive visualizations with simple drag and drop, and it provides plenty of data connection options as well.
If you're interested in Power BI you can further learn about it here.
Dataset
The dataset I'm using is the Sakila database. You can find the scripts to create the tables and insert the data in the following repository.
Note: These scripts are specific to Amazon Redshift and will likely throw errors on other databases.
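To give a sense of what "specific to Redshift" means, the DDL typically uses Redshift features such as distribution styles and sort keys. As a rough sketch (not the exact script from the repository), a customer table definition might look like this:

CREATE TABLE customer (
    customer_id  INTEGER NOT NULL,
    store_id     INTEGER,
    first_name   VARCHAR(45),
    last_name    VARCHAR(45),
    email        VARCHAR(100),
    address_id   INTEGER,
    active       BOOLEAN,
    create_date  TIMESTAMP,
    last_update  TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (customer_id);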
1- Create environment
You should create a dedicated Python environment for this project in order to avoid any conflicts.
If you don't have the virtualenv library installed already, then run:
pip install virtualenv
To create a virtual environment, run the following command:
python -m venv <environment-name>
To activate the environment, run:
<environment-name>\Scripts\activate.bat
Once the environment is activated, its name will appear in your command line before the path.
Environment Activation
Note: The commands differ across operating systems; the ones above are specific to Windows.
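For reference, on Linux or macOS the equivalent activation command (assuming the same environment name) would be:

source <environment-name>/bin/activate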
2- DBT installation
It's time to install DBT. Before installing, make sure you have Python version 3.7 or above; as per the documentation, versions below 3.7 aren't supported, though this may change over time. You can read about the supported versions here.
We're using Redshift, so we will use the Redshift adapter; if you're planning to use a different adapter, the command will vary accordingly. If you're following along with Redshift, run:
pip install dbt-redshift
DBT handles most project-related things on its own. You can create, or initialize, a project by simply running:
dbt init
The above command will create a project along with the boilerplate. Easy peasy, right? Okay, then what's next?
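Before we move on, here is roughly what the generated project looks like (the project name is whatever you chose during dbt init, and the exact folders can vary slightly between dbt versions):

my_dbt_project/
├── dbt_project.yml
├── models/
│   └── example/
├── macros/
├── seeds/
├── snapshots/
├── tests/
└── analyses/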
3- Redshift Cluster Setup
Before proceeding, we will set up our Amazon Redshift cluster and allow public accessibility. Public accessibility isn't recommended (you would normally keep the cluster inside a VPC), but for demo purposes we can proceed:
- Create Redshift cluster.
- Add inbound rules in the security group
- Turn on public accessibility
Once the cluster is set up, open the cluster properties, note down the Endpoint, and connect to it locally. I already have Aqua Data Studio, so I connected through it.
Connection to Redshift
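If you prefer the command line, any PostgreSQL-compatible client also works; for example with psql (the endpoint, database name, and username below are placeholders you'd replace with your own):

psql -h <your-cluster-endpoint> -p 5439 -U <username> -d dev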
Note: Redshift cost varies by different factors, so make sure to create a billing alert so you receive updates about the cost. You can read more here.
Once you're connected to your warehouse, create a schema called stg inside the database and run the table creation and insert scripts from the repository mentioned above.
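Creating the staging schema itself is a one-liner in your SQL client, for example:

CREATE SCHEMA stg;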
Now you have the data in the staging layer and you want to load it into the warehouse. The data is already cleansed, so there's no need to introduce a transformation layer between the staging and target layers.
Based on the client's requirements, we decided to go with a sort of galaxy schema.
Models
Time to build models, but not ML models; I'm talking about DBT models, where you define your core logic. Inside the models directory, create two sub-directories, dimension and fact. In each sub-directory, create a schema.yml file.
This schema file contains information about the sources and also defines some tests. This is how the schema file for the customer dimension will look:
version: 2

models:
  - name: dim_customer
    description: "Dim customer to join customer with city, address and country"
    columns:
      - name: customer_id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

sources:
  - name: stg
    database: dev
    schema: stg
    tables:
      - name: customer
      - name: address
      - name: city
      - name: country
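The unique and not_null tests declared above can later be run against the built model with the dbt CLI, for example:

dbt test -s dim_customer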
A best practice while developing your models is to use Common Table Expressions (CTEs), as they enhance the readability of the code, though it's not strictly necessary.
Now let's create the customer dimension, which combines the customer details with the address, city and country source tables.
with customer_base as (
    SELECT *,
        CONCAT(CONCAT(customer.FIRST_NAME, ' '), customer.LAST_NAME) AS FULL_NAME,
        SUBSTRING(customer.EMAIL FROM POSITION('@' IN customer.EMAIL) + 1
            FOR CHAR_LENGTH(customer.EMAIL) - POSITION('@' IN customer.EMAIL)) AS DOMAIN,
        customer.active::int AS ACTIVE_INT,
        CASE WHEN customer.ACTIVE = 0 THEN 'no' ELSE 'yes' END AS ACTIVE_DESC,
        '{{ run_started_at.strftime("%Y-%m-%d %H:%M:%S") }}' AS DBT_TIME
    FROM {{ source('stg', 'customer') }} AS customer
),
address as (
    SELECT * FROM {{ source('stg', 'address') }}
),
city as (
    SELECT * FROM {{ source('stg', 'city') }}
),
country as (
    SELECT * FROM {{ source('stg', 'country') }}
)
SELECT
    customer_base.CUSTOMER_ID,
    customer_base.STORE_ID,
    customer_base.FIRST_NAME,
    customer_base.LAST_NAME,
    customer_base.FULL_NAME,
    customer_base.EMAIL,
    customer_base.DOMAIN,
    customer_base.ACTIVE_INT AS ACTIVE,
    customer_base.ACTIVE_DESC,
    customer_base.create_date,
    customer_base.last_update,
    customer_base.DBT_TIME,
    address.ADDRESS_ID::INT,
    address.address,
    city.city_id,
    city.city,
    country.country_id,
    country.country
FROM customer_base
LEFT JOIN address
    ON customer_base.address_id = address.address_id
LEFT JOIN city
    ON address.city_id = city.city_id
LEFT JOIN country
    ON country.country_id = city.country_id
This is how models are defined in DBT. In parallel, if you open your dbt_project.yml file, at the very bottom it contains the details of how the models will be materialized:
models:
  project:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view
    dimension:
      +materialized: table
      +schema: dwh
    fact:
      +materialized: incremental
      +schema: dwh
Note: Properties defined inside the models take precedence over the ones defined in dbt_project.yml.
Just like the customer dimension, we define the other dimensions and the facts. The facts, however, are materialized incrementally, since they contain a large number of records; this saves cost when the models are rerun again and again.
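As a minimal sketch of what an incremental fact model can look like (the fact_payment name and its columns are illustrative, not the exact model from this project, and the payment source would need to be declared in your schema.yml):

{{ config(materialized='incremental', unique_key='payment_id') }}

with payments as (
    SELECT * FROM {{ source('stg', 'payment') }}
)
SELECT
    payment_id,
    customer_id,
    staff_id,
    rental_id,
    amount,
    payment_date
FROM payments
{% if is_incremental() %}
-- only process rows newer than what is already in the target table
WHERE payment_date > (SELECT MAX(payment_date) FROM {{ this }})
{% endif %}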
Now that you've defined all the models and set your target destination in the DBT profile, it's time to run the models.
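The target destination lives in your profiles.yml (usually under ~/.dbt/). A typical Redshift profile looks something like the sketch below; the top-level key must match the profile name set in your dbt_project.yml, and the values are placeholders you'd replace with your own:

project:
  target: dev
  outputs:
    dev:
      type: redshift
      host: <your-cluster-endpoint>
      port: 5439
      user: <username>
      password: <password>
      dbname: dev
      schema: stg
      threads: 4

With the profile in place, run the following command to build the models: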
dbt run -m dimension
This command will run all the models in the dimension directory. However, if you want to run a specific model, try this command:
dbt run -s model_name
Once you run this command and everything is defined correctly, the data will be inserted into your target schema, dwh in my case.
That's how you can build your DBT pipeline. If you have good knowledge of SQL and a bit of Python, then you are good to develop complex pipelines on your own.
Finally, our warehouse is ready and users can now perform analysis as per their requirements. Now there's a requirement from the user to build a dashboard on top of that cleansed data. We have access to Power BI Desktop. In order to make a connection with Redshift, we have to provide the following details in Power BI:
server-name: End point of your Redshift
Database: dbname
username: username
password: password
Once you provide these details, you can either direct query the source or import the tables. With import, data is cached inside Power BI; with direct query, Power BI hits the source directly to fetch data.
Now you can play around and build an amazing dashboard for your users. For demo purposes I've built this one, but it can be improved a lot by utilizing DAX functions.
You can find the code related to the project here.
That's it, Tada :D.
Conclusions:
- Create a separate environment for the project and choose your adapter beforehand.
- Define tests and documentation.
- Run individual models if you're working in the cloud; running all the models every time can add cost.
- Modularize your logic so you can reuse the same logic in multiple places via references.
- Properties defined inside a model take precedence over those defined in the dbt_project.yml file; they basically override them.
- Go for incremental materialization if your data is quite large.