Goal
Create a process with dbt that uses Great Expectations-style tests through the dbt_expectations package. Here is the main repository.
Description
Source
For this project, we are going to use the Soccer Events dataset as our source.
Environment
Environment with Python
For this project we are going to create a virtual environment named env_dbt_ge:
python3 -m venv env_dbt_ge
... and activate it:
source env_dbt_ge/bin/activate
From this environment we can export our requirements with this command:
python -m pip freeze > requirements.txt
Required libraries
For this project we only need these requirements:
dbt-core==1.5.1
dbt-extractor==0.4.1
dbt-postgres==1.5.1
markupsafe==2.0.1
Environment with Docker
I'm currently working on macOS with tmux, so my Docker preferences need to be changed like this:
{
  "builder": {
    "gc": {
      "defaultKeepStorage": "20GB",
      "enabled": true
    }
  },
  "experimental": false,
  "features": {
    "buildkit": false
  }
}
Steps
0 - Create a .env file
I've used this .env file for this project:
DBT_PROFILES_DIR=<YOUR_FOLDER>/projects/dbt_great_expectations
PS_VERSION=15
PS_USER=ps
PS_PASSWORD=ps_2023
PS_DB=RAW
PY_VERSION=3.9-slim-buster
1 - Create our Postgres with dbt
As we saw in the last article, we can create our Dockerfile for Postgres this way:
1.1 Docker postgres
Dockerfile for postgres
# Dockerfile for postgres
ARG PS_VERSION
FROM postgres:${PS_VERSION}
1.2. Docker dbt
In this case we only add our new commands: dbt test (to run our desired tests) and dbt deps (to download the latest version of our package).
ARG PY_VERSION
FROM python:${PY_VERSION}
# Working directory
WORKDIR /app
# Install OS dependencies
RUN apt-get update && apt-get install -qq -y \
git gcc build-essential libpq-dev --fix-missing --no-install-recommends \
&& apt-get clean
# Make sure we are using latest pip
RUN pip install --upgrade pip
# Create directory for dbt config
RUN mkdir -p /root/.dbt
# Copy requirements.txt
COPY requirements.txt requirements.txt
# Install dependencies
RUN pip install -r requirements.txt
# Copy dbt profile
COPY profiles.yml /root/.dbt/profiles.yml
# Copy source code
COPY dbt_project/ /app
WORKDIR /app/dbt_post
# Run 1- Install package + import data + check data
CMD ["/bin/bash", "-c", "dbt deps;dbt seed;dbt test"]
1.3. Docker compose
Then we can use this Docker Compose file to work with our environment:
version: '3.9'

volumes:
  db-data:
    name: "data_postres"
    driver: local

services:
  post_db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
      args:
        - PS_VERSION=${PS_VERSION}
    user: postgres
    image: postgres/test:v1
    environment:
      - POSTGRES_USER=${PS_USER}
      - POSTGRES_PASSWORD=${PS_PASSWORD}
      - PGDATA=/var/lib/postgresql/data/pgdata
      - POSTGRES_DB=${PS_DB}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${PS_USER} -d ${PS_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      - "5432:5432"
    volumes:
      - db-data:/var/lib/postgresql/data
    restart: unless-stopped

  dbt_python:
    build:
      context: .
      dockerfile: Dockerfile.python
      args:
        - PY_VERSION=${PY_VERSION}
    depends_on:
      - post_db
2 - DBT
2.1 - Import the CSV into our folder
In this step we should include our CSV files in the dbt_project/seed folder.
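Note that dbt's default seed directory is "seeds"; if the folder is called "seed" as in this project, it can be declared in dbt_project.yml, for example:
# dbt_project.yml (sketch): tell dbt where the seed CSV files live
seed-paths: ["seed"]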
2.2 - Add the package
In our dbt_project folder, we should create a file named "packages.yml" with this code:
packages:
  - package: calogica/dbt_expectations
    version: [">=0.6.0", "<0.7.0"]
2.3 - Add a schema in the seed folder
In our dbt_project/seed folder, we should create a "schema.yml" file.
There, we can declare the type of each column (char, int, ...), and also include some tests.
columns:
  - name: season
    tests:
      - unique:
          config:
            store_failures: true
If we use one of dbt's built-in tests, we can add a config block, including store_failures. The rows failing these checks are saved in our target database as "dbt_test__audit"."name_of_our_test".
The main purpose is to see what was loaded into our table when something goes wrong; for that reason, this test is expected to fail.
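As a side note, if we want to store failures for every test in the project instead of setting it test by test, dbt also accepts this config at the project level; a minimal sketch for dbt_project.yml (the custom schema name is only an example) would be:
# dbt_project.yml (sketch): store failing rows for every test in the project
tests:
  +store_failures: true
  # Optional: use a custom schema name instead of the default dbt_test__audit
  +schema: test_failures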
These tests can also be written with the Great Expectations package:
columns:
  - name: id_odsp
    tests:
      - dbt_expectations.expect_column_values_to_be_unique
In this case, I couldn't see how the package itself could store failures when something goes wrong.
Example
Here we can see an example of this use case:
version: 2

seeds:
  - name: ginf
    config:
      column_types:
        id_odsp: varchar(10)
        link_odsp: varchar(90)
        adv_stats: bool
        date: date
        league: varchar(4)
        season: int
        ht: varchar(30)
        at: varchar(30)
        odd_h: float
        odd_d: float
        odd_a: float
        odd_over: varchar(5)
        odd_under: varchar(5)
        odd_bts: varchar(5)
        odd_bts_n: varchar(5)
    columns:
      - name: id_odsp
        tests:
          - dbt_expectations.expect_column_values_to_be_unique
          - not_null
      - name: season
        tests:
          - unique:
              config:
                store_failures: true
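The package ships with many more tests than the uniqueness check used above. For instance, a range check on season could be added like this (the bounds are only illustrative, not taken from the dataset):
columns:
  - name: season
    tests:
      # Illustrative bounds; adjust them to the seasons actually present in the data
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 2011
          max_value: 2017
      - dbt_expectations.expect_column_values_to_not_be_null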
2.4 - Save it in another schema without the "_"
If we want to create our tables in another schema, dbt's default behavior is to create a schema named like "defaultschema_customschema". We can avoid this with the following solution.
2.4.1 - Add a macro
In the macros folder, we should create a SQL file with this code (the name of the file doesn't matter much):
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
2.4.2 - Set new schema in dbt_project
We should add this in dbt_project.yml:
seeds:
  +schema: src
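For reference, this config is usually scoped under the project name in dbt_project.yml; a sketch assuming the project is named dbt_post would be:
# dbt_project.yml (sketch), assuming the project is named dbt_post
seeds:
  dbt_post:
    +schema: src    # with the macro above, seeds are created in the "src" schema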
Conclusion
The work behind this dbt package is really impressive: it brings a lot of Great Expectations features to dbt. However, I've seen a few missing details that keep it from being a great feature:
- Lack of documentation: currently, if you use this package and try to document your database, you are going to see something like this:
This kind of output is not really clear enough to share with the end customer.
- Revision check: if you use dbt's built-in tests, you can inspect the data's behavior. If instead you work with this Great Expectations package, you will need some extra help to see where the error is.
- No PySpark compatibility: with Great Expectations itself you can work against a PySpark cluster. It's a little sad this isn't available here, because it could be helpful in the dbt world.
Still, I feel this package has a lot of cool features, and in the future it is going to be an essential one.