
Mpho Mphego

Originally published at blog.mphomphego.co.za

How To Build An ETL Using Python, Docker, PostgreSQL And Airflow


30 Min Read


Updated: 2022-02-18 06:54:15 +02:00

The Story

During the past few years, I have developed an interest in Machine Learning but never wrote much about the topic. In this post, I want to share some insights about the foundational layers of the ML stack. I will start with the basics of the ML stack and then move on to the more advanced topics.

This post will detail how to build an ETL (Extract, Transform and Load) using Python, Docker, PostgreSQL and Airflow.

You will need to sit down comfortably for this one, it will not be a quick read.

Before we get started, let's take a look at what ETL is and why it is important.

One of the foundational layers when it comes to Machine Learning is ETL (Extract, Transform and Load). According to Wikipedia:

ETL is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source(s) or in a different context than the source(s).
Data extraction involves extracting data from (one or more) homogeneous or heterogeneous sources; data transformation processes data by data cleaning and transforming it into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.

One might begin to wonder, Why do we need an ETL pipeline?

Assume we had a set of data that we wanted to use. However, as with most real-world data, it is unclean, missing information and inconsistent. One solution would be to have a program clean and transform this data (a minimal pandas sketch of such a clean-up follows the list below) so that:

  • There is no missing information
  • Data is consistent
  • Data is fast to load into another program
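
As a rough illustration only (the column names and values here are made up, not taken from the wine dataset we use later), such a clean-up step could look like this in pandas:

import pandas as pd

# Hypothetical messy input: a missing price and inconsistently formatted regions.
raw = pd.DataFrame(
    {"price": ["10.5", None, "7.2", "7.2"], "region": ["ZA", "za", "ZA ", "ZA "]}
)

clean = (
    raw.assign(
        price=pd.to_numeric(raw["price"]).fillna(0.0),  # no missing information
        region=raw["region"].str.strip().str.upper(),   # consistent values
    )
    .drop_duplicates()                                  # smaller and faster to load
)
print(clean)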

With smart devices, online communities and e-commerce, there is an abundance of raw, unfiltered data in today's industry. However, most of it goes to waste because it is tangled and difficult to interpret. ETL pipelines combat this by automating data collection and transformation so that analysts can use the results for business insights.

There are a lot of different tools and frameworks that are used to build ETL pipelines. In this post, I will focus on how one can tediously build an ETL using Python, Docker, PostgreSQL and Airflow tools.


TL;DR

There's no free lunch. Read the whole post.

Code used in this post is available on https://github.com/mmphego/simple-etl

The How

For this post, we will be using data from the UC Irvine (UCI) Machine Learning Repository. The dataset contains Wine Quality information derived from the chemical analysis of various wines grown in Portugal.

We will need to extract the data from a public repository (for this post, I uploaded the data to gist.github.com) and transform it into a format that can be used by ML algorithms (not part of this post). We will then load both the raw and the transformed data into a PostgreSQL database running in a Docker container, and finally create a DAG that runs the ETL pipeline periodically in Airflow.

The Walk-through

Before we can do any transformation, we need to extract the data from a public repository. Using Python and Pandas, we will extract the data from a public repository and upload the raw data to a PostgreSQL database. This assumes that we have an existing PostgreSQL database running in a Docker container.

The Setup

Let's start by setting up our environment. First, we will set up our Jupyter Notebook and PostgreSQL database. Then, we will set up Apache Airflow (a fancy cron-like scheduler).

Setup PostgreSQL and Jupyter Notebook

In this section, we will set up the PostgreSQL database and Jupyter Notebook. First, we need to create a .env file in the project directory. This file will contain the PostgreSQL database credentials, which are needed by the docker-compose files.

cat << EOF > .env
POSTGRES_DB=winequality
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_HOST=database
POSTGRES_PORT=5432
EOF

Once we have the .env file, we can create a Postgres container instance that we will use as our Data Warehouse.
The code below will create a postgres-docker-compose.yaml file that contains all the necessary information to run the container, including a Jupyter Notebook service that we can use to interact with the container and/or the data.

cat << EOF > postgres-docker-compose.yaml
version: "3.8"
# Optional Jupyter Notebook service
services:
  jupyter_notebook:
    image: "jupyter/minimal-notebook"
    container_name: ${CONTAINER_NAME:-jupyter_notebook}
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ${PWD}:/home/jovyan/work
    depends_on:
      - database
    links:
      - database
    networks:
      - etl_network

  database:
    image: "postgres:11"
    container_name: ${CONTAINER_NAME:-database}
    ports:
      - "5432:5432"
    expose:
      - "5432"
    environment:
      POSTGRES_DB: "${POSTGRES_DB}"
      POSTGRES_HOST: "${POSTGRES_HOST}"
      POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
      POSTGRES_PORT: "${POSTGRES_PORT}"
      POSTGRES_USER: "${POSTGRES_USER}"
    healthcheck:
      test:
        [
          "CMD",
          "pg_isready",
          "-U",
          "${POSTGRES_USER}",
          "-d",
          "${POSTGRES_DB}"
        ]
      interval: 5s
      retries: 5
    restart: always
    volumes:
      - /tmp/pg-data/:/var/lib/postgresql/data/
      - ./init-db.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - etl_network

volumes:
  dbdata: null

# Create a custom network for bridging the containers
networks:
  etl_network: null
EOF

But before we can run the container, we need to create the init-db.sql file that contains the SQL command to create the database. This file is executed by the Postgres image's entrypoint script when the container is first initialized. Read more about the Postgres Docker entrypoint here.

# Make sure the .env variables are loaded so that ${POSTGRES_DB} expands below
source .env
cat << EOF > init-db.sql
CREATE DATABASE ${POSTGRES_DB};
EOF

After creating the postgres-docker-compose.yaml file, we need to source the .env file, create a docker network (the docker network will ensure all containers are interconnected) and then run the docker-compose up command to start the container.

Note that the current local directory is mounted to the /home/jovyan/work directory in the container, i.e. all the files in the local directory will be available inside the container.

source .env
# Install yq (https://github.com/mikefarah/yq/#install) to parse the YAML file and retrieve the network name
NETWORK_NAME=$(yq eval '.networks' postgres-docker-compose.yaml | cut -f 1 -d':')
docker network create $NETWORK_NAME
# or hardcode the network name from the YAML file
# docker network create etl_network
docker-compose --env-file ./.env -f ./postgres-docker-compose.yaml up -d

When we run the docker-compose up command, we will see the following output:

Starting database_1 ... done
Starting jupyter_notebook_1 ... done

Since the containers are running in detached mode, we need to inspect the container logs to retrieve the URL of the Jupyter Notebook. The command below will print the URL (with access token) of the Jupyter Notebook.

docker logs $(docker ps -q --filter "ancestor=jupyter/minimal-notebook") 2>&1 | grep 'http://127.0.0.1' | tail -1

Once everything is running, we can open the Jupyter Notebook in the browser using the URL from the logs and have fun.


Setup Airflow

In this section, we will set up the Airflow environment. A quick overview: Apache Airflow is an open-source tool for orchestrating complex computational workflows and creating data processing pipelines. Think of it as a fancy version of a job scheduler or cron job. A workflow is a series of tasks executed in a specific order; in Airflow, workflows are called DAGs. A DAG (Directed Acyclic Graph) is a graph of tasks (nodes) connected by dependencies (directed edges), with no cycles.

Below is an example of what a simple DAG looks like.

Read more about DAGs here: https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html
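
Here is a minimal, self-contained sketch (purely illustrative, not the DAG we build later): two placeholder tasks, where the second only runs once the first has succeeded.

from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# A tiny two-node DAG: extract_task -> load_task (a directed edge, no cycles).
with DAG(dag_id="hello_dag", start_date=days_ago(1), schedule_interval=None) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    load_task = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    extract_task >> load_task  # the directed edge: load depends on extract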

Okay, now that we've got the basics of what Airflow and DAGs are, let's set up Airflow. First, we need to create our custom Airflow Docker image. This image installs a list of Python packages that we need to run the ETL (Extract, Transform and Load) pipeline.

First, let's create a list of Python packages that we will need to install.

Run the following command to create the requirements.txt file:

cat << EOF > requirements.txt
pandas==1.3.5
psycopg2-binary==2.8.6
python-dotenv==0.19.2
SQLAlchemy==1.3.24
EOF


Then we will create a Dockerfile that will install the required Python packages (Ideally, we should only install packages in a virtual environment but for this post, we will install all packages in the Dockerfile).

cat << EOF > airflow-dockerfile
FROM apache/airflow:2.2.3
ADD requirements.txt /usr/local/airflow/requirements.txt
RUN pip install --no-cache-dir -U pip setuptools wheel
RUN pip install --no-cache-dir -r /usr/local/airflow/requirements.txt
EOF

Now we can create a Docker Compose file that will run the Airflow containers. The airflow-docker-compose.yaml below is a modified version of the official Airflow docker-compose.yaml, with the following changes:

  • Customized Airflow image that includes the installation of Python dependencies.
  • Removes the example DAGs and rescans the DAGs folder every 60 seconds.
  • Sets a memory limit of 4 GB per service.
  • Allocates 2 Gunicorn workers for the webserver.
  • Adds our .env file to the Airflow containers and,
  • A custom network for bridging the containers (Jupyter, PostgresDB and Airflow).

When deployed, the airflow-docker-compose.yaml file will start the following containers:

  • airflow-scheduler - The scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete.
  • airflow-webserver - The webserver is available at http://localhost:8080.
  • airflow-worker - The worker that executes the tasks given by the scheduler.
  • airflow-init - The initialization service.
  • flower - The flower app for monitoring the environment. It is available at http://localhost:5555.
  • postgres - The database.
  • redis - The redis-broker that forwards messages from scheduler to worker.
cat << EOF > airflow-docker-compose.yaml
---
version: '3'
x-airflow-common:
  &airflow-common
  build:
    context: .
    dockerfile: airflow-dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    # Scan for DAGs every 60 seconds
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: '60'
    AIRFLOW__WEBSERVER__SECRET_KEY: '3d6f45a5fc12445dbac2f59c3b6c7cb1'
    # Prevent Airflow from re-parsing the DAG files too frequently:
    AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: '60'
    # 2 * NUM_CPU_CORES + 1
    AIRFLOW__WEBSERVER__WORKERS: '2'
    # Kill workers if they don't start within 5min instead of 2min
    AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT: '300'

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins

  env_file:
    - ./.env
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  mem_limit: 4000m
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
  networks:
    - etl_network

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: [ "CMD", "pg_isready", "-U", "airflow" ]
      interval: 5s
      retries: 5
    restart: always
    networks:
      - etl_network

  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
    mem_limit: 4000m
    networks:
      - etl_network

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test:
        [
          "CMD",
          "curl",
          "--fail",
          "http://localhost:8080/health"
        ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test:
        [
          "CMD-SHELL",
          'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"'
        ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-init:
    <<: *airflow-common
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: [ "CMD", "curl", "--fail", "http://localhost:5555/" ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    mem_limit: 4000m

volumes:
  postgres-db-volume: null

# Create a custom network for bridging the containers
networks:
  etl_network: null
EOF

Before starting Airflow for the first time, we need to prepare our environment. We need to add the Airflow user ID to our .env file so that the directories we mount into the containers are owned by our host user rather than root. The directories are:

  • ./dags - you can put your DAG files here.
  • ./logs - contains logs from task execution and scheduler.
  • ./plugins - you can put your custom plugins here.

The following commands create the directories and append the Airflow user and group IDs to the .env file.

mkdir -p ./dags ./logs ./plugins
chmod -R 777 ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" >> .env
echo -e "AIRFLOW_GID=0" >> .env

After that, we need to initialize the Airflow database. We can do this by running the following command:

docker-compose -f airflow-docker-compose.yaml up airflow-init

This will initialize the Airflow metadata database and create the default Airflow user.
Once the initialization completes, we can start the Airflow services.

docker-compose -f airflow-docker-compose.yaml up -d

Running docker ps will show us the list of containers running and we should make sure that the status of all containers is Up as shown in the image below.

image

Once we have confirmed that the Airflow, Jupyter and database services are running, we can open the Airflow webserver UI.

The webserver is available at http://localhost:8080. The default account has the login airflow and the password airflow.
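
As an optional sanity check (just a sketch, assuming the webserver is reachable on localhost:8080), we can hit the same /health endpoint that the compose file's healthcheck uses:

import json
from urllib.request import urlopen

# Airflow's health endpoint reports the metadatabase and scheduler status.
with urlopen("http://localhost:8080/health") as response:
    print(json.loads(response.read()))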

Now that all the hard work is done, we can create our ETL and DAGs.


Memory and CPU utilization

When all the containers are running, you may experience system lag if your machine cannot handle the load. Monitoring the CPU and memory utilization of the containers is crucial for maintaining good performance and a reliable system. To do that, we use the docker stats command, which gives us a live view of each running container's CPU, memory, network and disk utilization.

docker stats

The output of the above command will look like the following:

CONTAINER ID   NAME                          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
c857cddcac2b   dataeng_airflow-scheduler_1   89.61%    198.8MiB / 3.906GiB   4.97%     2.49MB / 3.72MB   0B / 0B           3
b4be499c5e4f   dataeng_airflow-worker_1      0.29%     1.286GiB / 3.906GiB   32.93%    304kB / 333kB     0B / 172kB        21
20af4408fd3d   dataeng_flower_1              0.14%     156.1MiB / 3.906GiB   3.90%     155kB / 93.4kB    0B / 0B           74
075bb3178876   dataeng_airflow-webserver_1   0.11%     715.4MiB / 3.906GiB   17.89%    1.19MB / 808kB    0B / 8.19kB       30
967341194e93   dataeng_postgres_1            4.89%     43.43MiB / 15.26GiB   0.28%     4.85MB / 4.12MB   0B / 4.49MB       15
a0de99b6e4b5   dataeng_redis_1               0.12%     7.145MiB / 15.26GiB   0.05%     413kB / 428kB     0B / 4.1kB        5
6ad0eacdcfe2   jupyter_notebook              0.00%     128.7MiB / 15.26GiB   0.82%     800kB / 5.87MB    91.2MB / 12.3kB   3
4ba2e98a551a   database                      6.80%     25.97MiB / 15.26GiB   0.17%     19.7kB / 0B       94.2kB / 1.08MB   7

Clean Up

To stop and remove all the containers, including the bridge network, run the following command:

docker-compose -f airflow-docker-compose.yaml down --volumes --rmi all
docker-compose -f postgres-docker-compose.yaml down --volumes --rmi all
docker network rm etl_network

Extract, Transform and Load

Now that we have the Jupyter, Airflow and Postgres services running, we can start creating our ETL. Open the Jupyter Notebook and create a new notebook called Simple ETL. For this post, we will use the wine quality dataset introduced earlier.

Step 0: Install the required libraries

We need to install the required libraries for our ETL; these include:

  • pandas: Used for data manipulation
  • python-dotenv: Used for loading environment variables
  • SQLAlchemy: Used for connecting to databases (Postgres)
  • psycopg2: PostgreSQL database driver used by SQLAlchemy
!pip install -r requirements.txt

Step 1: Import libraries and load the environment variables

The first step is to import all the modules, load the environment variables and create the connection_uri variable that will be used to connect to the Postgres database.

import os

import pandas as pd

from dotenv import dotenv_values
from sqlalchemy import create_engine, inspect

CONFIG = dotenv_values('.env')
if not CONFIG:
    CONFIG = os.environ

connection_uri = "postgresql+psycopg2://{}:{}@{}:{}".format(
    CONFIG["POSTGRES_USER"],
    CONFIG["POSTGRES_PASSWORD"],
    CONFIG['POSTGRES_HOST'],
    CONFIG["POSTGRES_PORT"],
)
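
With the sample .env values above, connection_uri should expand to roughly the following (note that the host is the database service name, which resolves over the shared Docker network):

print(connection_uri)
# -> postgresql+psycopg2://postgres:postgres@database:5432 (with the sample .env values)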

Step 2: Create a connection to the Postgres database

We will treat this database as a fake production database that will house both our raw and transformed data.

engine = create_engine(connection_uri, pool_pre_ping=True)
engine.connect()

Step 3: Extract the data from the hosting service

Once we have a connection to the Postgres database, we can pull a copy of the UCI wine quality dataset that I recently uploaded to https://gist.github.com/mmphego

dataset = "https://gist.githubusercontent.com/mmphego/5b6fc4d6dc3c8fba4fce9d994a2fe16b/raw/ab5df0e76812e13df5b31e466a5fb787fac0599a/wine_quality.csv"

df = pd.read_csv(dataset)

It is always a good idea to check the data before you start working with it.

df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality winecolor
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6 white
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6 white
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6 white
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 white
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 white

We also need to have an understanding of the data types that we will be working with. This will give us a clear indication of some features we need to engineer or any missing values that we need to fill in.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64
 12  winecolor             6497 non-null   object
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB

From the above information, we can see that there are a total of 6497 rows and 13 columns. But the 13th column is the winecolor column and it does not contain numerical values. We need to convert/transform this column into numerical values.
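
Although df.info() already shows 6497 non-null entries for every column, we can make the missing-value check explicit; a quick sketch:

# Confirm there are no missing values before we start transforming.
print(df.isnull().sum())
assert not df.isnull().values.any(), "unexpected missing values in the dataset"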

Now, let's check the table summary, which gives us a quick statistical overview of the numerical data: the count of non-null values, the mean, standard deviation, min, max, and the 25th, 50th and 75th percentiles.

df.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000 6497.000000
mean 7.215307 0.339666 0.318633 5.443235 0.056034 30.525319 115.744574 0.994697 3.218501 0.531268 10.491801 5.818378
std 1.296434 0.164636 0.145318 4.757804 0.035034 17.749400 56.521855 0.002999 0.160787 0.148806 1.192712 0.873255
min 3.800000 0.080000 0.000000 0.600000 0.009000 1.000000 6.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.400000 0.230000 0.250000 1.800000 0.038000 17.000000 77.000000 0.992340 3.110000 0.430000 9.500000 5.000000
50% 7.000000 0.290000 0.310000 3.000000 0.047000 29.000000 118.000000 0.994890 3.210000 0.510000 10.300000 6.000000
75% 7.700000 0.400000 0.390000 8.100000 0.065000 41.000000 156.000000 0.996990 3.320000 0.600000 11.300000 6.000000
max 15.900000 1.580000 1.660000 65.800000 0.611000 289.000000 440.000000 1.038980 4.010000 2.000000 14.900000 9.000000

Looking at the data, we can see a few things:

  • Since our data contains categorical variables (winecolor), we can use one-hot encoding to transform the categorical variables into binary variables
  • We can normalize the data by transforming it to have zero mean and unit variance, i.e. standardize the data so that it is centred around zero

Step 4: Transform the data into usable format

Now that we have an idea of what our data looks like, we can use the pandas.get_dummies function to transform the categorical variables into binary variables then drop the original categorical variables.

df_transform = df.copy()
winecolor_encoded = pd.get_dummies(df_transform['winecolor'], prefix='winecolor')
df_transform[winecolor_encoded.columns.to_list()] = winecolor_encoded
df_transform.drop('winecolor', axis=1, inplace=True)

Then we can normalize the data by subtracting the mean and dividing by the standard deviation. This will ensure that the data is centred around zero and has a standard deviation of 1. Instead of using sklearn.preprocessing.StandardScaler, we will use the z-score normalization (also known as standardization) method.

for column in df_transform.columns:
    df_transform[column] = (df_transform[column] -
        df_transform[column].mean()) / df_transform[column].std()
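
As a side note, the same standardization can be expressed without the loop, since pandas computes the mean and standard deviation per column and broadcasts them; an equivalent sketch:

# Equivalent vectorized form of the loop above.
df_transform = (df_transform - df_transform.mean()) / df_transform.std()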

After transforming the data, we can now take a look:

df_transform.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality winecolor_red winecolor_white
0 -0.166076 -0.423150 0.284664 3.206682 -0.314951 0.815503 0.959902 2.102052 -1.358944 -0.546136 -1.418449 0.207983 -0.571323 0.571323
1 -0.706019 -0.240931 0.147035 -0.807775 -0.200775 -0.931035 0.287595 -0.232314 0.506876 -0.277330 -0.831551 0.207983 -0.571323 0.571323
2 0.682405 -0.362411 0.559923 0.306184 -0.172231 -0.029596 -0.331634 0.134515 0.258100 -0.613338 -0.328496 0.207983 -0.571323 0.571323
3 -0.011807 -0.666110 0.009405 0.642474 0.056121 0.928182 1.242978 0.301255 -0.177258 -0.882144 -0.496181 0.207983 -0.571323 0.571323
4 -0.011807 -0.666110 0.009405 0.642474 0.056121 0.928182 1.242978 0.301255 -0.177258 -0.882144 -0.496181 0.207983 -0.571323 0.571323

Then check what the data looks like after normalization:

df_transform.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality winecolor_red winecolor_white
count 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6497.000000 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03 6.497000e+03
mean 2.099803e-16 -2.449770e-16 3.499672e-17 3.499672e-17 -3.499672e-17 -8.749179e-17 0.000000 -3.517170e-15 2.720995e-15 2.099803e-16 -8.399212e-16 -2.821610e-16 -3.499672e-17 1.749836e-16
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -2.634386e+00 -1.577208e+00 -2.192664e+00 -1.017956e+00 -1.342536e+00 -1.663455e+00 -1.941631 -2.529997e+00 -3.100376e+00 -2.091774e+00 -2.089189e+00 -3.227439e+00 -5.713226e-01 -1.750055e+00
25% -6.288845e-01 -6.661100e-01 -4.722972e-01 -7.657389e-01 -5.147590e-01 -7.620156e-01 -0.685480 -7.858922e-01 -6.748102e-01 -6.805395e-01 -8.315512e-01 -9.371575e-01 -5.713226e-01 5.713226e-01
50% -1.660764e-01 -3.016707e-01 -5.940918e-02 -5.135217e-01 -2.578628e-01 -8.593639e-02 0.039904 6.448391e-02 -5.287017e-02 -1.429263e-01 -1.608107e-01 2.079830e-01 -5.713226e-01 5.713226e-01
75% 3.738663e-01 3.664680e-01 4.911081e-01 5.584015e-01 2.559297e-01 5.901428e-01 0.712210 7.647937e-01 6.312639e-01 4.618885e-01 6.776148e-01 2.079830e-01 -5.713226e-01 5.713226e-01
max 6.698910e+00 7.533774e+00 9.230570e+00 1.268585e+01 1.584097e+01 1.456245e+01 5.736815 1.476765e+01 4.922650e+00 9.870119e+00 3.695947e+00 3.643405e+00 1.750055e+00 5.713226e-01
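
The means are effectively zero (the tiny e-15/e-16 values are floating-point noise) and the standard deviations are exactly 1, which is what we expect from z-score normalization. If we want to assert this programmatically, a small sketch:

import numpy as np

# Every column should now have mean ~0 (within floating-point error) and std ~1.
assert np.allclose(df_transform.mean(), 0.0, atol=1e-10)
assert np.allclose(df_transform.std(), 1.0)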

Step 5: Load the data into a database

If we are happy with the results, we can load both dataframes into our database. Since we do not have any tables in our database and our dataset is small, we can get away with using the .to_sql method to write the data to tables in the database.

raw_table_name = 'raw_wine_quality_dataset'
df.to_sql(raw_table_name, engine, if_exists='replace')
transformed_table_name = 'clean_wine_quality_dataset'
df_transform.to_sql(transformed_table_name, engine, if_exists='replace')

This will create two tables in our database, namely raw_wine_quality_dataset and clean_wine_quality_dataset.

For a sanity check, we can verify that both tables were successfully written to the database using SQLAlchemy's inspector:

def check_table_exists(table_name, engine):
    if table_name in inspect(engine).get_table_names():
        print(f"{table_name!r} exists in the DB!")
    else:
        print(f"{table_name} does not exist in the DB!")

check_table_exists(raw_table_name, engine)
check_table_exists(transformed_table_name, engine)

Well, that was a lot of work! But we can do even better: we can use the .read_sql method to read the data back from the database and, if needed, use the .drop_duplicates method to remove duplicate rows.

pd.read_sql(f"SELECT * FROM {raw_table_name}", engine)
pd.read_sql(f"SELECT * FROM {transformed_table_name}", engine)

Well done, we successfully wrote our data into the database. Our ETL pipeline is now complete; the only thing left to do is to make it repeatable via Airflow.

Airflow ETL Pipeline

Now that we have a working ETL, we can start building our Airflow DAG around it.

We can reuse our Jupyter notebook and write the DAG to file as a Python script by putting the magic command %%writefile dags/simple_etl_dag.py at the top of the cell.
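
In other words, the notebook cell would start roughly like this, with the code from the steps below pasted underneath the magic:

%%writefile dags/simple_etl_dag.py
# Everything in this cell below the magic is written to dags/simple_etl_dag.py,
# i.e. the imports, the DAG object, the ETL functions and the operators from the steps below.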

Step 1: Import the necessary libraries

First, we need to import the necessary libraries. To create a DAG in Airflow, you always have to import the DAG class from airflow.models. We then import the PythonOperator (since we will be executing Python logic) and, finally, days_ago to get a datetime object representing n days ago.

import os

from functools import wraps

import pandas as pd

from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python import PythonOperator

from dotenv import dotenv_values
from sqlalchemy import create_engine, inspect

Step 2: Create a DAG object

After importing the necessary libraries, we can create a DAG object. We will use the DAG class from airflow.models to create a DAG object. A DAG object must have a dag_id, a schedule_interval, and a start_date. The dag_id is the unique name of the DAG, and the schedule_interval is the interval at which the DAG will be executed. The start_date is the date at which the DAG will start. We can also add a default_args parameter to the DAG object, which is a dictionary of default arguments that may include owner information, a description, and a default start_date.

args = {"owner": "Airflow", "start_date": days_ago(1)}

dag = DAG(dag_id="simple_etl_dag", default_args=args, schedule_interval=None)

Step 3: Define a logging function

For the sake of simplicity, we will create a simple logging decorator that will be used to log the execution of each function, using print statements of course.


def logger(fn):
    from datetime import datetime, timezone

    @wraps(fn)
    def inner(*args, **kwargs):
        called_at = datetime.now(timezone.utc)
        print(f">>> Running {fn.__name__!r} function. Logged at {called_at}")
        to_execute = fn(*args, **kwargs)
        print(f">>> Function: {fn.__name__!r} executed. Logged at {called_at}")
        return to_execute

    return inner


Step 4: Create an ETL function

We will refactor the ETL pipeline we defined above into functions that can be called by the DAG, and use the logger decorator to log their execution.

DATASET_URL = "https://gist.githubusercontent.com/mmphego/5b6fc4d6dc3c8fba4fce9d994a2fe16b/raw/ab5df0e76812e13df5b31e466a5fb787fac0599a/wine_quality.csv"


CONFIG = dotenv_values(".env")
if not CONFIG:
    CONFIG = os.environ


@logger
def connect_db():
    print("Connecting to DB")
    connection_uri = "postgresql+psycopg2://{}:{}@{}:{}".format(
        CONFIG["POSTGRES_USER"],
        CONFIG["POSTGRES_PASSWORD"],
        CONFIG["POSTGRES_HOST"],
        CONFIG["POSTGRES_PORT"],
    )

    engine = create_engine(connection_uri, pool_pre_ping=True)
    engine.connect()
    return engine


@logger
def extract(dataset_url):
    print(f"Reading dataset from {dataset_url}")
    df = pd.read_csv(dataset_url)
    return df


@logger
def transform(df):
    # transformation
    print("Transforming data")
    df_transform = df.copy()
    winecolor_encoded = pd.get_dummies(df_transform["winecolor"], prefix="winecolor")
    df_transform[winecolor_encoded.columns.to_list()] = winecolor_encoded
    df_transform.drop("winecolor", axis=1, inplace=True)

    for column in df_transform.columns:
        df_transform[column] = (
            df_transform[column] - df_transform[column].mean()
        ) / df_transform[column].std()
    return df_transform

@logger
def check_table_exists(table_name, engine):
    if table_name in inspect(engine).get_table_names():
        print(f"{table_name!r} exists in the DB!")
    else:
        print(f"{table_name} does not exist in the DB!")


@logger
def load_to_db(df, table_name, engine):
    print(f"Loading dataframe to DB on table: {table_name}")
    df.to_sql(table_name, engine, if_exists="replace")
    check_table_exists(table_name, engine)

@logger
def tables_exists():
    db_engine = connect_db()
    print("Checking if tables exists")
    check_table_exists("raw_wine_quality_dataset", db_engine)
    check_table_exists("clean_wine_quality_dataset", db_engine)
    db_engine.dispose()

@logger
def etl():
    db_engine = connect_db()

    raw_df = extract(DATASET_URL)
    raw_table_name = "raw_wine_quality_dataset"

    clean_df = transform(raw_df)
    clean_table_name = "clean_wine_quality_dataset"

    load_to_db(raw_df, raw_table_name, db_engine)
    load_to_db(clean_df, clean_table_name, db_engine)

    db_engine.dispose()


Step 5: Create a PythonOperator

Now that we have our ETL function defined, we can create a PythonOperator for the ETL and data verification functions. One of the best practices is to use the DAG as a context manager, which avoids having to pass dag=dag to every task (forgetting to do so is a common source of Airflow errors).

with dag:
    run_etl_task = PythonOperator(task_id="run_etl_task", python_callable=etl)
    run_tables_exists_task = PythonOperator(
        task_id="run_tables_exists_task", python_callable=tables_exists)

    run_etl_task >> run_tables_exists_task


That's it! Now, we can head out to the Airflow UI and check if our DAG was created successfully.
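
Before (or instead of) checking the UI, we could also confirm that the file parses cleanly from inside one of the Airflow containers (e.g. docker exec into the scheduler, where the airflow package is installed); a rough sketch:

from airflow.models import DagBag

# Parse everything under the DAGs folder and look for import errors.
dag_bag = DagBag()
print(dag_bag.import_errors)              # should be an empty dict
print("simple_etl_dag" in dag_bag.dags)   # should print True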

image

Step 6: Run the DAG

After we log in to the Airflow UI, we should notice that the DAG was created successfully. You should see an image similar to the one below.

image

If we are happy with the DAG, we can now run it by clicking on the green play button and selecting Trigger DAG. This will start the DAG execution.

Let's open the last successful run of the DAG and see the logs. The image below shows the graph representation of the DAG.

image

Looks like the DAG was executed successfully, everything is Green!
Now, we can check the logs of the DAG to see the execution of the ETL function by clicking on an individual task and then clicking on the Logs tab.

image

The logs show that the ETL function was executed successfully.

image

That concludes the walk-through. If you have gotten this far, I hope you enjoyed the post and found it useful.


Conclusion

In this post, we have covered the basics of creating your very own ETL pipeline: how to run multiple interconnected Docker containers, data manipulation and feature engineering techniques, simple ways of reading and writing data to a database, and finally, how to create a DAG in Airflow. In the next post, I will explore a less tedious way of creating an ETL pipeline using AWS services. So stick around and learn more!

FYI, it took me a week to write this post; I spent the time getting a better understanding of Docker networking, Postgres fundamentals, the Airflow ecosystem and how to create a DAG. It was a great learning experience and I hope you found this post useful.

