Julien Kervizic

Posted on Oct 18, 2019 • Originally published at Medium on May 14, 2019

5 Considerations to have when using Airflow

#analytics #machinelearning #data #dataengineering

In previous posts, I explained the basics of Airflow and how to set up Airflow on Azure. I have not, however, covered the considerations to weigh when using Airflow.

I see 5 main considerations when using Airflow:

What type of infrastructure to set up to support it
What type of operator model to abide by, and which operators to choose
How to architect your different DAGs and set up your tasks
Whether to leverage templated code or not
Whether and how to use its REST API

These considerations will dictate how you and your team use Airflow and how it is managed.

(1) Airflow Infrastructure — Go for a Managed Service if Possible

Setting up and maintaining Airflow is not easy. If you need to set it up yourself, you will most likely need quite a bit more than the base image:

Encryption needs to be set up to safely store secrets and credentials
An authorization layer needs to be set up, if only through the Flask login setup and preferably through an OAuth2 provider such as Google
SSL needs to be configured
The web server needs to be moved to a more production-ready setup (for example using WSGI/nginx)
Libraries and drivers need to be installed to support the different types of operations you wish to handle
…
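
For the encryption item, Airflow encrypts stored connection passwords and variables with a Fernet key, configured as fernet_key in airflow.cfg or through the AIRFLOW__CORE__FERNET_KEY environment variable. The usual way to create one is Fernet.generate_key() from the cryptography package. The sketch below does the same with the standard library only, assuming the standard Fernet key format of 32 random bytes in URL-safe base64; the function name is my own.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a new Fernet-format key: 32 random bytes, URL-safe base64 encoded."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


# Generate the key once and keep it in a secret store; losing it means
# the credentials already encrypted in the metadata database cannot be decrypted.
key = generate_fernet_key()
print(f"AIRFLOW__CORE__FERNET_KEY={key}")
```

The same key has to be available to the web server, the scheduler and every worker.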

For the simplest use cases, it is possible to rely solely on the Local executor, but once real processing needs arise, more distributed computation is required and management of the infrastructure becomes more complex.

The distributed executors also require more resources to run than a Local executor setup, where worker, scheduler and web server can live in the same container:

Celery executor: Webserver (UI), Redis (MQ), Postgres (Metadata), Flower (monitoring), Scheduler, Worker
Mesos executor: Webserver (UI), Redis (MQ), Postgres (Metadata), Mesos infra
Kubernetes executor: Webserver (UI), Postgres (Metadata), Scheduler, Kubernetes infra

The high number of components raises complexity and makes the setup harder to maintain. Debugging problems requires understanding how the Celery executor works with Airflow, or how to interact with Kubernetes.

Managed versions of Airflow exist: Google Cloud offers one through Cloud Composer, Astronomer.io offers managed deployments, and Qubole offers it as part of its data platform. Where applicable, going for a managed version is strongly recommended over setting up and managing this infrastructure yourself.

(2) Sensors, Hooks and Operators — Find your fit

Depending on your use case, you might need certain sensors, hooks or operators. Airflow has decent support for the most common operators, and good support for Google Cloud. If you have a more uncommon use case, you will probably need to check the list of user-contributed operators or develop your own.

Understanding how to use operators given your particular company setup is also important. Some take a radical stance on operators, but in reality the choice of operators needs to be made in the context of your company.

Does your company have an engineering bias that supports the use of Kubernetes or other container-style instances?
Is your company's use of Airflow driven more by your data science department, with little engineering support? For them, it might make more sense to use a Python operator or the still-pending R operator.
Is your company only planning to use Airflow to operate data transfers (SFTP/S3 …) and SQL queries to maintain a data warehouse? In that case, using K8s or any container instances would be overkill. This is for example the approach taken at Fetchr, where most of the processing is done in EMR/Presto.

Selecting your operator setup is not one-size-fits-all.

(3) DAGS — Keep them simple

There are quite a few ways to architect your DAGs in Airflow, but as a general rule it is good to keep them simple. Keep within a DAG only the tasks that truly depend on each other; when dealing with dependencies across multiple DAGs, abstract them into another DAG and file.

When dealing with a lot of data sources and interdependencies, things can get messy. Setting up DAGs as self-contained files, kept as simple as possible, goes a long way towards making your code maintainable. The external task sensor helps separate DAGs and their dependencies into multiple self-contained DAGs.

As in most distributed systems, it is important to make operations as idempotent as possible, at least within a DAG run. Certain operations may still rely on state from earlier DAG runs, for example through the depends_on_past setting.
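
To make the idempotency point concrete, here is a minimal sketch of a task body that can be re-run safely. It assumes a task that exports the rows for one run date to a file; the function name, the directory layout and the JSON format are hypothetical and only there for illustration. The idea is that the output location is derived from the run date and overwritten, rather than appended to.

```python
import json
import tempfile
from pathlib import Path


def export_partition(rows, base_dir, run_date):
    """Write the rows for one run date, replacing any earlier output for that date."""
    # Hypothetical layout: one directory per run date, so a re-run targets the same file.
    target = Path(base_dir) / f"date={run_date}" / "rows.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then swap it in, so a run that fails
    # midway never leaves a half-written rows.json behind.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)
    return target


# Running the task twice for the same date leaves exactly one copy of the data.
base_dir = tempfile.mkdtemp()
rows = [{"order_id": 1, "amount": 10.5}]
first = export_partition(rows, base_dir, "2019-05-14")
second = export_partition(rows, base_dir, "2019-05-14")
```

An append-style write would instead duplicate the rows every time the task is retried or the DAG run is cleared.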

Sub-DAGs should be used sparingly, for the same reason of code maintainability. For me, one of the only valid reasons to use Sub-DAGs is the creation of dynamic DAGs.

Communication between tasks, although possible with XCom, should be minimized in favor of self-contained functions/operators. This makes the code more legible and stateless. Unless you want to be able to re-run only one part of an operation, splitting it into tasks that communicate through XCom is hard to justify. Dynamic DAGs are one of the notable exceptions to this.

(4) Templates and Macros — Legible Code

Airflow leverages Jinja for templating. Commands such as Bash or SQL commands can easily be templated, for execution with variables provided or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g. through a format command). Jinja is the default templating engine for most Flask developers, and can also provide a good bridge for Python web developers getting into data.
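
As a rough illustration of the two styles, the sketch below puts a templated command next to the same command built with a Python format call. The ds and ds_nodash variables are part of Airflow's default template context, and bash_command is a templated field of the BashOperator; the script name and paths are made up for the example. Only the plain-Python version is executed here, since rendering the template is Airflow's job at run time.

```python
from datetime import date

# Templated style: passed as-is to a templated operator field such as
# BashOperator's bash_command; Airflow fills in {{ ds }} and {{ ds_nodash }}
# from the execution date when the task runs.
TEMPLATED_COMMAND = (
    "python load_orders.py --day {{ ds }} --out /data/orders/{{ ds_nodash }}.csv"
)


def build_command(execution_date: date) -> str:
    """Plain-Python style: build the same command with str.format."""
    return "python load_orders.py --day {ds} --out /data/orders/{ds_nodash}.csv".format(
        ds=execution_date.isoformat(),
        ds_nodash=execution_date.strftime("%Y%m%d"),
    )


command = build_command(date(2019, 5, 14))
print(command)
```

With the format version, the date has to be handed to the function from a Python callable, while the templated string can be read on its own.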

Macros provide a way to take further advantage of templating by exposing objects and functions to the templating engine. Users can leverage a set of default macros, or define their own at global or DAG level.

Using templated code does, however, take you away from vanilla Python and adds one more layer of complexity for engineers who typically already need to work with quite a large array of technologies and APIs.

Whether or not you choose to leverage templates is a team/personal choice. There are more traditional ways to obtain the same results, for example wrapping the same logic in Python format commands, but templates can make the code more legible.

(5) Event Driven — REST API for building Data Products

Airflow's REST API allows for the creation of event-driven workflows. The key feature of the API is that it lets you trigger DAG runs with a specific configuration.
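
A minimal sketch of such a trigger call, using only the standard library. The endpoint path follows the experimental API that shipped with Airflow 1.10 (POST /api/experimental/dags/<DAG_ID>/dag_runs); Airflow 2 exposes the equivalent at /api/v1/dags/{dag_id}/dagRuns, so check the documentation for your version, including whether conf is expected as a JSON object or a JSON string. The host, DAG id and conf keys below are placeholders, and authentication is left out because it depends on your deployment.

```python
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf):
    """Build (but do not send) a POST request that triggers a DAG run with a conf."""
    url = f"{base_url.rstrip('/')}/api/experimental/dags/{dag_id}/dag_runs"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: a local web server and a hypothetical "train_model" DAG.
request = build_trigger_request(
    "http://localhost:8080/", "train_model", {"customer_id": 42}
)
print(request.get_method(), request.full_url)
```

Sending it is then a matter of calling urllib.request.urlopen(request) against a running web server.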

The REST API allows for building data product applications on top of Airflow, with use cases such as:

Spinning up clusters and processing based on an HTTP request
Setting up a workflow based on a message or file appearing in, respectively, a message topic or blob storage
Building fully fledged machine learning platforms
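
On the DAG side, the configuration sent with the trigger request is available to tasks as dag_run.conf in the task context. The sketch below shows a callable that merges it over defaults; the function name and the conf keys are hypothetical. I am assuming a PythonOperator that receives the context as keyword arguments (in Airflow 1.10 this needs provide_context=True), and a SimpleNamespace stands in for the real DagRun object so the snippet runs without Airflow.

```python
from types import SimpleNamespace

# Hypothetical defaults for a DAG that spins up a cluster and processes a file.
DEFAULT_CONF = {"cluster_size": 2, "input_path": None}


def resolve_run_config(**context):
    """Merge the conf sent with the trigger request over the defaults."""
    dag_run = context.get("dag_run")
    # Scheduled runs carry no conf, so fall back to an empty dict.
    conf = (dag_run.conf if dag_run is not None else None) or {}
    unknown = set(conf) - set(DEFAULT_CONF)
    if unknown:
        raise ValueError(f"Unexpected conf keys: {sorted(unknown)}")
    return {**DEFAULT_CONF, **conf}


# Stand-in for the context Airflow would pass to the callable.
triggered = resolve_run_config(dag_run=SimpleNamespace(conf={"cluster_size": 8}))
scheduled = resolve_run_config(dag_run=SimpleNamespace(conf=None))
```

Rejecting unknown keys early keeps a malformed event from starting expensive processing with silently ignored settings.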

Leveraging the REST API allows for the construction of complex asynchronous processing patterns, while re-using the same architecture, platform and possibly code that are used for more traditional data processing.
