I see five main considerations when using Airflow:
- What type of infrastructure to set up to support it
- What type of operator model to abide by, and which operators to choose
- How to architect your different DAGs and set up your tasks
- Whether to leverage templated code or not
- Whether and how to use its REST API
These considerations will dictate how you and your team use Airflow and how it will be managed.
Setting up and maintaining Airflow isn’t easy. If you need to set it up yourself, you will most likely need quite a bit more than the base image:
- Encryption needs to be set up to safely store secrets and credentials
- An authorization layer needs to be set up, if only through the Flask login setup and preferably through an OAuth2 provider such as Google
- SSL needs to be configured
- The web server needs to be moved to a more production-ready setup (for example using WSGI/nginx)
- Libraries and drivers need to be installed to support the different types of operations you wish to handle
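For the encryption point, Airflow encrypts stored connection passwords with a Fernet key. The snippet below is a minimal sketch of generating one with the standard library only, on the assumption that a Fernet key is 32 random bytes encoded as URL-safe base64. The helper name generate_fernet_key is mine, not part of Airflow.

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return a random Fernet-compatible key: 32 random bytes, URL-safe base64."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


key = generate_fernet_key()

# Airflow reads the key from the [core] fernet_key setting, which can also be
# supplied through the AIRFLOW__CORE__FERNET_KEY environment variable.
print(f"AIRFLOW__CORE__FERNET_KEY={key}")
```

In a real deployment the key would go into a secret store rather than being printed, and it must be identical on every Airflow component.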
For the simplest use cases, it is possible to rely solely on the Local executor, but once real processing needs arise, so does the need for more distributed computation, and managing the infrastructure becomes more complex.
Distributed executors also require more resources to run than a Local executor setup, where the worker, scheduler and web server can live in the same container:
- Celery executor: Webserver (UI), Redis (MQ), Postgres (Metadata), Flower (monitoring), Scheduler, Worker
- Mesos executor: Webserver (UI), Redis (MQ), Postgres (Metadata), Mesos infra
- Kubernetes executor: Webserver (UI), Postgres (Metadata) and Scheduler, Kubernetes infra
The high number of components raises complexity and makes the setup harder to maintain. Debugging problems requires understanding how the Celery executor works with Airflow, or how to interact with Kubernetes.
Managed versions of Airflow exist: Google Cloud offers one through Cloud Composer, Astronomer.io offers managed versions, and Qubole offers it as part of its data platform. Where applicable, going for a managed version is strongly recommended over setting up and managing this infrastructure yourself.
Depending on your use case, you might want to use certain sensors, hooks or operators. Airflow has decent support for the most common operators, and good support on Google Cloud, but if you have a more uncommon use case you will probably need to check the list of user-contributed operators or develop your own.
Understanding how to use operators given your particular company setup is also important. Some have a radical stance with respect to operators, but the reality is that the use of operators needs to be considered in the context of your company:
- Does your company have an engineering bias that supports the use of Kubernetes or other container-style instances?
- Is your company’s use of Airflow driven more by your data science department, with little engineering support? For them, it might make more sense to use a Python operator or the still-pending R operator.
- Is your company only planning to use Airflow to operate data transfers (SFTP/S3 …) and SQL queries to maintain a data warehouse? For them, using K8s or any container instances would be overkill. This is for example the approach taken at Fetchr, where most of the processing is done in EMR/Presto.
Selecting your operator setup is not one-size-fits-all.
There are quite a few ways to architect your DAGs in Airflow, but as a general rule it is good to keep them simple. Keep within a DAG only the tasks that are truly dependent on each other. When dealing with dependencies across multiple DAGs, abstract them into another DAG and file.
When dealing with a lot of data sources and interdependencies, things can get messy. Setting up DAGs as self-contained files, kept as simple as possible, goes a long way towards keeping your code maintainable. The external task sensor helps separate DAGs and their dependencies into multiple self-contained DAGs.
As in most distributed systems, it is important to make operations as idempotent as possible, at least within a DAG run. Certain operations between DAG runs may rely on the depends_on_past setting.
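To make the idempotency point concrete, here is a minimal sketch of a load step that can be retried safely: it replaces the whole partition for the run date instead of appending to it. It uses an in-memory SQLite database so it runs on its own. The table daily_orders, the function load_partition and the sample rows are all hypothetical.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Replace all rows for run_date so a re-run leaves the same final state."""
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders (run_date, order_id, amount) VALUES (?, ?, ?)",
            [(run_date, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (run_date TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2019-01-01", rows)
load_partition(conn, "2019-01-01", rows)  # simulated retry of the same DAG run

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
print(row_count)
```

Running the load twice for the same date still leaves two rows in the table, which is the property you want when a task gets retried or a DAG run is cleared and re-executed.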
SubDAGs should be used sparingly, for the same reason of code maintainability. For me, one of the only valid reasons to use SubDAGs is the creation of dynamic DAGs.
Communication between tasks, although possible with XCom, should be minimized in favor of self-contained functions/operators. This makes the code more legible and stateless, and unless you want to be able to re-run only this part of the operation, the use of XCom is not justified. Dynamic DAGs are one of the notable exceptions to this.
Airflow leverages Jinja for templating. Commands such as Bash or SQL commands can easily be templated, for execution with variables provided or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g. through a format command). Jinja is the default templating engine for most Flask developers, and can also provide a good bridge for Python web developers getting into data.
Macros provide a way to take further advantage of templating by exposing objects and functions to the templating engine. Users can leverage a set of default macros, or customize their own at global or DAG level.
Using templated code does, however, take you away from vanilla Python and adds one more layer of complexity for engineers who typically already need to leverage quite a large array of technologies and APIs.
Whether or not you choose to leverage templates is a team/personal choice. There are more traditional ways to obtain the same results, for example wrapping the same commands in Python format calls, but templates can make the code more legible.
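As a rough illustration of the non-templated route, the sketch below builds a dated query with a plain Python format call. The comment shows what the same query would look like as a Jinja template using Airflow's ds variable (the execution date as YYYY-MM-DD). The events table and the render_query helper are hypothetical.

```python
from datetime import date

# Jinja version, as it would appear in a templated operator field:
#   SELECT * FROM events WHERE event_date = '{{ ds }}'
QUERY = "SELECT * FROM events WHERE event_date = '{ds}'"


def render_query(execution_date: date) -> str:
    """Build the query in plain Python instead of relying on Jinja rendering."""
    return QUERY.format(ds=execution_date.isoformat())


rendered = render_query(date(2019, 1, 1))
print(rendered)
```

With the format approach you have to get hold of the execution date and pass it in yourself, whereas the templated version has it filled in from the task context at run time.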
Airflow’s REST API allows for the creation of event-driven workflows. The key feature of the API is to let you trigger DAG runs with a specific configuration.
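Here is a minimal sketch of what such a trigger call can look like, assuming the stable REST API of Airflow 2.x, where a DAG run is created with a POST to /api/v1/dags/{dag_id}/dagRuns (older 1.10 releases exposed an experimental endpoint instead). The host, the DAG id process_upload, the payload and the helper build_trigger_request are all hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_trigger_request(base_url, dag_id, conf):
    """Prepare (but do not send) a POST that triggers a DAG run with `conf`."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, DAG id and payload.
request = build_trigger_request(
    "http://localhost:8080",
    "process_upload",
    {"path": "s3://example-bucket/file.csv"},
)
print(request.get_method(), request.full_url)

# Sending it would be urllib.request.urlopen(request), with authentication
# headers added according to how the deployment is secured.
```

Inside the DAG, the payload is then available to tasks through the DAG run's conf, which is what lets a single DAG serve many different requests.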
The REST API allows for building data product applications on top of Airflow, with use cases such as:
- Spinning up clusters and processing based on an HTTP request
- Setting up a workflow based on a message or file appearing in, respectively, a message topic or blob storage
- Building fully fledged machine learning platforms
Leveraging the REST API allows for the construction of complex asynchronous processing patterns, while re-using the same architecture, platform and possibly code that are used for more traditional data processing.