Continuing a series of posts in which I covered the basics of Airflow, how to set it up on Azure, and what to consider when using it, I want to cover in detail what makes Airflow such a great tool for data processing.
DAGs are a way to set up workflows: they define a sequence of operations, each of which can be individually retried on failure and restarted from the point where it failed. DAGs provide a nice abstraction over a series of operations.
Dynamic DAGs can, for instance, be set up based on variables or connections defined within the Airflow UI.
Airflow comes with a lot of operators to run code. It has operators for most databases and, being written in Python, it offers a PythonOperator that makes it quick to port Python code to production.
Papermill is an extension to Jupyter notebooks that allows their parametrization and execution; it is supported in Airflow through the PapermillOperator. Netflix has notably suggested combining Airflow and Papermill to automate and deploy notebooks in production:
Airflow is extremely good at managing different sorts of dependencies, be it a task completion, a DAG run's status, or the presence of a file or partition, through specific sensors. It also handles task dependency concepts such as branching.
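Conceptually, a sensor just re-checks a condition at a fixed interval until it succeeds or times out. A stripped-down, Airflow-free sketch of that poke loop:

```python
import time

def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Re-evaluate `condition` every `poke_interval` seconds, mimicking
    how an Airflow sensor pokes until success or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("condition never became true")

# Example: wait until a flag flips, standing in for a file or partition check.
state = {"polls": 0}

def check():
    state["polls"] += 1
    return state["polls"] >= 3  # becomes true on the third poke

wait_for(check, poke_interval=0.01, timeout=1.0)
```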
It is fully extensible through the development of custom sensors, hooks, and operators, and notably benefits from a large number of community-contributed operators.
Airflow provides a monitoring and management interface that gives a quick overview of the status of the different tasks, and makes it possible to trigger or clear task and DAG runs.
It has a built-in auto-retry policy, configurable through:
- retries: number of retries before failing the task
- retry_delay: (timedelta) delay between retries
- retry_exponential_backoff: (boolean) whether to use an exponential backoff between retries
- max_retry_delay: (timedelta) maximum delay between retries
These arguments can be passed to any operator, or to all of a DAG's tasks at once through default_args, as they are supported by the BaseOperator class.
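To make the interplay of these arguments concrete, here is a rough sketch of the delay schedule exponential backoff produces: the delay doubles at each attempt and is capped at max_retry_delay (Airflow's actual implementation also adds jitter):

```python
from datetime import timedelta

def backoff_schedule(retries, retry_delay, max_retry_delay):
    """Approximate delay before each retry with exponential backoff:
    retry_delay doubles on each attempt, capped at max_retry_delay."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        delays.append(min(delay, max_retry_delay))
    return delays

schedule = backoff_schedule(
    retries=4,
    retry_delay=timedelta(minutes=1),
    max_retry_delay=timedelta(minutes=5),
)
# 1 min, 2 min, 4 min, then capped at 5 min
```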
Airflow provides easy access to the logs of each task run through its web UI, making it easy to debug tasks in production.
Airflow’s API allows workflows to be created from external sources, and data products to be built on top of it:
The REST API makes it possible to reuse the same paradigm used to build pipelines to create asynchronous workflows, such as custom machine learning training operations.
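As a sketch, triggering a DAG run from an external system boils down to a single POST against the stable REST API (Airflow 2.x); the base URL and DAG id here are placeholders, and authentication headers are omitted:

```python
import json
from urllib import request

def build_trigger_request(base_url, dag_id, conf=None):
    """Build (but do not send) a POST request that triggers a DAG run
    through Airflow's stable REST API."""
    payload = json.dumps({"conf": conf or {}}).encode()
    return request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request(
    "http://localhost:8080", "train_model", conf={"model": "v2"}
)
# request.urlopen(req) would send it, kicking off the run asynchronously.
```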
It provides a default alerting system on task failure; email is the default, but alerting through Slack can be set up using a callback and the Slack operator:
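A sketch of the callback side of that setup: the function below formats a Slack message from the context dict Airflow passes to an `on_failure_callback`; actually sending it would go through the Slack webhook operator or hook. The `FakeTI` class only stands in for a real task instance:

```python
def slack_failure_message(context):
    """Format a Slack alert from the context dict that Airflow passes
    to an on_failure_callback."""
    ti = context["task_instance"]
    return (
        f":red_circle: Task failed\n"
        f"DAG: {ti.dag_id}\n"
        f"Task: {ti.task_id}\n"
        f"Log: {ti.log_url}"
    )

# Stand-in for the task instance Airflow would put in the context:
class FakeTI:
    dag_id = "example_pipeline"
    task_id = "transform"
    log_url = "http://localhost:8080/log"

message = slack_failure_message({"task_instance": FakeTI()})
```

The DAG would then set `on_failure_callback` to a function that builds this message and hands it to the Slack operator.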
More from me on Hacking Analytics:
- On the evolution of Data Engineering
- Overview of efficiency concepts in Big Data Engineering
- Setting up Airflow on Azure & connecting to MS SQL Server
- Airflow, the easy way