In continuation of a series of posts, where I have explained the basics of Airflow, how to set up Airflow on Azure, and what considerations to have when using it, I wanted to cover in more detail what makes Airflow a great tool to use for data processing.
1. DAGs:
DAGs are a way to set up workflows: they define a sequence of operations that can be individually retried on failure and restarted from the point where the operation failed. DAGs provide a nice abstraction over a series of operations.
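To make this concrete, here is a minimal sketch of a two-task DAG (the DAG id, task commands and Airflow 1.10-style import paths are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A simple DAG: extract then load, each task retryable on its own.
dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo 'extracting data'",
    dag=dag,
)

load = BashOperator(
    task_id="load",
    bash_command="echo 'loading data'",
    dag=dag,
)

# If 'load' fails, it can be cleared and retried without re-running 'extract'.
extract >> load
```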
2. Programmatic Workflow Management:
Airflow provides a way to set up programmatic workflows. Tasks, for instance, can be generated on the fly within a DAG, while Sub-DAGs and XComs allow for the creation of complex dynamic workflows.
Dynamic DAGs can, for instance, be set up based on variables or connections defined within the Airflow UI.
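As a sketch of dynamic task generation, the example below assumes a hypothetical Variable named "tables_to_sync" defined in the Airflow UI and generates one task per table listed in it (imports follow Airflow 1.10):

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

# Hypothetical Variable defined in the Airflow UI, e.g. "customers,orders".
tables = Variable.get("tables_to_sync", default_var="customers,orders").split(",")

dag = DAG(
    dag_id="dynamic_sync",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

def sync_table(table_name):
    # Placeholder for the real sync logic.
    print("syncing {}".format(table_name))

# One task is generated on the fly for each table listed in the variable.
for table in tables:
    PythonOperator(
        task_id="sync_{}".format(table),
        python_callable=sync_table,
        op_kwargs={"table_name": table},
        dag=dag,
    )
```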
3. Automate your Queries, Python Code or Jupyter Notebook
Airflow has a lot of operators set up to run code. It includes operators for most databases, and, being written in Python, it offers a PythonOperator that allows quickly porting Python code to production.
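As an illustration of pairing a database operator with the PythonOperator, the sketch below assumes a hypothetical "analytics_db" Postgres connection and placeholder SQL (Airflow 1.10-style imports):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="query_and_python",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Run a SQL query against a database defined as an Airflow connection
# (the "analytics_db" connection and the SQL are placeholders).
aggregate = PostgresOperator(
    task_id="aggregate_daily",
    postgres_conn_id="analytics_db",
    sql="INSERT INTO daily_summary SELECT date, count(*) FROM events GROUP BY date;",
    dag=dag,
)

def notify(**context):
    # Existing Python code can be dropped into a callable like this.
    print("aggregation finished for {}".format(context["ds"]))

notify_task = PythonOperator(
    task_id="notify",
    python_callable=notify,
    provide_context=True,
    dag=dag,
)

aggregate >> notify_task
```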
Papermill is an extension to Jupyter Notebook that allows the parametrization and execution of notebooks; it is supported in Airflow through the PapermillOperator. Netflix notably has suggested a combination of Airflow and Papermill to automate and deploy notebooks in production:
Part 2: Scheduling Notebooks at Netflix
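A minimal sketch of PapermillOperator usage, assuming placeholder notebook paths and an Airflow 1.10-style import path (it may differ in other versions):

```python
from datetime import datetime

from airflow import DAG
# Import path used in Airflow 1.10; it differs in later versions.
from airflow.operators.papermill_operator import PapermillOperator

dag = DAG(
    dag_id="notebook_report",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Execute a parametrized notebook; input/output paths are placeholders.
run_notebook = PapermillOperator(
    task_id="run_report_notebook",
    input_nb="/notebooks/report.ipynb",
    output_nb="/notebooks/output/report_{{ ds }}.ipynb",
    parameters={"run_date": "{{ ds }}"},
    dag=dag,
)
```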
4. Task Dependency management:
It is extremely good at managing different sorts of dependencies, be it task completion, DAG run status, or file or partition presence, through its specific sensors. Airflow also handles task dependency concepts such as branching.
Use conditional tasks with Apache Airflow
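Here is a minimal branching sketch with the BranchPythonOperator, where a hypothetical callable picks a weekday or weekend branch (Airflow 1.10-style imports):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG(
    dag_id="branch_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

def choose_branch(**context):
    # Return the task_id to follow; the other branch gets skipped.
    if context["execution_date"].weekday() < 5:
        return "weekday_processing"
    return "weekend_processing"

branch = BranchPythonOperator(
    task_id="branch",
    python_callable=choose_branch,
    provide_context=True,
    dag=dag,
)

weekday = DummyOperator(task_id="weekday_processing", dag=dag)
weekend = DummyOperator(task_id="weekend_processing", dag=dag)

branch >> [weekday, weekend]
```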
5. Extendable model:
It is fully extendable through the development of custom sensors, hooks and operators. Airflow notably benefits from a large number of community-contributed operators.
Operators for other programming languages such as R [AIRFLOW-2193] are being built using Python wrappers; in the future, operators for other languages that also have Python wrappers, such as JavaScript (pyv8), could be created as well.
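A custom operator only needs to subclass BaseOperator and implement execute(); the sketch below is a hypothetical "HelloOperator" for illustration:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class HelloOperator(BaseOperator):
    """Hypothetical custom operator, shown for illustration only."""

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(HelloOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # execute() is the one method a custom operator has to implement.
        self.log.info("Hello %s", self.name)
        return self.name
```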
6. Monitoring and management interface:
Airflow provides a monitoring and management interface, where it is possible to get a quick overview of the status of the different tasks, as well as to trigger and clear task or DAG runs.
7. Retry policy built in:
It has an auto-retry policy built in, configurable through:
- retries: number of retries before failing the task
- retry_delay: (timedelta) delay between retries
- retry_exponential_backoff: (boolean) whether to apply an exponential backoff between retries
- max_retry_delay: (timedelta) maximum delay between retries
These arguments can be passed to any operator, for example through a DAG's default_args, as they are supported by the BaseOperator class.
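For example, a minimal sketch of sharing these retry settings across a DAG's tasks through default_args (the DAG id and command are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Retry settings shared by every task in the DAG through default_args.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}

dag = DAG(
    dag_id="retry_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
)

# Placeholder command standing in for a flaky external call.
flaky_task = BashOperator(
    task_id="call_external_api",
    bash_command="curl --fail https://example.com/endpoint",
    dag=dag,
)
```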
8. Easy interface to interact with logs:
Airflow provides easy access to the logs of each task run through its web UI, making it easy to debug tasks in production.
9. REST API:
Airflow’s API makes it possible to create workflows from external sources, and to build data products on top of it:
Using Airflow Experimental Rest API on Google Cloud Platform: Cloud Composer and IAP
The REST API allows the same paradigm used to build pipelines to be applied to creating asynchronous workflows, such as custom machine learning training operations.
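For instance, a minimal sketch of triggering a DAG run through the experimental REST API, assuming the webserver is reachable on localhost:8080, the experimental API is enabled, and "model_training" is a placeholder DAG id:

```python
import requests

# Trigger a DAG run through the experimental REST API.
# Assumes the webserver is on localhost:8080 with the experimental API
# enabled; "model_training" is a placeholder DAG id.
url = "http://localhost:8080/api/experimental/dags/model_training/dag_runs"

response = requests.post(
    url,
    json={"conf": {"model": "churn", "learning_rate": 0.1}},
)
response.raise_for_status()
print(response.json())
```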
10. Alerting system:
It provides a default alerting system on task failure; email is the default, but alerting through Slack can be set up using a callback and the Slack operator:
Integrating Slack Alerts in Airflow
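A minimal sketch of such a Slack failure callback, assuming a hypothetical "slack_webhook" HTTP connection defined in the Airflow UI and the Airflow 1.10 contrib import path:

```python
# Airflow 1.10 contrib import path; it differs in later versions.
from airflow.contrib.operators.slack_webhook_operator import SlackWebhookOperator

def slack_failure_alert(context):
    """Failure callback posting to Slack through a hypothetical
    "slack_webhook" HTTP connection defined in the Airflow UI."""
    message = "Task {task} failed in DAG {dag} at {ts}".format(
        task=context["task_instance"].task_id,
        dag=context["task_instance"].dag_id,
        ts=context["execution_date"],
    )
    alert = SlackWebhookOperator(
        task_id="slack_failure_alert",
        http_conn_id="slack_webhook",
        message=message,
    )
    return alert.execute(context=context)

# Attach it to tasks, e.g. via default_args:
# default_args = {"on_failure_callback": slack_failure_alert}
```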
More from me on Hacking Analytics:
- On the evolution of Data Engineering
- Overview of efficiency concepts in Big Data Engineering
- Setting up Airflow on Azure & connecting to MS SQL Server
- Airflow, the easy way