Apache Airflow is a tool for describing, executing and monitoring workflows.
Here are some core concepts you need to know to become productive in Airflow:
In Airflow DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
Airflow will load any DAG object it can import from a DAG file. That means the DAG must appear in globals().
Default arguments are passed to a DAG as default_args dictionary. This makes it easy to apply a common parameter to many operators without having to type it many times.
DAGs can be used as context managers to automatically assign new operators to that DAG.
While DAGs describes how to run a workflow, Operators determine what actually gets done.
Operators do not have to be assigned to DAGs immediately (previously dag was a required argument). However, once an operator is assigned to a DAG, it can not be transferred or unassigned. DAG assignment can be done explicitly when the operator is created, through deferred assignment, or even inferred from other operators.
Airflow official document recommends that you should setup operator relationships with bitshift operators rather than
cross_downstream function provide easier ways to set relationships between operators in a specific situation.
A task is a parameterized instance of an operator
A task that 1) has been assigned to a DAG and 2) has a state associated with a specific run of the DAG
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators.
Some systems can get overwhelmed when too many processes hit them at the same time. Airflow pools can be used to limit the execution of parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving the pools a name and assigning it a number of worker slots.
The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin -> Connections).
Airflow also has the ability to reference connections via environment variables from the operating system. Then connection parameters must be saved in URI format.
If connections with the same conn_id are defined in both Airflow metadata database and environment variables, only the one in environment variables will be referenced by Airflow.
When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified.
The default queue for the environment is defined in the airflow.cfg’s
celery -> default_queue. This defines the queue that tasks get assigned to when not specified, as well as which queue Airflow workers listen to when started.
Workers can listen to one or multiple queues of tasks. When a worker is started (using the command airflow worker), a set of comma-delimited queue names can be specified (e.g. airflow worker -q spark). This worker will then only pick up tasks wired to the specified queue(s).
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of “cross-communication”. XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
Variables are a generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.
Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition which is typically related to something that happened in an upstream task. One way to do this is by using BranchPythonOperator.
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
This SubDAG can then be referenced in your main DAG file:
Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.