DEV Community

loading...
Cover image for Apache Airflow Core Concepts

Apache Airflow Core Concepts

Zahidul Islam
Data Engineer with 10+ years of experience in Full Stack development
・4 min read

Apache Airflow is a tool for describing, executing and monitoring workflows.

Here are some core concepts you need to know to become productive in Airflow:

DAG (Directed Acyclic Graph)

In Airflow DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

Alt Text

Alt Text

Scope

Airflow will load any DAG object it can import from a DAG file. That means the DAG must appear in globals().

Alt Text

Default Arguments

Default arguments are passed to a DAG as default_args dictionary. This makes it easy to apply a common parameter to many operators without having to type it many times.

Alt Text

Context Manager

DAGs can be used as context managers to automatically assign new operators to that DAG.

Alt Text

Operators

While DAGs describes how to run a workflow, Operators determine what actually gets done.

Alt Text

DAG Assignment

Operators do not have to be assigned to DAGs immediately (previously dag was a required argument). However, once an operator is assigned to a DAG, it can not be transferred or unassigned. DAG assignment can be done explicitly when the operator is created, through deferred assignment, or even inferred from other operators.

Alt Text

Bitshift Composition

Airflow official document recommends that you should setup operator relationships with bitshift operators rather than set_upstream() and set_downstream()

Alt Text

Relationship Helper

chain and cross_downstream function provide easier ways to set relationships between operators in a specific situation.

Alt Text

Tasks

A task is a parameterized instance of an operator

Task Instances

A task that 1) has been assigned to a DAG and 2) has a state associated with a specific run of the DAG

Hooks

Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators.

Alt Text

Pools

Some systems can get overwhelmed when too many processes hit them at the same time. Airflow pools can be used to limit the execution of parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving the pools a name and assigning it a number of worker slots.

Alt Text

Connections

The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin -> Connections).

Airflow also has the ability to reference connections via environment variables from the operating system. Then connection parameters must be saved in URI format.

If connections with the same conn_id are defined in both Airflow metadata database and environment variables, only the one in environment variables will be referenced by Airflow.

Queues

When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified.

The default queue for the environment is defined in the airflow.cfg’s celery -> default_queue. This defines the queue that tasks get assigned to when not specified, as well as which queue Airflow workers listen to when started.

Workers can listen to one or multiple queues of tasks. When a worker is started (using the command airflow worker), a set of comma-delimited queue names can be specified (e.g. airflow worker -q spark). This worker will then only pick up tasks wired to the specified queue(s).

XComs

XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of “cross-communication”. XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.

Alt Text

Variables

Variables are a generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.

Alt Text

Branching

Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition which is typically related to something that happened in an upstream task. One way to do this is by using BranchPythonOperator.

Alt Text

SubDAGs

SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.

Alt Text

This SubDAG can then be referenced in your main DAG file:

Alt Text

Plugins

Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.

Alt Text

References

Discussion (0)