If you work in Big Data, you’ve most likely heard of Apache Airflow. It started as an open-source project at Airbnb in 2014 to help the company handle its batch data pipelines. Since then, it has become one of the most popular open-source workflow management platforms within data engineering, receiving public contributions from organizations like Lyft, Walmart, and Bloomberg.
Apache Airflow is written in Python, which enables flexibility and robustness. Its powerful and well-equipped user interface simplifies workflow management tasks, like tracking jobs and configuring the platform. Since it relies on code to define its workflows, users can write what code they want to execute at every step of the process. Apache Airflow doesn’t enforce restrictions on how you arrange your workflows, making for a very customizable and desirable experience.
Thousands of companies use Apache Airflow, and that number continues to grow. Today, we're going to explore the basics and fundamentals of this popular tool, along with the first steps you can take to get started.
- What is Apache Airflow?
- Why use Apache Airflow?
- Fundamentals of Apache Airflow
- How does Apache Airflow work?
- First steps for working with Apache Airflow
Apache Airflow is a robust scheduler for programmatically authoring, scheduling, and monitoring workflows. It’s designed to handle and orchestrate complex data pipelines. It was initially developed to tackle the problems that come with long-running cron tasks and substantial scripts, but it has grown to be one of the most powerful data pipeline platforms on the market.
We can describe Airflow as a platform for defining, executing, and monitoring workflows. We can define a workflow as any sequence of steps you take to achieve a specific goal. A common issue in growing Big Data teams is the limited ability to stitch together related jobs in an end-to-end workflow. Before Airflow, there was Oozie, but it came with many limitations; Airflow has since surpassed it for complex workflows.
Airflow is also a code-first platform, designed with the idea that data pipelines are best expressed as code. It was built to be extensible, with available plugins that allow interaction with many common external systems, along with the ability to build your own plugins if you want. It has the capability to run thousands of different tasks per day, streamlining workflow management.
Airflow is used in many industries:
- Big Data
- Machine learning
- Computer software
- Financial services
- IT services
Listed below are some of the differences between Airflow and other workflow management platforms.
- Directed Acyclic Graphs (DAGs) are written in Python, which has a smooth learning curve and is more widely used than Java, which is used by Oozie.
- There’s a big community that contributes to Airflow, which makes it easy to find integration solutions for major services and cloud providers.
- Airflow is versatile, expressive, and built to create complex workflows. It provides advanced metrics on workflows.
- Airflow has a rich API and an intuitive user interface in comparison to other workflow management platforms.
- Its use of Jinja templating allows for use cases such as referencing a filename that corresponds to the date of a DAG run.
- There are managed Airflow cloud services, such as Google Composer and Astronomer.io.
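To make the Jinja templating point above concrete, here is a minimal, framework-free sketch of rendering a date-stamped filename. The template syntax mirrors Airflow's `{{ ds }}` macro (the DAG run's logical date), but the tiny renderer below is a toy stand-in for illustration, not Airflow's actual Jinja engine.

```python
from datetime import date

def render(template: str, context: dict) -> str:
    """Toy stand-in for Jinja rendering: substitute {{ key }} placeholders."""
    for key, value in context.items():
        template = template.replace("{{ " + key + " }}", str(value))
    return template

# "ds" is Airflow's macro for the run's logical date as YYYY-MM-DD.
context = {"ds": date(2023, 5, 1).isoformat()}
filename = render("data/{{ ds }}/events.csv", context)
print(filename)  # data/2023-05-01/events.csv
```

In real Airflow, templated fields on an operator (such as a `BashOperator` command) are rendered this way automatically at run time.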
In this section, we’ll look at some of the pros and cons of Airflow, along with some notable use cases.
- Open-source: You can download Airflow and start using it instantly and you can work with peers in the community.
- Integration with the cloud: Airflow runs well in cloud environments, giving you a lot of options.
- Scalable: Airflow is highly scalable up and down. It can be deployed on a single server or scaled up to large deployments with numerous nodes.
- Flexible and customizable: Airflow was built to work with the standard architecture of most software development environments, but its flexibility allows for a lot of customization opportunities.
- Monitoring abilities: Airflow allows for diverse ways of monitoring. For example, you can view the status of your tasks from the user interface.
- Code-first platform: This reliance on code gives you the freedom to write what code you want to execute at every step of the pipeline.
- Community: Airflow’s large and active community helps scale information and gives an opportunity to connect with your peers.
- Reliance on Python: While many find it to be a good thing that Airflow relies so heavily on Python code, those without much experience working with Python may have a steeper learning curve.
- Glitches: While Airflow is usually reliable, there can be glitches like any product.
Airflow can be used for nearly all batch data pipelines, and there are many different documented use cases, the most common being Big Data projects. Here are some examples of use cases listed in Airflow’s GitHub repository:
- Using Airflow with Google BigQuery to power a Data Studio dashboard
- Using Airflow to help architect and govern a data lake on AWS
- Using Airflow to tackle upgrades to production systems while minimizing downtime
Now that we’ve discussed the basics of Airflow along with benefits and use cases, let’s dive into the fundamentals of this robust platform.
Workflows are defined using Directed Acyclic Graphs (DAGs), which are composed of tasks to be executed along with their connected dependencies. Each DAG represents a group of tasks you want to run, and they show relationships between tasks in Apache Airflow’s user interface.
Let’s break down the acronym:
- Directed: if you have multiple tasks with dependencies, each needs at least one specified upstream or downstream task.
- Acyclic: tasks aren’t allowed to form cycles, such as a task that depends on its own output. This avoids the possibility of producing an infinite loop.
- Graph: tasks are in a logical structure with clearly defined processes and relationships to other tasks. For example, we can use a DAG to express the relationship between three tasks: X, Y, and Z. We could say, “execute Y only after X is executed, but Z can be executed independently at any time.” We can define additional constraints, like the number of retries to execute for a failing task and when to begin a task.
Note: A DAG defines how to execute the tasks, but doesn’t define what particular tasks do.
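The X, Y, Z relationship above can be sketched without Airflow at all. This toy pure-Python example (using the standard library's `graphlib`, not Airflow's API) derives a valid execution order from declared dependencies:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# "Execute Y only after X is executed; Z can run independently at any time."
# Each key maps a task to the set of tasks it depends on.
deps = {"X": set(), "Y": {"X"}, "Z": set()}

order = list(TopologicalSorter(deps).static_order())
print(order)  # a valid order, e.g. X before Y, with Z anywhere
```

Any order this produces respects the "directed" and "acyclic" properties: X always precedes Y, and a cycle in `deps` would raise a `CycleError` instead of looping forever.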
A DAG can be specified by instantiating an object of `airflow.models.dag.DAG`, as shown in the example below. The DAG will show in the web server’s UI as “Example1” and will run once.

```python
from airflow.models import DAG
from airflow.utils.dates import days_ago

dag = DAG('Example1', schedule_interval='@once', start_date=days_ago(1))
```
When a DAG is executed, it’s called a *DAG run*. Let’s say you have a DAG scheduled to run every hour. Each instantiation of that DAG establishes a DAG run. There can be multiple DAG runs connected to a DAG running at the same time.
Tasks are instantiations of operators, and they vary in complexity. You can picture them as units of work represented by nodes in a DAG. They represent the work done at each step of your workflow, with the actual work they perform being defined by operators.
While DAGs define the workflow, operators define the work. An operator is like a template or class for executing a particular task.
All operators originate from `BaseOperator`. There are operators for many general tasks, such as:

- `PythonOperator`
- `MySqlOperator`
- `EmailOperator`
- `BashOperator`

These operators are used to specify actions to execute in Python, MySQL, email, or bash.
There are three main types of operators:
- Operators that carry out an action or request a different system to carry out an action
- Operators that move data from one system to another
- Operators that run until certain conditions are met
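Since an operator is essentially a class with an `execute` method, the pattern can be sketched in plain Python. The classes below are simplified stand-ins for illustration, not Airflow's actual `BaseOperator`:

```python
class BaseOperatorSketch:
    """Simplified stand-in for Airflow's BaseOperator."""
    def __init__(self, task_id: str):
        self.task_id = task_id

    def execute(self, context: dict):
        """Subclasses define the actual work here."""
        raise NotImplementedError

class EchoOperator(BaseOperatorSketch):
    """An 'action' operator: performs its work directly when executed."""
    def __init__(self, task_id: str, message: str):
        super().__init__(task_id)
        self.message = message

    def execute(self, context: dict) -> str:
        return f"{self.task_id}: {self.message}"

task = EchoOperator(task_id="greet", message="hello")
print(task.execute(context={}))  # greet: hello
```

In real Airflow, the scheduler and workers call `execute` for you, passing a context with run-specific metadata.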
Hooks allow Airflow to interface with third-party systems. With hooks, you can connect to outside databases and APIs, such as MySQL, Hive, GCS, and more. They’re like building blocks for operators. Hooks don’t contain secure information such as credentials; that’s stored within Airflow’s encrypted metadata database.
Note: Apache Airflow has community-maintained packages that include the core hooks for services such as Google and Amazon. These can be directly installed in your Airflow environment.
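A hook is essentially a thin wrapper that centralizes connection logic so operators don't have to repeat it. As a rough illustration, here's a toy hook using the standard library's `sqlite3` as a stand-in for a real database; this is not one of Airflow's hook classes:

```python
import sqlite3

class SqliteHookSketch:
    """Toy hook: owns connection details so operators can stay generic."""
    def __init__(self, database: str):
        self.database = database

    def get_conn(self) -> sqlite3.Connection:
        return sqlite3.connect(self.database)

    def get_first(self, sql: str):
        """Run a query and return the first row, like Airflow's DB hooks do."""
        with self.get_conn() as conn:
            return conn.execute(sql).fetchone()

hook = SqliteHookSketch(":memory:")
row = hook.get_first("SELECT 1 + 1")
print(row)  # (2,)
```

In real Airflow, a hook would instead look up its connection details (host, credentials) from the metadata database by connection ID.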
Airflow excels at defining complex relationships between tasks. Let’s say we want to designate that task `t1` executes before task `t2`. There are four different statements we could use to define this exact relationship:

```python
t1 >> t2
t2 << t1
t1.set_downstream(t2)
t2.set_upstream(t1)
```
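The bitshift syntax works because Airflow's operators override Python's `__rshift__` and `__lshift__` methods. A minimal sketch of that mechanism, using a hypothetical `Task` class rather than Airflow's implementation:

```python
class Task:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other: "Task") -> "Task":
        # t1 >> t2 means t2 runs after t1
        self.downstream.append(other)
        return other  # returning `other` allows chains like t1 >> t2 >> t3

    def __lshift__(self, other: "Task") -> "Task":
        # t1 << t2 means t1 runs after t2
        other.downstream.append(self)
        return other

t1, t2 = Task("t1"), Task("t2")
t1 >> t2
print([t.task_id for t in t1.downstream])  # ['t2']
```

Returning the right-hand task from `__rshift__` is what makes longer chains such as `t1 >> t2 >> t3` read naturally.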
There are four main components that make up this robust and scalable workflow scheduling platform:
The scheduler monitors all DAGs and their associated tasks. When dependencies for a task are met, the scheduler will initiate the task. It periodically checks active tasks to initiate.
The web server is Airflow’s user interface. It shows the status of jobs and allows the user to interact with the databases and read log files from remote file stores, like S3, Google Cloud Storage, Microsoft Azure blobs, etc.
The state of the DAGs and their associated tasks is saved in the database to ensure the scheduler remembers metadata information. Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to connect to the metadata database.
The scheduler examines all of the DAGs and stores pertinent information, like schedule intervals, statistics from each run, and task instances.
The executor decides how work gets done. There are different types of executors to use for different use cases.
Examples of executors:
- `SequentialExecutor`: This executor can run only a single task at any given time. It cannot run tasks in parallel. It’s helpful in testing or debugging situations.
- `LocalExecutor`: This executor enables parallelism and hyperthreading. It’s great for running Airflow on a local machine or a single node.
- `CeleryExecutor`: This executor is the favored way to run a distributed Airflow cluster.
- `KubernetesExecutor`: This executor calls the Kubernetes API to create temporary pods for each of the task instances to run.
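The executor is chosen in Airflow's configuration file. The sketch below shows the general shape of an `airflow.cfg` entry; section names vary slightly between Airflow versions, and the connection string is a placeholder you'd replace with your own:

```ini
[core]
# Which executor class Airflow should use.
executor = LocalExecutor

[database]
# LocalExecutor needs a database that supports parallel connections,
# e.g. PostgreSQL rather than the default SQLite.
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost/airflow
```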
So, how does Airflow work?
Airflow examines all the DAGs in the background at a certain interval. This interval is set using the `processor_poll_interval` config option and defaults to one second. Once a DAG file is examined, DAG runs are created according to the scheduling parameters. Task instances are instantiated for tasks that need to be performed, and their status is set to `SCHEDULED` in the metadata database.
The scheduler queries the database, retrieves tasks in the `SCHEDULED` state, and distributes them to the executors. The state of each task then changes to `QUEUED`. Workers pull the queued tasks from the queue and execute them. When this happens, the task status changes to `RUNNING`.
When a task finishes, the worker will mark it as failed or finished, and then the scheduler updates the final status in the metadata database.
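The state transitions described above can be summarized in a small sketch. The state names follow Airflow's, but the state machine itself is a simplification for illustration:

```python
# Simplified happy-path lifecycle of a task instance.
ALLOWED = {
    "scheduled": {"queued"},           # scheduler hands the task to an executor
    "queued": {"running"},             # a worker picks the task up
    "running": {"success", "failed"},  # the worker reports the outcome
}

def advance(state: str, new_state: str) -> str:
    """Move to new_state if the transition is legal; otherwise raise."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

state = "scheduled"
for nxt in ("queued", "running", "success"):
    state = advance(state, nxt)
print(state)  # success
```

Real Airflow tracks additional states (such as `up_for_retry` and `skipped`), but this captures the core scheduler-to-worker handoff described above.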
Now that you know the basics of Apache Airflow, you’re ready to get started! A great way to learn this tool is to build something with it. After downloading Airflow, you can design your own project or contribute to an open-source project online.
Some interesting open-source projects:
- A plugin that lets you edit DAGs in your browser
- Docker Apache Airflow
- Dynamically generate DAGs from YAML config files
- And more
There’s still so much more to learn about Airflow. Some recommended topics to cover next are:
- Airflow sensors
To get started learning these topics, check out Educative’s course, An Introduction to Apache Airflow. This curated, hands-on course covers the building blocks of Apache Airflow, along with more advanced aspects, like XCom, operators and sensors, and working with the UI. You’ll master this highly coveted workflow management platform and earn a valuable certificate by the end.
The skills you gain in this course will give you a leg up on your journey!