TL;DR: Airflow is robust and flexible, but complicated. If you are just starting to schedule data tasks, you may want to try more tailored solutions:
- Moving data into a warehouse: Stitch
- Transforming data within a warehouse: DBT
- Scheduling Python scripts: Databricks
- Batch scoring ML models: Booklet.ai
How could using 4 different services be easier than using just one?
Apache Airflow is one of the most popular workflow management tools for data teams. It is used by hundreds of companies around the world to schedule jobs of any kind. It is a completely free, open source project, and offers amazing flexibility with its python-built infrastructure.
I’ve used (and sometimes set up) Airflow instances of all sizes: from Uber’s custom-built Airflow-based Piper to small instances for side projects and there is one theme in common: projects get complicated, fast! Airflow needs to be deployed in a stable and production-ready way, all tasks are custom-defined in Python, and there are many watch-outs to keep in mind as you are building tasks. For a less technical user, Airflow can be an overwhelming system just to schedule some simple tasks.
Although it may be tempting to use one tool for all of the different scheduling needs, that may not always be your best choice. You’ll end up building custom solutions every time a new use-case comes up. Instead, you should use the best tool for the job you are trying to accomplish. The time saved during the setup and maintenance for each use-case is well worth adding a few more tools to your data stack.
In this post, I’ll outline a few of the use cases for Airflow and alternatives for each.
Disclaimer: I am one of the founders of Booklet.ai.
Data teams need something to do their jobs…. Data! Many times there are multiple different internal and external sources for this data across disparate sources. To pull all of this data into one single place, the team needs to extract this data from all of these sources and plug them into one single location. This is usually a data warehouse of some kind.
For this step, there are many reliable tools that are being used around the globe. They extract data from a given set of systems on a regular cadence and send those results directly to a data warehouse. These systems handle most error handling and keep things running smoothly. Managing multiple, complicated integrations can prove to be a maintenance nightmare, so these tools can save a lot of time. Luckily, there is a nightmare-saving option:
Stitch coins itself as “a cloud-first, open source platform for rapidly moving data.” You can quickly connect to databases and third party tools and send that data to multiple different data warehouses. The best part: the first 5 million rows are free! Stitch can also be extended with a few open source frameworks.
Once that data is loaded up into a data warehouse, it’s usually a mess! Every source had a different structure and each dataset is probably indexed in a different way with a different set of identifiers. To make sense of this chaos, the team needs to transform and join all of this data into a nice, clean form that is easier to use. Most of the logic will happen directly within the data warehouse.
The process of combining all of these datasets into a form that a business can actually use can be a tedious task. It has become such a complex field, that the specific role of Analytics Engineer has emerged from it. These are a common set of problems across the industry and a tool has emerged to solve these problems specifically:
DBT considers itself “your entire analytics engineering workflow” and I agree. With only knowing SQL, you can quickly build multiple, complex layers of data transformation jobs that will be fully managed. Version control, testing, documentation and so much else is all managed for you! The cloud-hosted version is free to use.
Sometimes the team will also need to transform data outside of the data warehouse. What if the transformation logic can’t operate completely within SQL? What if the team needs to train a Machine Learning model? These tasks might pull data from the data warehouse directly, but the actual tasks need to operate in a different system, such as python.
Most custom python-based scripts usually start as a Jupyter Notebook. You import a few packages, import or extract data, run some functions, and finally push that data somewhere else. Sometimes more complicated, production-scale processes are needed, but that’s rare. If you just need a simple way to run and schedule a python notebook there’s a great option:
Databricks was created by the original founders of Spark. Its main claim-to-fame is spinning up spark clusters super easily, but it also has great notebook functionality. They offer easy to use Python notebooks, where you can collaborate within the notebook just like Google docs. Once you develop the script that works for you, you can schedule that notebook to run completely within the platform. It’s a great way to not worry about where the code is running and have an easy way to schedule those tasks. They have a free community edition.
If the team has built a machine learning model, the results of that model should be sent to a place where it can actually help the business. These tasks usually involve connecting to an existing Machine Learning model and then sending the results from that model to another tool, such as a Sales or Marketing tool. This can be a ridiculously tedious task to get a system up and running that actually pushes the correct model results at the right time.
Building a machine learning model is hard enough, it shouldn’t take another 2 months of custom coding to connect that model to a place where the business can find value in it. This work usually requires painful integration to third party systems, not to mention the production-level infrastructure work that is required! Thankfully, there is a solution that handles some of these tasks for you:
Booklet.ai connects to existing ML Models and then allows you to quickly set up a few things: A demo to share the model with non technical users, an easy API endpoint connection, and a set of integrations to connect inputs and outputs. You can easily set up an input query from a data warehouse to score the model, and then send those results to a variety of tools that your business counterparts may use. You can check out a demo for a lead-scoring model that sends results to intercom . You can request access to the Booklet.ai beta , where your first model will be free.