Decouple your DAGs with an event-driven architecture on AWS

#aws #eventdriven #serverless

Introduction

Applying domain-driven design and an event-driven architecture to the orchestration of our services has given our teams some very practical benefits in their day-to-day work on development and support.

The Problem

Up until recently, we ran our main scoring job in one big DAG running in Airflow. This DAG calls services developed and maintained by at least 3 separate teams. With this setup, we were tightly coupling our systems, our processes (on-call, support) and our development and technology choices.

In practice, here are a couple of real world problems we were running into:

1) The upstream team had several variations of their steps within the DAG. Each variation needed to have our teams steps copied and maintained in separate DAGs. Decoupling allows us to keep all our steps in a single DAG and to know where exactly our services are being orchestrated from.
2) The decision to use Airflow was made by the upstream team as it made sense for their skills and technologies. Decoupling will also allow us to use a technology that may be better suited to our teams skills and technologies. For example, we could move to Step Functions if we wanted. We will not be bound to another team or domains technology choices.
3) In addition to being on a mailing list for all alerts from the DAG, having to troubleshoot any failure may involve going through the larger DAG. While this may seem minor, situations like these can take a toll on a developer's productivity. Having our own separate DAG allows us to focus on our own services.

The Solution

Domain Driven Design

The first step was to identify the different domains represented within the larger DAG. From the outside, these may seem simple. The core services can be easy to identify but the boundaries are harder to identify. Where does one domain end and another begin? Our criteria for each domain was resolved around which team supported the service called. In addition, it was agreed the upstream domain was responsible for publishing an event when a material step in the DAG had completed. The downstream domain was responsible to consume that event. Using these guidelines, we were able to split the DAG out into 3 separate domains.

Communication between domains

We knew we needed to communicate between domains. This communication would involve more than just a marker to say that an event had happened. We also needed to pass some parameters between domains. These parameters were necessary to the execution of the end-to-end flow and needed to be passed from domain to domain.

The term event-driven has become ubiquitous in modern software development but what does it mean? What exactly is an event? According to AWS

An event is a change in state, or an update, like an item being placed in a shopping cart on an e-commerce website. Events can either carry the state (the item purchased, its price, and a delivery address) or events can be identifiers (a notification that an order was shipped).

Using this definition, we would able to use the event to pass information from one domain to another.

Technical solution

While our Airflow clusters are hosted on-premise, we decided early on that we wanted to use AWS services to publish and subscribe to events. We have an internal goal to host our services on AWS and to use a serverless service where we can. Ultimately, the SNS Fanout to SQS pattern fitted well to our requirements. For more information on this pattern, see this post on the AWS blog.

https://aws.amazon.com/blogs/compute/enriching-event-driven-architectures-with-aws-event-fork-pipelines/

This pattern allows us to separate out the publisher and subscriber into distinct services. The upstream service publishes an SNS topic with the event details. Each downstream service owns a separate SQS queue that subscribes to that topic. A JSON document can be passed between both services to communicate any necessary parameters.

1) This is the upstream Airflow DAG. Once it has passed a certain point, a JSON document is passed via API Gateway to an SNS topic.
2) SNS immediately informs all subscribers that a new event has been received. The JSON document is passed along to all subscribers.
3) In the downstream domain, an FIFO SQS queue is subscribed to the SNS topic.
4) The first step in the downstream DAG polls the SQS queue on a regular interval for messages using API Gateway. If a message is on the queue, the step validates the message to see if it is properly formed. If so, it kicks off the DAG with the parameters from the JSON document and deletes the message from the queue via API Gateway.

An obvious advantage of this design is that when multiple SQS queues can subscribe to the SNS topic without impacting on the upstream DAG or other subscribed SQS queues.

Note: No Lambdas were harmed in the development of this application. Serverless is about more than Lambda.

CDK

We used CDK to deploy our services. This construct is very similar to what we used.

https://constructs.dev/packages/@aws-solutions-constructs/aws-sns-sqs/v/1.120.0?lang=python

However, you will need to split out the SQS queue into the downstream domains code base parameterized with the name of the SNS topic. This is still a manual step for us but we are investigating the use of AWS Systems Manager Parameter Store to store and retrieve the name of relevant topic within the CI/CD process.

Summary

Utilizing AWS services to facilitate an event-driven architecture has been a game-changer for us. It is a relatively simple change in our case but provides several powerful benefits.

To find out more about how AWS can help you decouple your applications and take advantage of event driven architectures, check out this link:

https://aws.amazon.com/event-driven-architecture/

To check out the individual services used, use the links below:

https://aws.amazon.com/sqs/
https://aws.amazon.com/sns/
https://aws.amazon.com/api-gateway/