The idea behind this series of posts is to demonstrate a reasonably realistic example of how to structure a back-end architecture using simple Azure serverless offerings. The project implements a realistic business application that makes extensive use of CQRS and Event Sourcing concepts, as well as some more advanced patterns such as a Process Manager. Most components of this architecture are designed with reactivity in mind, in the sense that each component is called in reaction to a formal message arriving on the network. As will be explained later, Azure provides excellent tools to help you leverage reactive programming in the cloud. This architecture is inherently asynchronous and message-based.
We will try to show a viable way to start new projects in the cloud using serverless products from Azure. Capitalizing on Azure's serverless offerings can save a great deal of time and money for teams that need to rapidly create modern, scalable, and elastic solutions. This is particularly interesting for startups and teams wanting to deliver flexible and easy-to-understand solutions in the cloud.
This article will give you directions to help you achieve a viable serverless event-driven, reactive architecture on Azure. We will strive to inspire your team to consider a serverless approach to your next project.
We try to explain the following concepts and techniques:
- Motivation of why event-driven architecture is an advantage in the cloud
- Example of CQRS, Commands, Events, Command Handlers implementations
- Reactive techniques on Azure
- Using Cosmos DB as the event store of the whole system. We demonstrate the concept of an event store, global state, projections, and what to do when an event arrives
- Propagation of events to other consumers, including as yet unknown ones, using Event Grid, and a worked example of a service that reacts to external events and depends on manual intervention
- Detailed implementation of a Process Manager that orchestrates a process and maintains state on the cloud
- Examples of retry logic and Dead Lettering of messages
- Deep usage of Azure Functions and Durable Functions
- Monitoring using Application Insights and troubleshooting of problems
To keep this example from becoming overly complex, the following won't be covered:
- API Management, Logic Apps, Service Bus and other related offers as we wanted to keep things simple
- Event storming and DDD modeling techniques in detail
- Unit Testing techniques
- Pipelines of Continuous Integration or Continuous Deployment, Terraform or ARM
- Front-end code. We have included a simulator that can prepare the system for the desired scenarios using a console application.
- Data and AI integrations
- GraphQL and gRPC
We plan to cover some of these items in the future.
Synchronous systems are systems where the reaction to an action (started by the user or not) is immediately processed, persisted, and returned to the requestor. This is a very natural and common way to implement business applications, and we have seen this kind of implementation everywhere. Monoliths are normally implemented using synchronous strategies, usually with a central database that stores the global state of the whole application. Reads and writes happen in the same location, competing for network and storage resources. Improvements in the storage resources are hard to achieve, and the database must be highly optimized to deliver good results. Database scaling is frequently expensive.
Since the global state is shared, it is often difficult to add or modify tables or fields in the database. It's always complicated to find the systems impacted by a single change. Adding bigger features that require deeper changes in table structures is often impossible, or very hard to implement without long negotiations. Another disadvantage of monolithic architectures is the possibility that resource starvation, timeouts, and failures of integrations with third-party systems affect the critical path of the application. User experience can easily be degraded. The whole system becomes more sensitive to issues during the processing of requests. These scenarios are the major motivation to migrate to microservices, along with the possibility of dividing the development work among many independent teams. There is a common expectation that migrating to a microservices architecture is enough, by itself, to achieve a more flexible architecture and a more agile software development and delivery process. This is not necessarily true.
To fully use the potential of cloud services, we should separate the system into components that each do one task well. A monolithic approach does not work when we must scale different parts of the system at different times. Separating the system into smaller pieces inevitably leads us to a microservices approach. We design in terms of services that can do their individual work with minimal dependencies on other parts of the system. Each service can be scaled individually, can fail, can be stopped, and can be changed with very limited effects on the other parts of the system. Ideally, the maintenance of one part should not affect any other part.
We must also be aware that with a serverless approach we are charged in proportion to use, and we can decommission parts of the system that don't need to be active at all times. Another advantage is that many services are billed by the number of messages processed. On Azure, many services are free for the first thousands or millions of messages processed per month. This is an important cost saver compared to more traditional approaches, or even container orchestrator approaches, which imply a fixed minimum cost to run the business because virtual machines have to run 24x7.
An event-driven approach is well suited for microservices communication in the cloud. We embrace asynchrony. With messaging, it is possible to separate producers from consumers so that we can scale them individually. In the context of event-driven architectures, we usually classify messages as Commands, Queries, Events, and Notifications. We tend to see the dynamics of the system as a sequence of commands and events ordered in time. Consequently, modern systems can often be understood as a stream of events. Azure has many offerings that can help in business apps, IoT, Data+AI, and more. In this example, we cover a realistic use case that uses an event-driven strategy to deliver a business application.
Doing event-driven architecture on the cloud has a very interesting consequence: the cloud itself can work as a reactor. The cloud assumes the responsibility to take a given message and deliver it to its handler without the consumer knowing anything about how the input arrived. The consumers in this kind of architecture passively receive the messages and react to them. A reactive service does not know anything about the messaging infrastructure, does not have code related to messaging itself. Reactive services are simpler to understand and develop.
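To make the idea of a reactive service more concrete, here is a minimal in-memory sketch. It is written in Python for brevity and is not the project's code. The `Reactor` class is a hypothetical stand-in for the platform, and `CreditsTransferred` and `on_credits_transferred` are illustrative names. The point is only that the handler contains business logic and nothing about delivery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CreditsTransferred:
    """Hypothetical event; the name and fields are illustrative only."""
    from_account: str
    to_account: str
    amount: int


def on_credits_transferred(event: CreditsTransferred, balances: Dict[str, int]) -> None:
    """A reactive handler: plain business logic, no messaging code at all."""
    balances[event.from_account] = balances.get(event.from_account, 0) - event.amount
    balances[event.to_account] = balances.get(event.to_account, 0) + event.amount


class Reactor:
    """In-memory stand-in for the platform that owns routing and delivery."""

    def __init__(self) -> None:
        self._handlers: Dict[type, List[Callable[[object], None]]] = {}

    def register(self, message_type: type, handler: Callable[[object], None]) -> None:
        self._handlers.setdefault(message_type, []).append(handler)

    def deliver(self, message: object) -> int:
        """Call every handler registered for the message type; return how many ran."""
        handlers = self._handlers.get(type(message), [])
        for handler in handlers:
            handler(message)
        return len(handlers)


balances: Dict[str, int] = {"parent": 100, "son": 0}
reactor = Reactor()
# The wiring lives outside the handler, much like a trigger binding would.
reactor.register(CreditsTransferred, lambda e: on_credits_transferred(e, balances))
reactor.deliver(CreditsTransferred("parent", "son", 30))
```

Because the handler is an ordinary function, it can be exercised without any messaging infrastructure at all.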
DDD is a very important tool to design cloud-native systems. In the world of distributed systems, it's very difficult to manage business rules, to manage transactional boundaries, and to understand the big picture. We must design the systems considering commands, events, and bounded contexts from Day 0. The development of the application should anticipate the domain model from the beginning and be prepared for the evolution of the project.
We will cover the implementation using Azure serverless offerings, and we will go into deep detail to illustrate how anyone can start using an event-driven architecture.
We make extensive use of Cosmos DB, Azure Functions, and Event Grid as the main cloud components of our architecture. In this example we also use storage accounts and Durable Functions. Azure Functions is, in our view, the best way to run your code in the cloud when it needs to react to HTTP calls, changes in Cosmos DB collections, scheduled calls, and more. Azure Functions offers a sophisticated structure of bindings and triggers that allows it to be an intelligent glue for various Azure services. A function must execute fast and should not maintain state. Using a stateless approach, we can easily create functions that scale in proportion to the number of calls that need to be processed. That means that Azure can create multiple instances of a function and consume as many messages as needed. This is a very important reason to keep parts of the process separated in different functions and, potentially, different function apps.
Cosmos DB is often viewed as an expensive way to store data in the cloud, but there have been many recent efforts to reduce its cost. Use cases that need a smaller number of RUs now benefit. At the time of writing (April/May 2020), Cosmos DB can use database-level pricing instead of collection-level pricing. It's now possible to share RU/s across collections and to reserve RU/s for specific collections when necessary. This is a very important price reduction, as not all collections have intensive read/write needs. Since March 2020, Cosmos DB offers a free tier for the first 400 RU/s and 5 GB in the month and lets you share the free tier with 25 containers of the database. This is a huge cost saver for small use cases. Smaller microservices can now consider using Cosmos DB as their primary storage engine. We do this in this project.
When starting a project from scratch, there is always room to analyze whether the storage model should be relational or non-relational. Of course, it depends a lot on the workload of the app to be created. In general, though, we see in the industry that startups tend to adopt the NoSQL model because it's far easier to design a document database than a complete relational model. Startups are commonly still discovering their market fit, and they keep changing and adding features to their products.
In this example, we don't make use of a relational database, even though there is an Azure offering for SQL Server that is serverless. We stick with Cosmos DB because we are creating a project from scratch that does not need a relational database and all its complexity. Another important feature that Cosmos DB offers is the Change Feed. This feature is a key enabler of event-driven architectures. Through the Change Feed, an Azure Function is called when an item is added or modified in a given container. This is nice because a function can now be called reactively by an insertion into a collection. Cosmos DB is a reactor in our system, sending messages in the order they arrive. One or more listeners react to each message that arrives. This mechanism is analogous to Apache Kafka: a producer adds a message to the stream and, potentially, some consumer processes the message. The message is not deleted after consumption, which enables a scenario where multiple consumers process the messages at their own pace. The Change Feed is just a list of messages that can be read from the beginning at any time; the consumer decides which item to process next. Azure Functions automates the reading of the Change Feed and the management of the “cursor” that marks the position of the message in the feed.
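The "list of messages plus a cursor per consumer" idea can be sketched in a few lines. This is an in-memory model in Python, not the Cosmos DB SDK or the Azure Functions trigger. `ChangeFeed` and `FeedConsumer` are hypothetical names used only for illustration.

```python
from typing import List


class ChangeFeed:
    """In-memory model of an append-only feed; items are never removed."""

    def __init__(self) -> None:
        self._items: List[str] = []

    def append(self, item: str) -> None:
        self._items.append(item)

    def read(self, position: int, max_items: int) -> List[str]:
        return self._items[position:position + max_items]


class FeedConsumer:
    """Each consumer keeps its own cursor, so consumers advance independently."""

    def __init__(self, feed: ChangeFeed) -> None:
        self._feed = feed
        self.cursor = 0
        self.processed: List[str] = []

    def poll(self, max_items: int = 10) -> int:
        batch = self._feed.read(self.cursor, max_items)
        self.processed.extend(batch)
        self.cursor += len(batch)  # advance only past what was actually read
        return len(batch)


feed = ChangeFeed()
for name in ["event-1", "event-2", "event-3"]:
    feed.append(name)

fast = FeedConsumer(feed)
slow = FeedConsumer(feed)
fast.poll()             # reads everything available
slow.poll(max_items=1)  # reads a single item and stops

late = FeedConsumer(feed)  # a new consumer can still start from the beginning
late.poll()
```

In the real service the cursor bookkeeping is what Azure Functions takes care of for us; the sketch only shows why reading does not interfere with other readers.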
An event store is just a database that stores events in an architecture that uses Event Sourcing. Cosmos DB can be seen as a helpful event store in the system, since it stores the messages and can reactively notify Azure Functions about new items added. The Azure Function that reacts to a message in Cosmos DB is a projector and, optionally, it can update the read model in reaction to an event. The read model is just a projection of the global state stored in the event store. This means that we can potentially delete the read model and reconstruct it using the data in the event store. We can even have multiple projections, that is, multiple read models. In our project, we update the read model every time a message arrives.
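A small sketch can show why the read model is disposable. It assumes a hypothetical `BalanceChanged` event and uses plain Python lists and dictionaries in place of Cosmos DB. The same `project` function is used both to update the read model as events arrive and to rebuild it from scratch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class BalanceChanged:
    """Hypothetical event: a signed change to one account's balance."""
    account: str
    delta: int


def project(read_model: Dict[str, int], event: BalanceChanged) -> None:
    """Projector step: fold one event into the read model."""
    read_model[event.account] = read_model.get(event.account, 0) + event.delta


def rebuild(events: Iterable[BalanceChanged]) -> Dict[str, int]:
    """Recreate the read model from scratch by replaying the whole store."""
    read_model: Dict[str, int] = {}
    for event in events:
        project(read_model, event)
    return read_model


event_store: List[BalanceChanged] = []  # append-only list standing in for the store
live_read_model: Dict[str, int] = {}

for event in [BalanceChanged("parent", 100),
              BalanceChanged("parent", -30),
              BalanceChanged("son", 30)]:
    event_store.append(event)         # the fact is recorded first
    project(live_read_model, event)   # then the projection reacts to it

rebuilt = rebuild(event_store)
```

A second read model would simply be another projector function over the same events.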
Event Grid is a serverless offering designed to deliver individual messages to applications and to ensure that each message is either delivered or dead-lettered. The use case for Event Grid is to be the glue that connects many event sources and event consumers, carrying messages that have high individual value and can cause problems if lost. It's an enabler of reactive scenarios in the cloud. One great aspect is that it connects to Azure Functions and webhooks. This integration option allows any app that exposes a webhook to simply be called by Event Grid. For example, if an event handler exposes a webhook, it can readily react to an event published by our system with Event Grid's help, without any modification in code. Of course, we should note that Event Grid can also listen to events published internally by many Azure components.
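The "delivered or dead-lettered" guarantee can be pictured with a simplified sketch. The `deliver` function below is hypothetical and runs in memory. It does not call Event Grid and does not reproduce its actual retry schedule; it only shows the shape of the behavior: retry a few times, then set the event aside instead of losing it.

```python
from typing import Callable, List


def deliver(event: dict, handler: Callable[[dict], None],
            dead_letters: List[dict], max_attempts: int = 3) -> bool:
    """Try the handler up to max_attempts times; dead-letter the event if all fail."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            continue  # a real service would wait (back off) before retrying
    dead_letters.append(event)
    return False


calls = {"flaky": 0}


def flaky_handler(event: dict) -> None:
    """Fails twice, then succeeds, to imitate a transient outage."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("endpoint temporarily unavailable")


def broken_handler(event: dict) -> None:
    raise RuntimeError("endpoint always fails")


dead_letters: List[dict] = []
delivered_flaky = deliver({"id": "1"}, flaky_handler, dead_letters)
delivered_broken = deliver({"id": "2"}, broken_handler, dead_letters)
```

The dead-letter list is what makes a permanently failing consumer visible to an operator rather than silently dropping the message.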
Now let’s explore how the backend of our app is designed. We have an example of a very common scenario where the front-end wants to trigger a change in the global state of the system:
The diagram in the figure above shows what happens when a request is sent from the client. In this scenario, a client is any client-side technology, such as a mobile, web, or desktop application. The bullets in the diagram show the steps taken when a user action begins.
Step 1 – Preparation of the request: The client needs to perform an action that is expected to change the state of the system. In our example, the user wants to send some credits to his son. The mobile app will first have run a query to obtain the son's account details. This is done by sending a query to the read side of the application. The read side of the application, in the sense of CQRS, is a database and code optimized for reading that is physically apart from the database used to store data on the "write side". The client app then sends the data over the wire to an HTTP endpoint. It's important to have the read model updated as soon as possible.
Step 2 – Request Validation and Command Construction: The Azure Function receives the request and validates the structure of its body. More complex validation may require the help of the read model. After the request is validated, the command is constructed and sent to a Cosmos DB collection. The command is not handled immediately, as would be expected in other architectures. We avoid handling the command inside the function that processes network requests because, in this regard, the HTTP endpoint works as a producer of messages (commands) that will be handled by different command handlers. We decouple command construction from command handling in both the code and deployment dimensions. Another important issue in command processing is that duplicate commands may be received. The command store does not allow two commands with the same hash, so we are protected from duplicates and related issues.
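The validation, command construction, and duplicate rejection in this step can be sketched as follows. This is illustrative Python, not the project's implementation. `TransferCredits`, `CommandStore`, and `accept_request` are hypothetical names, and the hashing scheme (SHA-256 over the command's JSON, including a client-supplied `request_id`) is an assumption, since the way the hash is computed is not described here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TransferCredits:
    """Hypothetical command; field names are illustrative only."""
    request_id: str
    from_account: str
    to_account: str
    amount: int


def validate(body: dict) -> List[str]:
    """Structural validation only; returns a list of problems (empty means valid)."""
    errors = []
    for field in ("request_id", "from_account", "to_account"):
        if not isinstance(body.get(field), str) or not body.get(field):
            errors.append(f"{field} must be a non-empty string")
    amount = body.get("amount")
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount must be a positive integer")
    return errors


def command_hash(command: TransferCredits) -> str:
    """Deterministic hash of the command content (assumed scheme, not the project's)."""
    payload = json.dumps(asdict(command), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class CommandStore:
    """In-memory stand-in for the command collection; rejects duplicate hashes."""

    def __init__(self) -> None:
        self._commands: Dict[str, TransferCredits] = {}

    def try_add(self, command: TransferCredits) -> bool:
        key = command_hash(command)
        if key in self._commands:
            return False  # same content already stored: treat as a duplicate
        self._commands[key] = command
        return True


def accept_request(body: dict, store: CommandStore) -> str:
    """HTTP-endpoint logic: validate, build the command, store it, do not handle it."""
    if validate(body):
        return "rejected"
    command = TransferCredits(body["request_id"], body["from_account"],
                              body["to_account"], body["amount"])
    return "accepted" if store.try_add(command) else "duplicate"


store = CommandStore()
request = {"request_id": "r-1", "from_account": "parent", "to_account": "son", "amount": 30}
```

Calling `accept_request(request, store)` twice returns `"accepted"` and then `"duplicate"`. Including a request identifier in the hashed content is one way to let a user make two identical transfers on purpose while still rejecting a retried submission.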
Step 3 – Handling Commands: The commands are saved to a collection. The command handler logic is triggered in reaction to a command being published on the system. If the command fails, the client should be notified. If the command succeeds, the client should not be notified yet; the event should be published first. In general, each command handler produces one or more events. Command handling should have a retry mechanism, since a transient failure could otherwise cause a command to fail when a simple retry would have avoided it.
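One way to picture this step is to separate two kinds of failure: transient ones, which are worth retrying, and business rejections, which are not. The sketch below is illustrative Python with hypothetical names (`TransientError`, `CommandRejected`, `handle_transfer`, `run_with_retry`). It does not represent how the project or Azure Functions implements retries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class TransientError(Exception):
    """A failure worth retrying, for example a timeout."""


class CommandRejected(Exception):
    """A business failure; retrying would not change the outcome."""


@dataclass(frozen=True)
class TransferCredits:
    from_account: str
    to_account: str
    amount: int


@dataclass(frozen=True)
class CreditsTransferred:
    from_account: str
    to_account: str
    amount: int


def handle_transfer(command: TransferCredits,
                    balances: Dict[str, int]) -> List[CreditsTransferred]:
    """Decide whether the command is acceptable and return the resulting events."""
    if balances.get(command.from_account, 0) < command.amount:
        raise CommandRejected("insufficient balance")
    return [CreditsTransferred(command.from_account, command.to_account, command.amount)]


def run_with_retry(action: Callable[[], list], max_attempts: int = 3) -> list:
    """Retry only transient failures; let business rejections propagate at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise
    return []  # unreachable when max_attempts >= 1


attempts = {"count": 0}


def flaky_action() -> list:
    """Imitates a dependency that times out once before the handler can run."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TransientError("timeout")
    return handle_transfer(TransferCredits("parent", "son", 30), {"parent": 100})


events = run_with_retry(flaky_action)
```

A `CommandRejected` raised by the handler is the case where the client should be notified of the failure; the returned events are what the next step persists.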
Step 4 – Persisting events on event store: When the command is processed successfully, the command handler publishes one or more events that represent a change in the global state. To be clear, a successful insertion into the Cosmos DB Events collection means that this event has become public and can't be removed or changed anymore. It's a fact. The event, published on Cosmos DB, can now be processed further by the mechanism of Change Feed + Azure Functions.
Step 5 – Primary Event Handlers: With serverless technology, the only way to process documents from the Change Feed in code is to use an Azure Function. So it's mandatory to have an Azure Function listening to the events published on our event store if we want to react to them. It's also the only way to publish Cosmos DB changes to Event Grid. Since this function is mandatory, we call it a Primary Event Handler. In our project, we also use this function to continuously update the read model of the application so the client can see the results of their request as soon as possible. We call this a primary reaction. After the primary reaction to an event is completed, we can send the event to Event Grid and let other event handlers work. A primary reaction should always be a crucial use case of the system, usually one where the user must see the results of their actions immediately. In our example, all reactions that require a fast change to the user's balance are done in primary event handlers, so the clients can be notified of a change in the balance as soon as possible and the chance of further command rejections due to an incorrect balance is reduced. In the next post, we show the details of the business problem and an implementation strategy.
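The ordering described here, read model first and forwarding second, can be shown with a short sketch. It is plain Python with hypothetical names; `forward_to_router` merely stands in for publishing to Event Grid and records what the read model contained at the moment each event was forwarded.

```python
from typing import Callable, Dict, List


def primary_event_handler(batch: List[dict], read_model: Dict[str, int],
                          forward: Callable[[dict], None]) -> None:
    """Update the read model first (primary reaction), then forward each event."""
    for event in batch:
        account = event["account"]
        read_model[account] = read_model.get(account, 0) + event["delta"]
        forward(event)  # secondary handlers only see the event after the update


read_model: Dict[str, int] = {}
observed: List[int] = []


def forward_to_router(event: dict) -> None:
    """Stand-in for publishing to an event router; records the balance it can see."""
    observed.append(read_model[event["account"]])


primary_event_handler(
    [{"account": "son", "delta": 30}, {"account": "son", "delta": 20}],
    read_model,
    forward_to_router,
)
```

Here `observed` ends up as `[30, 50]`: by the time an event leaves the primary handler, the balance the client would query already reflects it.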
Step 6 – Read Model updated: The read model is a projection of the global state. As said before, the global state is just an immutable sequence of events. Nothing more. It means that the read model of the application can only be updated after the global state is updated on the event store. The read model can be any kind of database or storage. We can also have multiple read models for different needs. For example, a read model can be a simple collection on Cosmos DB, a SQL Server database, or a sophisticated graph database. Please note that services can maintain and change their local state independently. We will further discuss how services can change the global state in the next post.
Step 7 – Event Grid dispatches events: The idea is to forward all events to Event Grid so that, in the future, a new service can be added to listen to events of interest. Event Grid can help promote reactive programming across the cloud, helping in the transmission of events or other kinds of messages to any service. This is very important because Event Grid is an integration point that allows many systems to connect and exchange messages. Event Grid can start workflows by calling webhooks, by adding messages to queues, and in some other ways. Please note that Event Grid is designed to help "real-time" use cases, where the reaction should occur immediately after the message arrives. Event Grid can only route events in the present. New services won't be able to read past events unless they consume Cosmos DB data. It's not currently possible to schedule messages for the future, although you can do this easily on Azure Service Bus.
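The difference between a router that only pushes present events and a store that can be replayed is easy to see in a sketch. `EventRouter` below is a hypothetical in-memory stand-in, not the Event Grid API, and the Python list stands in for the event store.

```python
from typing import Callable, List


class EventRouter:
    """Push-only router: it delivers to current subscribers and keeps no history."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: str) -> None:
        for handler in self._subscribers:
            handler(event)


event_store: List[str] = []  # stand-in for the persistent event store
router = EventRouter()


def record_and_publish(event: str) -> None:
    event_store.append(event)  # persisted first, so it can be replayed later
    router.publish(event)


early: List[str] = []
late: List[str] = []

router.subscribe(early.append)
record_and_publish("event-1")
router.subscribe(late.append)  # joins after event-1 was routed
record_and_publish("event-2")

# The late service catches up by reading the store, not by asking the router.
late_after_replay = [e for e in event_store if e not in late] + late
```

The late subscriber receives only `event-2` from the router and has to read the store to learn about `event-1`.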
Step 8 – Secondary Event Handlers: A secondary event handler in this architecture is an event handler that is called by Event Grid to process an event. It can be another system with an admin GUI, a microservice, Power BI, or anything else. In our example, we created a web application that manually validates whether a given transfer operation is acceptable or not. It's worth noting that these event handlers can produce more commands and events in reaction to an event. This is very interesting, and we discuss it in detail in future parts of this series.
Step 9 – Other Reactions: Sometimes the services called during an event reaction will publish new events that potentially will change the global state. This is naturally acceptable. We can still react to these events as part of a more complex choreography of events.
To make things easier, in this example we share a library with all the functions and services. Our domain logic is centralized in a single library, as are all commands and events of interest. This is a design decision that has pros and cons. Centralization of business rules, as we will see in the next part of this series, can dramatically help to ensure that we maintain the domain in a consistent state and don't break domain invariants. On the other hand, it brings coupling to a programming language and technology. For example, we indirectly force all the functions to be written in .NET Core 3.
This does not mean that every service has to be an Azure Function. Microservices can be hosted, for example, in a Kubernetes cluster, reacting normally to events and publishing more events. This kind of integration is planned for a future post.
As we saw above, we use Cosmos DB as our persistent event storage. This is simple to understand, and its reactive capabilities make it simple to process events further. Having a queryable event store is an important advantage and can help maintain the overall simplicity of the solution. There are pros and cons to using Cosmos DB. Achieving higher throughput is often very expensive, and this can be a blocker in many projects. On the other hand, using a managed service like this allows for a planetary-scale event store, with high availability and automatic management of data replication. It is usually very expensive to build this kind of architecture on other technology stacks and clouds. For common business apps and use cases, Cosmos DB can help at a reasonable cost. The growth in scale can be controlled, as can the cost. We can do event streaming even for business applications, which is great.
There are use cases that are far more complex and depend on very fast data ingestion and processing. Azure offers many technologies to help in these scenarios. Notable examples are Event Hubs, Kafka (AKS / HDInsight), Stream Analytics, and Azure Databricks. Azure Functions and Cosmos DB can also help in this kind of architecture. These components can work in combination to implement a Lambda Architecture or a Kappa Architecture.
As a final note, it's always important to understand whether we are streaming messages, logs, or telemetry to be processed in near real-time, or streaming business commands and events through the system. The notion of the value of each message is very important. For example, it's acceptable to lose a message reporting the temperature of a sensor that is sent every 30 seconds. On the other hand, we would have problems if we lost a message containing business-rich information, like the submission of a form used to create a new account. Different use cases require different architectures and different investments in cost.
Kafka was not intended, originally, to store messages forever. Kafka, like Azure Event Hubs, works better for use cases that need to deal with high data ingestion throughput and distribution to multiple consumer groups that can consume these messages at their own pace. The received messages are intended to stay on the log for a configurable time. Handling millions of messages per second, these messages are usually forwarded to real-time analytics providers and usually, batching and storage adapters. Kafka and Event Hubs are not meant to be the final store for the data - they are a very fast and scalable service for data ingestion and normally, the first part of a data processing/analytics pipeline.
That said, ingested data should normally be processed and stored where it can be queried and used in applications. In our opinion, Kafka should not be used as an event store, and neither should Azure Event Hubs. Products like Cosmos DB and EventStore (https://eventstore.com/), among others, are more convenient for this purpose. Cosmos DB is a very interesting option when we are using the Azure cloud because it's serverless and there is no need to worry about its internal management. As a side note, relational databases can also be used as an event store, and many people have done so (https://github.com/commanded/eventstore).
Cosmos DB can help in the scenario of doing CQRS and Event Sourcing because it can be both an ingestion engine and an event store database with great querying support, not to mention the ability to automatically invoke a function when new data arrives. In the next part of this series, we delve into the details of how the code can be structured to help succeed using Event Sourcing techniques.
In this first part of the series, we introduced how a team can structure a project to create an event-driven business application, accommodating the need for command and query separation and taking advantage of an event store for the application. In the next part of the series, we will go into greater detail, explaining the code necessary to handle commands. We will also talk about aggregate design.