During my time as a software engineer, I've dealt with different kinds of event-driven systems: some used events in the UI, others in the backend. In both classes of systems, I've managed to shoot myself in the foot. Let's look at what happened and how it could have been avoided.
An event bus is a nice way of decoupling communication between components, whether in the backend or in the UI. That makes it really easy and compelling to use the event bus to tell other components what to do - i.e. to use events as commands.
This is one of those things that may seem like a good idea at the time but may later come back and haunt you. There are multiple reasons for this:
- Your commands typically cannot return any results to the caller. You would have to resort to exceptions and separate result events to indicate the outcome of the command.
- It leads to logic that is difficult to understand and debug. Depending on how the event bus is implemented, it may not be obvious where an event was sent from or where it ended up being handled.
- If the event bus itself is changed in the future, e.g. from being a synchronous bus to an asynchronous bus where each event handler runs in its own thread, you may run into some strange side effects.
When building event-driven applications, think of an event as something that has already happened (and name it accordingly). If the application is command-driven as well (which can be really useful in certain use cases), it should use a separate, dedicated infrastructure for routing and handling the commands.
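To make the distinction concrete, here is a minimal sketch in Python (the names `ShipOrder`, `OrderShipped` and `CommandBus` are illustrative, not taken from any particular framework): events are named in the past tense and expect nothing back, while commands go through a dedicated dispatcher that routes each command to exactly one handler and can return a result.

```python
from dataclasses import dataclass

# An event describes something that has already happened.
# It is named in the past tense and carries no expectation
# of a return value.
@dataclass(frozen=True)
class OrderShipped:
    order_id: str

# A command is an instruction. It is named imperatively and
# goes through dedicated infrastructure, not the event bus.
@dataclass(frozen=True)
class ShipOrder:
    order_id: str

class CommandBus:
    """Minimal command router: exactly one handler per command type."""

    def __init__(self):
        self._handlers = {}

    def register(self, command_type, handler):
        if command_type in self._handlers:
            raise ValueError(f"{command_type.__name__} already has a handler")
        self._handlers[command_type] = handler

    def dispatch(self, command):
        # Unlike an event bus, dispatch returns the handler's result
        # to the caller.
        return self._handlers[type(command)](command)
```

The one-handler-per-command rule is the point: a command has a single, accountable recipient that can answer, whereas an event may have zero or many listeners, none of which can.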
In the previous section, we concluded that an event is something that has happened. Event consumers (or handlers, listeners, observers or whatever you want to call them) should typically not be allowed to change this.
With a simple event bus implementation that just loops through all the consumers and calls them synchronously, you may end up doing exactly that: if one of the consumers throws an exception, you may roll back the entire transaction and also stop the remaining consumers from ever getting the event.
In some cases, where you have a small number of consumers and are in control of all of them, this may be the desired behaviour, so there is nothing inherently bad about the design. However, if the events are intended to be consumed outside your direct control (by other developers, for instance), you should think twice. If you want event consumers to be able to affect the producer or other consumers in some way, that should happen explicitly, through an API designed for the purpose, and not by mistake because of incorrect use of an event bus.
There is also another issue that you can run into by not isolating your event consumers and producers properly. If you are using some kind of security context that is bound to the current thread, the simple event bus implementation mentioned above would invoke each consumer using the producer's security context. This can lead to unintended security breaches, e.g. if the consumers have been registered by different users.
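The difference between the naive loop and proper isolation can be sketched like this (a deliberately small toy bus, not a production implementation): `publish_naive` lets one failing consumer abort delivery and leak the exception back into the producer, while `publish_isolated` invokes every consumer and merely records the failures.

```python
class EventBus:
    def __init__(self):
        self._consumers = []

    def subscribe(self, consumer):
        self._consumers.append(consumer)

    def publish_naive(self, event):
        # One failing consumer aborts the loop: the remaining
        # consumers never see the event, and the exception
        # propagates back into the producer.
        for consumer in self._consumers:
            consumer(event)

    def publish_isolated(self, event):
        # Each consumer is invoked independently; a failure is
        # recorded instead of leaking back into the producer.
        failures = []
        for consumer in self._consumers:
            try:
                consumer(event)
            except Exception as exc:
                failures.append((consumer, exc))
        return failures
```

A real implementation would also log or report the failures and, as discussed above, could run each consumer under its own security context rather than the producer's.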
In the previous section, we concluded that event producers and consumers should be isolated from each other. However, this may lead to a situation where some consumers succeed in processing an event and others fail.
Depending on the type of event, this may or may not be a big deal. If we are talking about short-lived events, such as mouse moves or GPS position updates from a portable device, we can probably live with a missed event or two, as new events render the old ones obsolete anyway.
However, if we are talking about one-off events, you may end up with inconsistent data in your database. For example, in an order processing system, you may have received, processed and shipped an order but failed to generate an invoice because the event consumer that was supposed to handle that failed.
Addressing this problem is far from trivial, and the solution varies from system to system, but in principle it consists of three steps:
- Realise that your system is vulnerable to this issue. If the system works most of the time, you may not even notice this until it is too late.
- Make sure you can easily detect when an event has been missed, either manually or automatically.
- Make sure you can either replay the missed event or retrigger the needed actions in some way, either manually or automatically. If you are not familiar with the concept of event sourcing, I would recommend having a look at it.
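The steps above can be sketched with a deliberately simplified, in-memory stand-in for a real event store: events are appended to a log, and a consumer that missed some of them can be fed everything from its last known position onward.

```python
class EventStore:
    """Append-only log; events can be replayed from any position."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)
        return len(self._log) - 1  # position of the stored event

    def replay(self, consumer, from_position=0):
        # Re-deliver every event from the given position onward,
        # e.g. to a consumer that was down when they first occurred.
        for event in self._log[from_position:]:
            consumer(event)
```

In a real system, each consumer would persist its own checkpoint position; a checkpoint lagging behind the head of the log is exactly the "missed event" signal that the detection step asks for.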
In the previous section, we concluded that one of the ways of dealing with failed event consumers is to replay the event at a later time until all consumers have succeeded. But what would then happen with all the consumers that did succeed on the first try? We typically don't want them to process the same event more than once.
One way of doing this is to somehow keep track of which consumers have successfully handled an event and which ones have not. However, this can become complicated really fast.
A better approach is to make sure all your event consumers are idempotent by design.
In software, an idempotent operation is one that can be performed multiple times with the same effect as performing it once: repeating it does not change the state of the system any further.
An idempotent event consumer would be a consumer that only changes the state of the system the first time an event is received. If the same event is received more than once, the system state remains unchanged. If all your consumers are implemented like this, you can replay your events as many times as needed without side effects. It also makes your system more resilient if you are receiving your events from a remote message queue that cannot guarantee that all events will be delivered exactly once 100% of the time.
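A minimal sketch of such a consumer, assuming every event carries a unique id (the `InvoiceGenerator` name is made up for illustration): state changes only the first time a given event id is seen, so redelivery is harmless.

```python
class InvoiceGenerator:
    """Idempotent consumer: each event id is acted upon at most once."""

    def __init__(self):
        self._handled = set()  # in production: a durable store
        self.invoices = []

    def on_order_shipped(self, event_id, order_id):
        if event_id in self._handled:
            return  # duplicate delivery: state remains unchanged
        self._handled.add(event_id)
        self.invoices.append(order_id)
```

In production, the set of handled ids would live in durable storage and be updated in the same transaction as the state change itself, otherwise a crash between the two writes reintroduces the problem.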
On paper, your event-driven design may look nice, clean and simple. Your initial testing on your own workstation may confirm that perception. Then you release your software to production, and things start to go wrong.
It turns out some event consumers are a lot slower than others, and some event producers produce more events than the consumers can keep up with. Also, some event consumers in turn produce events of their own, further increasing the load on the event bus.
Your first attempt at fixing this is to introduce thread pools for the event consumers, but it only helps for a little while. Soon, the thread pools' queues are full and they start to reject jobs (which means you are essentially throwing away events without handling them).
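The rejection behaviour is easy to reproduce with a bounded queue (a toy sketch; real thread pools reject in the same spirit once their work queues fill up): when the consumer falls behind, publishing starts to fail, and every failed publish is a silently lost event.

```python
import queue

class BoundedEventQueue:
    """Events wait in a bounded queue; when it is full, publishing
    fails fast instead of blocking the producer indefinitely."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def publish(self, event):
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # event dropped: the consumer cannot keep up

    def take(self):
        return self._queue.get_nowait()
```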
Event-driven systems are dynamic by their nature. However, it is very easy to only look at the static aspects when you design a software system because they are easier to grasp and reason about. Especially in multi-user systems and systems that respond to events coming from outside the system, it is important to think about event dynamics from the start:
- How long will it take for a particular event consumer to process a single event?
- Will an event consumer emit new events? What will that lead to?
- What is the estimated event throughput during normal load and peaks?
- What is the highest event throughput your system should be able to handle and what should happen if the system can't keep up?
This is a back pressure problem that you will run into in any system where a producer produces data faster than a consumer can process it. There is no one-size-fits-all solution, and discussing it in more detail is way out of scope for this post. However, here are a few things to get you started:
- If you can use a ready-made, battle tested product for solving the problem, use it.
- If events arrive in bursts, you can store them in a queue. Please note that this won't work if there is never time to empty the queue between bursts.
- If the events can be aggregated, you can store them in a queue and then have the consumer process them in groups rather than one by one.
- If you are in control of emitting the events in the first place, you may want to do that in sequence instead of in parallel.
- If your system and environment allows it, you could automatically scale out the number of event consumers as the load increases and then scale back in when it decreases (this typically requires a message broker that is able to dynamically route messages to different queues).
- If you can control the event producer in some way (either by asking it to slow down or stop until further notice), do it. And if you are the one implementing the event producer, consider adding this feature.
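As an example of the queueing-plus-aggregation idea from the list above (a simplified sketch; `drain_in_batches` and its parameters are made up for illustration): buffered events are handed to the consumer in groups, so that e.g. one database write can cover many events instead of one write per event.

```python
import queue

def drain_in_batches(pending, process_batch, batch_size):
    """Drain a queue of events, handing them to the consumer in
    groups instead of one by one (e.g. one database write per batch)."""
    batch = []
    while True:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            process_batch(batch)
            batch = []
    if batch:  # flush the final, possibly partial, batch
        process_batch(batch)
```

This only pays off when events can meaningfully be aggregated; a batch of GPS positions might collapse into just the latest one, while a batch of orders still has to be processed order by order, only with less per-event overhead.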