Josh Fischer

Handling Failures in Streaming Jobs

The following article is an excerpt from my forthcoming book Grokking Streaming Systems from Manning Publications. You can find it here: https://www.manning.com/books/grokking-streaming-systems.

Streaming systems are increasingly used because of the need for on-demand data, and on-demand data can be unpredictable. Components in a streaming system, or the external systems they depend on, might not be able to handle the traffic, and they might occasionally have issues of their own. Let’s look at a few potential issues that could arise in the fraud detection system.
[Image: Streaming DAG]

After all, failure handling is an important topic in all distributed systems, and our fraud detection system is no different. Things can go wrong, and safety nets are important to keep those problems from escalating.

New concept: backpressure

When the capacity utilization reaches 100%, things become more interesting. Let’s dive into it using the fraud detection job as an example.

[Image: backpressure]

When an instance becomes busy and can’t catch up with the incoming traffic, its incoming queue will keep growing and eventually run out of memory. The issue will then propagate to other components, and the whole system will stop working. Backpressure is the mechanism that protects the system from crashing.
Backpressure, by definition, is pressure in the direction opposite to the data flow: from downstream instances to upstream instances. It occurs when an instance cannot process events at the speed of the incoming traffic, or in other words, when its capacity utilization reaches 100%. The goal of backpressure is to slow down the incoming traffic when there is more traffic than the system can handle.

Measure capacity utilization

Backpressure should trigger when the capacity utilization reaches 100%, but capacity and capacity utilization are not easy to measure or estimate. Many factors determine how many events an instance can handle, such as the resources available, the hardware, and the data itself. CPU and memory usage are useful signals, but they don’t reliably reflect capacity either. We need a better way, and luckily, there is one.
We have learned that a running streaming system is composed of processes and the event queues connecting them. The event queues are responsible for transferring events between the instances, like the conveyor belts between workers in an assembly line. When the capacity utilization of an instance reaches 100%, its processing speed can’t catch up with the incoming traffic. As a result, events start to accumulate in the instance’s incoming queue. Therefore, the length of an instance’s incoming queue can be used to detect whether the instance has reached its capacity.
Normally the length of the queue should go up and down within a relatively stable range. If it keeps growing, it is very likely the instance is too busy to handle the traffic.
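
To make the idea concrete, here is a minimal sketch of a queue-length monitor. It is not code from the streamwork engine or any particular framework, and the class and method names are invented for illustration. It samples an instance’s incoming queue on a timer and flags the instance as busy when the queue length keeps growing instead of staying in a stable range.

```java
import java.util.Queue;

// Illustrative sketch: sample an instance's incoming queue periodically and
// flag the instance as busy when the queue length keeps growing instead of
// oscillating in a stable range. Names are made up for this example.
public class QueueLengthMonitor {
  private final Queue<?> incomingQueue;  // the queue in front of one instance
  private int previousLength = 0;
  private int growingSamples = 0;

  public QueueLengthMonitor(Queue<?> incomingQueue) {
    this.incomingQueue = incomingQueue;
  }

  // Call this on a timer, for example once per second.
  public boolean sample() {
    int currentLength = incomingQueue.size();
    // Count consecutive samples in which the queue got longer.
    growingSamples = (currentLength > previousLength) ? growingSamples + 1 : 0;
    previousLength = currentLength;
    // If the queue has grown for several samples in a row, the instance is
    // likely too busy to keep up with the incoming traffic.
    return growingSamples >= 5;
  }
}
```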

[Image: backpressure-up-close]

In the next few pages, we will discuss backpressure in more detail, first with our local streamwork engine to get some basic ideas, and then with more general distributed frameworks.
Note that backpressure is especially useful for temporary issues, such as instances restarting, maintenance of dependent systems, and sudden spikes of events from the sources. The streaming system handles them gracefully by temporarily slowing down and resuming afterwards, without user intervention. Therefore, it is very important to understand what backpressure can and cannot do, so that when system issues happen, you have things under control.

Backpressure in the streamwork engine

Let’s start with our own streamwork engine, since it is more straightforward. As a local system, the streamwork engine doesn’t have complicated logic for backpressure. However, the information will be helpful when we look at backpressure in real frameworks next.
In the streamwork engine, blocking queues (queues that suspend threads that try to append events when the queue is full, or take events when the queue is empty) are used to connect processes. The queues are not unlimited in length. Each queue has a maximum capacity, and that capacity is the key to backpressure. When an instance can’t process events fast enough, the consuming rate of the queue in front of it will be lower than the insertion rate. The queue will start to grow and eventually become full. From that point on, insertions are blocked until an event is consumed by the downstream instance. As a result, the insertion rate slows down to match the event processing speed of the downstream instance.
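
Here is a minimal, self-contained sketch of that behavior using Java’s standard LinkedBlockingQueue. It is not the streamwork engine’s actual code, but it shows the same effect: put() blocks as soon as the bounded queue is full, so the producer is forced down to the consumer’s speed.

```java
import java.util.concurrent.LinkedBlockingQueue;

// A bounded blocking queue between a fast producer and a slow consumer.
// The producer's put() blocks whenever the queue is full, which is exactly
// the backpressure behavior described above.
public class BlockingQueueBackpressure {
  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>(10); // max capacity 10

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < 100; i++) {
          queue.put(i);  // blocks when the queue is full
          System.out.println("produced " + i + ", queue size " + queue.size());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    Thread consumer = new Thread(() -> {
      try {
        for (int i = 0; i < 100; i++) {
          int event = queue.take();  // blocks when the queue is empty
          Thread.sleep(100);         // simulate a slow downstream instance
          System.out.println("consumed " + event);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
  }
}
```

If you run it, the producer races ahead until the queue holds ten events and then advances only as fast as the consumer’s simulated 100 ms processing time.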

[Image: backpressure-in-streamwork-engine]

Backpressure in the streamwork engine: propagation

Slowing down the event dispatcher isn’t the end of the story. After the event dispatcher slows down, the same thing happens to the queue between it and the upstream instances. When this queue is full, all the instances of the upstream component will be affected. In the diagram below, we zoom in a little more than usual so we can see the blocking queue in front of the event dispatcher, which is shared by all the upstream instances.

[Image: backpressure-propagation]

When there is a fan-in in front of this component, meaning the downstream component has multiple direct upstream components, all of those upstream components will be affected, because their events all flow into the same blocking queue.

[Image: backpressure-fan-in]

Our streaming job during backpressure

Let’s look at how the fraud detection job is affected by backpressure in our streamwork engine when one score aggregator instance has trouble catching up with the incoming traffic.
At the beginning, only the score aggregator runs at a lower speed. Later, the upstream analyzers are slowed down because of the backpressure. Eventually, it bogs down all your processing power, and you’re stuck with an underperforming job until the issue goes away.

[Image: streaming-job-during-backpressure]

Backpressure in distributed systems

Overall, in a local system it is fairly straightforward to detect and handle backpressure with blocking queues. In distributed systems, however, things are more complicated. Let’s discuss them in two steps:

  • Detecting busy instances
  • The backpressure state

Detecting busy instances

As the first step, it is important to detect busy instances so that the system can react proactively. We mentioned in chapter 2 that the event queue is a widely used data structure in streaming systems for connecting processes. Although unbounded queues are normally used, monitoring the size of the queues is a convenient way to tell whether an instance can keep up with the incoming traffic. More specifically, there are at least two different units of length we can use to set the threshold:

  • The number of events in the queue
  • The memory size of the events in the queue

When either the number of events or the memory size reaches its threshold, there is likely an issue with the connected instance, and the engine declares a backpressure state.
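
As a rough sketch of this check (the class name and thresholds are invented here, not taken from any specific engine), a detector only needs the queue itself and the engine’s own accounting of how many bytes are sitting in it:

```java
import java.util.Queue;

// Illustrative detector with the two threshold units described above:
// the number of events in the queue, and the memory size of those events.
public class BackpressureDetector {
  private final long maxEventCount;   // threshold on the number of events
  private final long maxQueueBytes;   // threshold on the memory size of the events

  public BackpressureDetector(long maxEventCount, long maxQueueBytes) {
    this.maxEventCount = maxEventCount;
    this.maxQueueBytes = maxQueueBytes;
  }

  // queueBytes would come from the engine's own accounting of serialized
  // event sizes; Java queues don't report memory usage directly.
  public boolean shouldDeclareBackpressure(Queue<?> incomingQueue, long queueBytes) {
    return incomingQueue.size() >= maxEventCount || queueBytes >= maxQueueBytes;
  }
}
```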

[Image: detecting-backpressure]

Backpressure state

After a backpressure state is declared, similar to the streamwork engine, we want to slow down the incoming events. However, this task is often much more complicated in distributed systems than in local systems, because the instances could be running on different computers or even in different locations. Therefore, instead of slowing the incoming events down, streaming frameworks typically stop them to temporarily give the busy instance room to breathe, by:

  • Stopping the instances of the upstream components, or
  • Stopping the instances of the sources

Although much less popular, we would also like to cover another option later in this chapter: dropping events. This option may sound undesirable, but it can be useful when end-to-end latency is more critical and losing events is acceptable. Basically, between the two options, it is a trade-off between accuracy and latency. The two options are explained in the diagram below. We’ve added a source instance to help with the explanation, and left out the details of some intermediate queues and event dispatchers for brevity.

[Image: slow-down-backpressure]

Backpressure: stopping the sources

Performing a stop at the source component is probably the most straightforward way to relieve backpressure in distributed systems. It lets us drain the incoming events to the slow instance, as well as to all the other instances in the streaming job, which can be desirable when there are likely multiple busy instances.
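
One simple way to picture this (purely illustrative; the names below are not from any specific framework) is a source that checks a shared flag before emitting, and simply waits while a backpressure state is declared:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative "stop the sources" sketch: the engine flips a shared flag when
// it declares a backpressure state, and the source stops emitting until the
// flag is cleared.
public class PausableSource {
  private final AtomicBoolean backpressure = new AtomicBoolean(false);

  // Called by the engine when backpressure is declared or relieved.
  public void setBackpressure(boolean active) {
    backpressure.set(active);
  }

  public void runOnce() throws InterruptedException {
    if (backpressure.get()) {
      Thread.sleep(50);   // don't emit; wait and check again
      return;
    }
    String event = readFromExternalSystem();  // hypothetical helper
    emit(event);                              // hypothetical helper
  }

  private String readFromExternalSystem() { return "transaction"; }
  private void emit(String event) { /* hand the event to the dispatcher */ }
}
```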

[Image: stopping-sources-backpressure]

Backpressure: stopping the upstream components

Stopping the incoming events can also be implemented at the component level. This is a somewhat more fine-grained approach than the previous one. The hope is that only specific components or instances are stopped, instead of all of them, and that the backpressure can be relieved before it propagates widely. If the issue lasts long enough, the source component will eventually be stopped anyway. Note that this option can be more complicated to implement in distributed systems and has higher overhead.
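
As a sketch of this idea (again with invented names, and heavily simplified compared to a real engine), a controller could pause only the direct upstream components of the busy one first, then escalate to pausing the sources if the backpressure state lasts too long:

```java
import java.util.List;
import java.util.Map;

// Illustrative component-level backpressure controller: pause the direct
// upstream components first, and pause the sources too if the backpressure
// state persists past a deadline.
public class ComponentLevelBackpressure {
  private final Map<String, List<String>> upstreams; // component -> direct upstream components
  private final List<String> sources;                // source components of the job
  private final long escalateAfterMillis;
  private long backpressureStartMillis = -1;

  public ComponentLevelBackpressure(Map<String, List<String>> upstreams,
                                    List<String> sources,
                                    long escalateAfterMillis) {
    this.upstreams = upstreams;
    this.sources = sources;
    this.escalateAfterMillis = escalateAfterMillis;
  }

  // Called when the queue in front of busyComponent declares backpressure.
  public void onBackpressure(String busyComponent, long nowMillis) {
    if (backpressureStartMillis < 0) {
      backpressureStartMillis = nowMillis;
    }
    // First, stop only the direct upstream components.
    for (String upstream : upstreams.getOrDefault(busyComponent, List.of())) {
      pause(upstream);
    }
    // If the issue lasts long enough, stop the sources as well.
    if (nowMillis - backpressureStartMillis >= escalateAfterMillis) {
      sources.forEach(this::pause);
    }
  }

  private void pause(String component) {
    System.out.println("pausing instances of " + component);
  }
}
```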

[Image: stopping-upstream-components-backpressure]

Relieving backpressure

After a job has been in a backpressure state for a while and the busy instance has (hopefully) recovered, the next important question is: when does the backpressure state end so that the traffic can be resumed?
The solution shouldn’t be a surprise, as it is very similar to the detection step: monitoring the size of the queues. Whereas detection checks whether the queue is “too full,” this time we check whether the queue is “empty enough,” meaning the number of events in it has dropped below a low threshold and there is enough room for new events now.
Note that relieving backpressure doesn’t mean the slow instance has recovered. It simply means that there is room in the queue for more events.
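
Putting detection and relief together, a common pattern is a pair of watermarks with a gap between them, so the state doesn’t flip back and forth on every event. The sketch below is illustrative only; the names and numbers are not from a specific framework.

```java
import java.util.Queue;

// Illustrative backpressure state with two thresholds: declare at the high
// watermark ("too full"), relieve at the low watermark ("empty enough").
public class BackpressureState {
  private final int highWatermark;  // declare backpressure at or above this size
  private final int lowWatermark;   // relieve backpressure at or below this size
  private boolean active = false;

  public BackpressureState(int highWatermark, int lowWatermark) {
    this.highWatermark = highWatermark;
    this.lowWatermark = lowWatermark;
  }

  // Call periodically with the monitored incoming queue.
  public boolean update(Queue<?> incomingQueue) {
    int size = incomingQueue.size();
    if (!active && size >= highWatermark) {
      active = true;    // queue is "too full": stop the incoming events
    } else if (active && size <= lowWatermark) {
      active = false;   // queue is "empty enough": resume the incoming events
    }
    return active;
  }
}
```

Keeping the low watermark well below the high one gives the slow instance some headroom each time traffic resumes, instead of immediately tripping backpressure again.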

[Image: relieving-backpressure]

One important fact to keep in mind is that backpressure is a passive mechanism designed to protect the slow instance and the whole system from more serious problems (like crashing). It doesn’t address any problem in the slow instance or make it run faster. As a result, backpressure could be triggered again if the slow instance still can’t catch up after the incoming events are resumed. We will take a closer look at the thresholds for detecting and relieving backpressure first, and then discuss this problem.

You can read the rest of this material in the book Grokking Streaming Systems. You can find it here: https://www.manning.com/books/grokking-streaming-systems.

Top comments (1)

Josh Fischer

Special "thank you" to Kit Menke, who called out an image with some annotated text that wasn't correct.

The original annotated text read:
"After too many events are accumulated in the queue, a backpressure event should happen to "push back" events from the upstream components."

With his feedback I've since changed it to:
"After too many events are accumulated in the queue, a backpressure event should happen to "slow down" events from the upstream components."

The change from "push back" to "slow down" was needed as "slow down" is more correct.