
ChunTing Wu


Stream Processing Introduction

The purpose of this article is to introduce stream processing in an easy-to-understand way, so it does not dive deep into the technical details of any specific framework, although a few popular frameworks will be used as illustrative examples.

Before we start to introduce stream processing, let's talk about its opposing concept: batch processing.

Batch processing

I believe most of you are familiar with batch processing: after data has accumulated for a given period, a large-scale computation is performed over it and a final result is produced. This is how batch processing operates, and its area of expertise is processing a fixed amount of data.

However, in an event-driven architecture the data, i.e., the events, are endless, so it is difficult to define a fixed amount. The compromise is to replace the fixed amount with a fixed time: after a given time interval the amount of data becomes predictable, and batch processing can be applied.

In other words, batch processing can be summarized in several characteristics.

  1. Fixed. Whether it is fixed time or fixed volume, the key is to make the amount of data predictable.
  2. Discrete. The data is computed in a specific interval, and the intervals are not related to each other at all.

The most typical approach is to process the data accumulated in each second/minute/hour/day.
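To make this concrete, here is a minimal sketch in plain Python (not any particular framework; the `(timestamp, value)` record format and the sum aggregation are illustrative assumptions) that groups records into fixed time intervals and processes each batch independently:

```python
from collections import defaultdict

def batch_by_interval(records, interval_seconds):
    """Group (timestamp, value) records into fixed time intervals,
    then aggregate each batch independently."""
    batches = defaultdict(list)
    for ts, value in records:
        # bucket start: the interval this record's timestamp falls into
        batches[ts - (ts % interval_seconds)].append(value)
    # each interval is processed on its own; intervals are unrelated
    return {start: sum(values) for start, values in sorted(batches.items())}

# records spanning a little over two minutes, batched per minute
print(batch_by_interval([(0, 1), (30, 2), (75, 3), (130, 4)], 60))
# → {0: 3, 60: 3, 120: 4}
```

Note the two characteristics above: the interval makes the data volume predictable, and each bucket is aggregated with no knowledge of its neighbors.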

Stream processing

However, there is a difference in the approach of stream processing.

First of all, there is no concept of fixation in stream processing. Imagine it as a river: water keeps flowing down from upstream, and if you don't catch it where you are, it flows away.

The same is happening with data. The data source keeps generating data into the stream, and the application has to retrieve what it needs in the stream and use it.

Unlike batch processing, with its fixed time or fixed volume, there is no given time interval, so the amount of data can be considered essentially infinite. On the other hand, since the data is not divided into intervals by batching, there is context between the data.

We can also summarize several characteristics for stream processing, which are exactly the opposite of batch processing.

  1. Infinite. In the absence of a specified time interval, both the amount of data and the time can be considered infinite.
  2. Continuous. The data in the data stream are related to each other.

Nevertheless, applications do not work that way, or perhaps it should be said that the human brain cannot understand the concept of continuous infinity. When we develop an application, we always give the input, compute it, and produce the output.

Therefore, when we develop stream processing applications, we apply techniques to make the data stream fit this mental model as much as possible; in other words, we slice the data stream. This slicing technique is called windowing.

There are three common types of windows.

  1. Tumbling window
  2. Sliding window
  3. Session window

Tumbling window

The first method of data slicing is the tumbling window, also known as micro-batching.


Diagram: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#tumbling-windows

The continuous data is divided into intervals of a fixed length, and each interval is processed independently. This is the same concept as batch processing; the only difference is the length of the interval, which in batch processing is much larger than in a tumbling window.

In this way, data streams can be processed in a batch manner. Most of the streaming frameworks support this technique, e.g. Apache Spark Streaming, Apache Flink, Apache Samza, etc.
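As a rough sketch of the idea (plain Python rather than any framework's API; the per-window event count is an illustrative aggregation), a tumbling window assigns every event to exactly one fixed-size window and updates that window's state incrementally as events arrive:

```python
class TumblingWindowCounter:
    """Counts events per fixed-size tumbling window, updating
    incrementally as each event arrives."""

    def __init__(self, size):
        self.size = size
        self.counts = {}  # window start -> event count

    def on_event(self, ts):
        start = ts - (ts % self.size)  # each event maps to exactly one window
        self.counts[start] = self.counts.get(start, 0) + 1
        return (start, start + self.size)  # the window this event fell into

counter = TumblingWindowCounter(60)
counter.on_event(5)    # falls into window (0, 60)
counter.on_event(30)   # same window
counter.on_event(70)   # next window (60, 120)
print(counter.counts)  # → {0: 2, 60: 1}
```

The difference from the batch version is that the state is updated event by event instead of waiting for the whole interval's data to be collected first.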

Sliding window

The second approach is the sliding window.


Diagram: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#sliding-windows

Given a window size and a window slide (how far the window moves each step), sliding windows are created. As the diagram shows, windows can overlap, so the same data may appear in several windows, which makes it possible to discover the context of the data.

The tumbling window mentioned in the previous section is actually a special case of a sliding window. When the window size is equal to the window slide, the sliding window becomes a tumbling window.
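A small sketch of this relationship (plain Python; the window-start arithmetic follows the common convention of aligning window starts to multiples of the slide): an event belongs to every window whose span covers its timestamp, and when the size equals the slide only one such window exists:

```python
def sliding_windows(ts, size, slide):
    """Return the (start, end) of every sliding window containing
    the event at timestamp ts. Windows start at multiples of slide."""
    windows = []
    start = ts - (ts % slide)  # the latest window start at or before ts
    while start > ts - size:   # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# size 10, slide 5: the event at t=7 overlaps two windows
print(sliding_windows(7, 10, 5))   # → [(0, 10), (5, 15)]
# size == slide: the sliding window degenerates into a tumbling window
print(sliding_windows(7, 10, 10))  # → [(0, 10)]
```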

This is a very common data slicing pattern in streaming, but not every streaming framework supports it; Apache Spark has streaming capabilities, but it only supports micro-batches.

Session window

The third approach is the session window.


Diagram: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#session-windows

Unlike the previous two methods, the session window does not have a fixed window size; instead, it waits until the data stops arriving for a period of time, called the session gap.

After the data has stopped for that period, the accumulated segment is processed, so the real-time performance is slightly worse than that of the previous two methods. However, since the data has been grouped and accumulated in advance, it is easier to process.

In addition, the session window provides extra information: the amount of data collected indicates the data density at that point.
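The grouping logic can be sketched in plain Python (the gap value and timestamps are illustrative; real frameworks track the gap with timers over possibly out-of-order streams): events are merged into one session until a silence longer than the gap appears, and each session's length reflects the data density:

```python
def sessionize(timestamps, gap):
    """Group event timestamps into sessions: a new session starts
    whenever the silence between two events exceeds the session gap."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

sessions = sessionize([1, 2, 3, 10, 11, 30], gap=5)
print(sessions)                    # → [[1, 2, 3], [10, 11], [30]]
print([len(s) for s in sessions])  # density per session → [3, 2, 1]
```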

Again, this is not supported by every streaming framework, and Apache Spark does not support it.

Summary

It sounds like stream processing is the same as traditional message queue processing, where a broker stores the message and assigns it to a handler responsible for execution.

Is the difference only that one is a single message and another is a data stream?

In fact, this is only one of the differences. Stream processing has three advantages over traditional message workers.

  1. Understand the context of the data. Although this depends on the window setting, due to the nature of data streams, they are able to express much more semantically than a single message.
  2. Stream processing has state. To enable faster processing of message streams, streaming frameworks provide internal state management and also state persistence.
  3. Simultaneous processing of concurrent data streams. In a message queue worker architecture, it would be a challenge to handle multiple messages simultaneously, but in a stream processing framework, it is possible. Furthermore, the framework provides more methods for handling multiple event streams, such as join operations.
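As an illustration of the third point, here is a heavily simplified join sketch in plain Python (the click/purchase streams, the field layout, and the in-memory buffering are all assumptions for illustration; real frameworks keep such buffered state in managed, fault-tolerant stores): two keyed streams are joined when events share a key and occur within a time window of each other:

```python
from collections import defaultdict

def windowed_join(clicks, purchases, window):
    """Join two (timestamp, key, payload) streams when events share
    a key and fall within `window` seconds of each other."""
    clicks_by_key = defaultdict(list)  # buffered state, keyed for lookup
    for ts, key, payload in clicks:
        clicks_by_key[key].append((ts, payload))
    joined = []
    for ts, key, payload in purchases:
        for click_ts, click_payload in clicks_by_key[key]:
            if abs(ts - click_ts) <= window:
                joined.append((key, click_payload, payload))
    return joined

clicks = [(1, "u1", "ad42")]
purchases = [(3, "u1", "sku7"), (100, "u1", "sku8")]
print(windowed_join(clicks, purchases, window=5))
# → [('u1', 'ad42', 'sku7')]
```

The second purchase falls outside the window and is dropped, showing why the window setting determines how much context a join can express.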

Overall, stream processing can be regarded as an advanced version of the message queue worker architecture.

There are two core components in this stream processing architecture: Kafka as the broker, and the streaming framework.

Kafka provides very good functionality, including fault tolerance, persistence, and message ordering guarantees, while also offering very high throughput.

On the other hand, the streaming framework provides a higher level of abstraction on top of Kafka, making it easier for developers to get started; this includes the various windowing methods mentioned in this article and concurrent event stream processing.

More importantly, the streaming framework can provide an exactly-once guarantee. The message queue worker architecture often offers only at-least-once, but a streaming framework can further support exactly-once.

Conclusion

Traditional data processing architectures are very complex and require various technologies: ETL is needed to extract and transform data from various data stores into a unified data warehouse or data lake, and batch processing methods such as MapReduce must be used to transform data into a useful format for analysis.

To address the real-time limitations of batch processing, different message queue processing frameworks are integrated to merge the batch dataset with the real-time dataset under specific real-time requirements.

The entire data processing architecture involves a variety of storage, message queue, scheduling, and processing technologies. It is difficult to master all of them, and new requirements keep stacking on top of this complex architecture, which also hurts productivity.

Streaming architectures were created to address these complexities. It is possible to use only two core components, Kafka and the streaming framework, to accomplish various feature requirements that previously required various technology stacks.

Of course, no technical selection is perfect.

To me, the biggest problem with stream processing is that it's not easy, either to understand or to develop. As described earlier in this article, the concept of continuous infinity is not intuitive to humans. Even if it is possible to collapse the stack of techniques into a single solution, it requires developers to have more experience and overcome a learning curve. Moreover, when problems are encountered that cannot be solved by a framework, there is still a need to mix solutions.

Nevertheless, the streaming framework still offers a good entry point to meet most development needs with minimal overhead.

The next article will introduce a common streaming design pattern: enrichment. Let's call it a day.
