SeattleDataGuy

Posted on Jan 28, 2020

Why Should You Use Streaming Data Platforms Like Kafka?

#database #devops #codenewbie

Back in my day, databases and applications used to only sync late at night while everyone was asleep.

Now in the modern era, everyone expects their data the second it's updated (if not somehow magically before the data occurs).

Large corporations and Fortune 500 companies depend on this data to be able to predict consumer tastes or estimate where the forces of demand and supply are moving the market.

Meanwhile, the average person depends on data for everything from placing calls to booking a flight. Having this data as up to date as possible can save or make companies hundreds of millions of dollars.

Because they can speed up the decision-making process, streaming-data systems can help analysts, machine learning researchers, and data scientists develop metrics and models that feed off live data.

What Is Distributed Stream Processing?

Simply put, distributed stream processing refers to data-processing engines specifically designed to work on infinite (unbounded) datasets.

These datasets are essentially what they sound like: large streams of data that'll always contain more data as long as the internet exists.
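
To make the idea of an infinite dataset concrete, here is a minimal sketch in plain Python. It is not tied to any streaming framework, and the names `sensor_readings` and `running_mean` are made up for illustration. The point is that the consumer never waits for the data to "finish." It updates a small piece of state on every event and emits a result as it goes.

```python
import itertools
import random
from typing import Iterator, Tuple


def sensor_readings(seed: int = 0) -> Iterator[float]:
    """Hypothetical unbounded source: yields readings forever."""
    rng = random.Random(seed)
    while True:
        yield rng.uniform(15.0, 25.0)


def running_mean(stream: Iterator[float]) -> Iterator[Tuple[int, float]]:
    """Emit (count, mean) after every event while keeping constant-size state."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no history stored
        yield count, mean


if __name__ == "__main__":
    # The source never ends, so we only look at the first five results.
    for count, mean in itertools.islice(running_mean(sensor_readings()), 5):
        print(f"after {count} events: mean={mean:.2f}")
```

A batch job would load every reading first and then compute the mean once. Here the answer is always available and simply gets refined as more events arrive.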

Streaming allows unbounded data to be processed continuously, in real time, for as long as the system is up and running. The truth of the matter is that these types of systems can be hard to implement, and even harder to maintain, because it often takes more than just good software.

They also need to be paired with well-designed networks that help make the system fault-tolerant while still processing data efficiently.

Because of this, there are lots of open-source tools as well as managed services that help engineers abstract away much of this complexity and get the benefits of streaming through higher-level modules.

Before we go into that, let's talk about a few reasons you might decide to use a streaming service.

Characteristics of Distributed Data Streaming

Distributed data streaming comes with some very important features that form the basis of its strengths and weaknesses. Some of these are as follows:

Fault tolerance

In the case of node or network failure, this engine can recover quickly and resume processing from the point where it stopped. To do so, these types of frameworks typically checkpoint the state of the stream from time to time (the interval can sometimes be configured).
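
The checkpoint-and-resume idea can be sketched in a few lines of plain Python. This is a toy model under our own assumptions, not how any particular framework stores its state: the function name `process_with_checkpoints`, the JSON file format, and the `fail_at` switch used to simulate a crash are all hypothetical.

```python
import json
import os


def process_with_checkpoints(events, checkpoint_path, every=3, fail_at=None):
    """Sum `events`, saving (offset, total) to disk every `every` events.

    If a checkpoint file already exists, processing resumes from it.
    `fail_at` simulates a node failure just before that offset is processed.
    """
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # restore the last consistent snapshot

    for offset in range(state["offset"], len(events)):
        if fail_at is not None and offset == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += events[offset]
        state["offset"] = offset + 1
        if state["offset"] % every == 0:
            # Write to a temp file first so a crash never leaves a half-written checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["total"]
```

If a run crashes partway through, calling the function again with the same `checkpoint_path` picks up from the last saved offset. Any events processed after that checkpoint are simply processed again, and because the running total is restored together with the offset, the final result is still correct. A shorter checkpoint interval means less rework after a failure but more time spent writing state.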

Performance

Performance in distributed streaming covers latency, throughput, and scalability. Ideally, this type of engine keeps latency as low as possible while sustaining high throughput as the load grows.

Guaranteed delivery

Most importantly, these systems define how many times each record is guaranteed to be delivered and processed, even when failures occur. The usual options are at-least-once, at-most-once, and exactly-once.
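
The difference between these guarantees mostly comes down to when a consumer records its progress relative to when it does the work. The simulation below is a sketch of that idea in plain Python. The `consume` function and its parameters are invented for illustration and do not correspond to any real client API.

```python
def consume(messages, commit_before_processing, crash_at):
    """Simulate a consumer that crashes once, between its two steps, then restarts.

    Returns the list of (offset, message) pairs that were actually processed.
    """
    processed = []
    committed = 0      # offset the consumer resumes from after a restart
    crashed = False
    while committed < len(messages):
        offset = committed
        if commit_before_processing:
            committed = offset + 1                  # at-most-once: commit first
            if offset == crash_at and not crashed:
                crashed = True
                continue                            # restart: this message is lost
            processed.append((offset, messages[offset]))
        else:
            processed.append((offset, messages[offset]))  # at-least-once: work first
            if offset == crash_at and not crashed:
                crashed = True
                continue                            # restart: this message is re-read
            committed = offset + 1
    return processed


msgs = ["a", "b", "c"]
at_most_once = consume(msgs, commit_before_processing=True, crash_at=1)
at_least_once = consume(msgs, commit_before_processing=False, crash_at=1)
# Writing into a sink keyed by offset makes the duplicate harmless.
exactly_once_effect = dict(at_least_once)
```

With the crash at offset 1, the at-most-once run skips `"b"` and the at-least-once run processes `"b"` twice. Exactly-once behavior is often approximated the way the last line does it: accept at-least-once delivery and make the processing idempotent so that duplicates have no extra effect.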

Three Distributed Data Streaming Systems

There are plenty of options when it comes to streaming systems. Personally, we prefer not following every fad. Also, we think it's important to at least understand where a lot of data streaming systems started.

So here are three options you can consider.

Apache Storm

https://storm.apache.org/

What is Storm?

Storm is a popular distributed real-time computation system for big data that uses a simple processing model to support powerful abstractions. This framework, which Twitter made an open-source project, has been touted as the real-time Hadoop.

It can be used to process new data or to update a database. Storm can also wait for invocation messages and, when one is received, compute it as a query and construct the results.

What is unique about Storm?

This software was developed by Nathan Marz in 2011 to achieve higher throughput across multiple nodes while responding in a fraction of a second.

Storm processes data with a latency of just a few milliseconds, which makes it well suited to real-time workloads. Reliability is another factor that helps Storm stand out as a real-time computation system.

Apache Storm is based on the principle of "fail fast, auto restart," which allows it to restart a failed process without disturbing the entire operation when a node fails. This approach makes it fault-tolerant.
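
As a rough illustration of "fail fast, auto restart" (and only that; this is not Storm's actual supervisor code), the sketch below lets a worker crash immediately on any error and has a small supervising loop start it again. The names `supervise` and `flaky_worker` are hypothetical.

```python
def supervise(worker, max_restarts=3):
    """Run worker(attempt); if it raises, restart it, up to max_restarts times.

    Returns (result, number_of_restarts). The worker itself does no error
    handling: it fails fast and leaves recovery to the supervisor.
    """
    attempt = 0
    while True:
        try:
            return worker(attempt), attempt
        except Exception:
            if attempt >= max_restarts:
                raise  # give up and surface the failure
            attempt += 1


def flaky_worker(attempt):
    """Hypothetical worker that crashes on its first two runs."""
    if attempt < 2:
        raise RuntimeError(f"worker crashed on attempt {attempt}")
    return "done"


result, restarts = supervise(flaky_worker)
```

Here `flaky_worker` succeeds on its third run, so `supervise` returns after two restarts. Keeping recovery logic out of the worker keeps the worker simple, and the rest of the system keeps running while one piece is being restarted.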

Besides, the standard configuration of Storm makes it suitable for production right away. The technology is user-friendly and robust, which has made it popular among small and medium enterprises as well as large organizations.

Flink

https://flink.apache.org/

What is Flink?

Apache Flink is another popular open-source distributed data streaming engine that performs stateful computations over bounded and unbounded data streams. This framework is written in Scala and Java and is ideal for complex data-stream computations.

With continuous stream processing, Flink processes data in keyed or non-keyed windows.
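
To show what a keyed window means, here is a small sketch in plain Python. It deliberately does not use Flink's API. The function `tumbling_window_counts` and the `(timestamp, key)` event shape are assumptions made for the example. Each event is assigned to a fixed-size, non-overlapping (tumbling) window based on its timestamp, and counts are kept separately per key.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key).

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Windows are fixed-size and non-overlapping: [0, size), [size, 2*size), ...
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


clicks = [(1, "alice"), (4, "bob"), (9, "alice"), (11, "alice"), (12, "bob")]
per_user = tumbling_window_counts(clicks, window_size=10)
# Non-keyed variant: give every event the same key so each window has one counter.
overall = tumbling_window_counts([(t, "all") for t, _ in clicks], window_size=10)
```

In the keyed case each user gets an independent counter per window, which is what lets an engine spread the keys across many machines. In the non-keyed case all events in a window are aggregated together.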

What is unique about Flink?

This system is easy to install and can start working with just one command on the command-line interface.

Flink is popular in the machine learning and data analytics fields, where it can be paired with Gelly, its graph-processing API, to build data-flow programs. Flink supports timestamping, which makes it convenient to roll back or replay a job.

It uses savepoints to help ensure that correct results are provided across failures, such as a node crash. This framework processes both batch and stream data, so it's ideal for both individual records and batches of data.

Flink is also considered a great alternative to MapReduce, as it's designed to run stateful streaming at any scale. This framework is independent of Hadoop, but it can be integrated with Hadoop to store, write, or process data.

Kafka

https://dzone.com/articles/what-is-kafka

What is Kafka?

The Apache Kafka framework is a distributed publish-subscribe messaging system which receives data streams from disparate source systems.

This software is written in Java and Scala. It's used to handle real-time streams of big data that feed real-time analysis. The system is not only scalable, fast, and durable but also fault-tolerant.

Owing to its high reliability and throughput, Kafka is widely used for tracking service calls and IoT sensor data.
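
The publish-subscribe model is easier to picture with a toy version. The class below is an in-memory sketch written for this post, not Kafka's client API, and the names `Topic`, `publish`, and `poll` are our own. It models a topic as an append-only log in which each group of subscribers keeps its own read position.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log; each subscriber group tracks its own read offset."""

    def __init__(self):
        self._log = []
        self._offsets = defaultdict(int)  # group name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for `group` and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
topic.publish("order-created")
topic.publish("order-paid")
analytics_batch = topic.poll("analytics")            # reads both messages
alerts_batch = topic.poll("alerts", max_messages=1)  # reads only the first
```

Because messages stay in the log after being read, the `analytics` and `alerts` groups consume the same data independently and at their own pace. Publishers do not need to know who is reading, which is what lets disparate source systems feed many downstream consumers.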

A brief history of Kafka

So who uses Kafka? Well, it originated at LinkedIn as a mechanism for loading data into Hadoop systems in parallel. Later, in 2011, it became an open-source project under Apache, and LinkedIn now uses it to track operational metrics and activity data. Twitter also uses it, paired with Storm, to build a stream-processing infrastructure.

What makes Kafka stand out as software?

Kafka is our personal favorite distributed data streaming system because of its operational simplicity. Amazon also offers a managed-service version of Kafka, which makes it much easier to implement in your AWS stack.

Newer versions of Kafka not only offer disaster recovery to improve how client applications are handled but also reduce the reliance on Java for data-streaming analytics. Overall, this is the service we personally find easiest to manage.

But Is Streaming Worth It?

Streaming data tools can provide a lot of benefits depending on the use case. They make it possible to manage and process data live.

This can lead to better notifications and decision making.

In addition, the ability to stream and analyze data can allow machine-learning models to provide much better outputs.

Although these systems are often much more difficult to implement than daily batch jobs, there are many cases in which the ROI is worth it.

We hope this helped prime you for the different options you have for streaming tools.

Good luck in your development.

If you want to read more:
Airbnb's Airflow Vs Spotify's Luigi

Automating File Loading Into SQL Server With Python And SQL

5 Skills Every Software Engineer Needs Based Off Of A Job Description

The Top 10 Big Data Courses, Hadoop, Kafka And Spark

Data Science Use Cases That Are Improving the Finance Industry

Data Science Consulting: How To Get Clients

Top comments (3)

Joe Zack • Jan 30 '20

I'm always happy to see more posts on streaming. Real time centric apps are such a great user experience, and apps like Uber are spoiling my users into expecting that kind of experience from me. I hope one day that a streaming app will be the new HelloWorld/Twitter/ToDo list.

Víctor Gil • Feb 15 '20

I could not agree more! There is a need for more examples of real-time applications because designing and coding them requires a paradigm shift, which is never a straightforward process.
BTW, I have written one.

Helen Anderson • Jan 29 '20

Nice article, thanks for writing it up.

It's easy to get confused with so many tools claiming to do very similar jobs :O