In the modern era, everyone expects their data the second it's updated (if not, somehow, before it's even generated).
Fortune 500 companies and other large corporations depend on this data to predict consumer tastes and to estimate where supply and demand are moving the market. Real-time data is a core part of a modern data strategy.
In turn, many companies are working to modify their batch-style data pipelines into real-time data streams. Real-time data streams provide the ability for analysts, machine learning researchers, and data scientists to develop metrics and models that run as soon as new data is created.
This has become a useful solution for companies that manage manufacturing operations, stream movies, or need to detect issues in system logs.
Real-time analytics are becoming more popular as well as more feasible for companies of all sizes, as the cloud provides various tools that can be quickly implemented.
We will be talking about a few of these companies later on, but let's start with two of the classics.
Kinesis is a managed streaming service on AWS. Being fully managed gives Kinesis several advantages over some of the other tools on this list: your team spends less time managing infrastructure components and services and more time on development. Kinesis lets you ingest videos, IoT telemetry, application logs, and just about any other data format live. That means you can run various processes and machine learning models on the data as it flows through your system, instead of routing it through a traditional database first.
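To make that concrete, here is a minimal sketch of pushing a log event into a Kinesis stream with boto3. The stream name and event fields are hypothetical, and running it for real assumes AWS credentials and an existing stream:

```python
import json

def build_record(event: dict, partition_key: str) -> dict:
    """Package an event as the keyword arguments that put_record expects."""
    return {
        "Data": json.dumps(event).encode("utf-8"),  # Kinesis payloads are raw bytes
        "PartitionKey": partition_key,              # determines which shard receives it
    }

def send_to_kinesis(stream_name: str, event: dict, partition_key: str) -> None:
    """Send one event to a Kinesis stream (requires AWS credentials)."""
    import boto3  # imported here so build_record works even without boto3 installed
    client = boto3.client("kinesis")
    client.put_record(StreamName=stream_name, **build_record(event, partition_key))

# Usage (hypothetical stream name):
# send_to_kinesis("app-logs", {"level": "ERROR", "msg": "disk full"}, "host-42")
```

Downstream consumers (Lambda functions, Kinesis Data Analytics, or your own models) can then read records off the stream as they arrive.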
AWS Kinesis also has clear support from companies like Netflix. They use Kinesis to process multiple terabytes of log data every day. This is made easier by the fact that Kinesis is a managed service.
Photo from Apache Kafka.
The Apache Kafka framework is a distributed publish-subscribe messaging system that receives data streams from disparate source systems.
This software is written in Java and Scala and handles real-time streams of big data for real-time analysis. The system is not only scalable, fast, and durable but also fault-tolerant.
Owing to its high reliability and throughput, Kafka is widely used for tracking service calls and IoT sensor data.
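The publish-subscribe model at Kafka's core is simple to state: producers publish messages to topics, and every subscriber of a topic receives those messages. The toy in-memory broker below illustrates the pattern only; it is not Kafka's API, and it has none of Kafka's persistence, partitioning, or fault tolerance:

```python
from collections import defaultdict

class ToyBroker:
    """Minimal in-memory publish-subscribe broker, for illustration only."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber of the topic receives every message (fan-out)
        for callback in self._subscribers[topic]:
            callback(message)

broker = ToyBroker()
received = []
broker.subscribe("sensor-readings", received.append)
broker.publish("sensor-readings", {"sensor": "s1", "temp": 21.5})
```

In a real deployment you would use a client library such as kafka-python, but the topic/producer/subscriber roles map one-to-one onto this sketch.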
So who uses Kafka? Well, it originated at LinkedIn as a mechanism for loading data in parallel into Hadoop systems. Later, in 2011, it became an open source project under Apache. LinkedIn now uses it to track operational metrics and activity data. Twitter also uses it --- paired with Storm --- to build a stream-processing infrastructure.
Kafka is our personal favorite distributed data streaming system because of its operational simplicity. Also, Amazon offers a managed version of Kafka (Amazon MSK), which makes it much easier to implement in your AWS stack.
Newer versions of Kafka not only offer disaster recovery to improve application handling for a client but also reduce the reliance on Java to work on data-streaming analytics. Overall, it feels like the easiest service to manage.
Going away from more of the classic real-time data solutions, we wanted to take a look at some of the newer startups that are trying to move into the streaming space. In particular, these real-time streaming solutions offer the ability to easily interact with the data in their streams using SQL.
Kafka and Kinesis also provide ways for you to interact with their data using forms of SQL. However, the tools below were developed to be SQL-compliant from the get-go.
Photo from Materialize.
Materialize is an SQL streaming database startup built on top of the open source Timely Dataflow project.
It allows users to ask questions of live streaming data, connecting directly to existing event streaming infrastructure (like Kafka) and client applications.
Engineers can interact with Materialize using a standard PostgreSQL interface, enabling plug-and-play integration of existing tooling.
When SQL queries are run, they are recast as dataflows. This lets users perform interactive data exploration and data warehouse-like analytics against live relational data, which is typically not possible with streaming data.
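Because Materialize speaks the PostgreSQL wire protocol, you can drive it with an ordinary Postgres driver. Here is a hedged sketch: the view, source, and connection string are hypothetical, and the exact SQL accepted varies by Materialize version, so treat it as an outline rather than copy-paste code:

```python
def materialized_count_sql(view_name: str, source: str, key: str) -> str:
    """Build a CREATE MATERIALIZED VIEW statement. Materialize keeps the
    view's results incrementally up to date as new events stream in."""
    return (
        f"CREATE MATERIALIZED VIEW {view_name} AS "
        f"SELECT {key}, COUNT(*) AS n FROM {source} GROUP BY {key}"
    )

def run_on_materialize(dsn: str, sql: str) -> None:
    """Execute SQL against Materialize over its Postgres-compatible interface."""
    import psycopg2  # any standard PostgreSQL driver should work
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)

# Usage (DSN, view, and source names are hypothetical):
# run_on_materialize("postgresql://materialize@localhost:6875/materialize",
#                    materialized_count_sql("order_counts", "orders", "status"))
```

Once the view exists, a plain `SELECT * FROM order_counts` returns up-to-the-moment results, which is the "data warehouse-like analytics on live data" idea in practice.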
Under the hood, Materialize uses Timely Dataflow (TDF) as the stream-processing engine. This allows Materialize to take advantage of the distributed data-parallel compute engine. The great thing about using TDF is that it has been in open source development since 2014 and has since been battle-tested in production at large Fortune 1000-scale companies.
"Our goal is really to help any business to understand streaming data and build intelligent applications without using or needing any specialized skills. Fundamentally what that means is that you're going to have to go to businesses using the technologies and tools that they understand, which is standard SQL." --- Arjun Narayan (via TechCrunch), co-founder and CEO of Materialize
Materialize also just got another round of funding, so they could be in for bigger and better things shortly.
Photo from Rockset.
Unlike some of the other real-time databases on this list, Rockset combines a database with a SQL engine that lets you query across multiple data sources in real time. For example, you can sit Rockset on top of your DynamoDB tables, Kafka streams, and MongoDB databases and query/join across all of them.
According to Rockset's docs, it "automatically indexes your data --- structured, semi-structured, geo and time series data --- for real-time search and analytics at scale."
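Rockset exposes its query capability over a REST API that accepts SQL. The sketch below builds such a request with the standard library; the endpoint URL, payload shape, and collection names are assumptions from memory, so check Rockset's API docs before relying on them:

```python
import json
import urllib.request

ROCKSET_QUERY_URL = "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries"  # assumed

def build_query_request(api_key: str, sql: str) -> urllib.request.Request:
    """Build an HTTP request asking Rockset to run a SQL query.
    URL and payload shape are illustrative assumptions."""
    body = json.dumps({"sql": {"query": sql}}).encode("utf-8")
    return urllib.request.Request(
        ROCKSET_QUERY_URL,
        data=body,
        headers={
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
    )

# A cross-source join like the one described above (collection names hypothetical):
# req = build_query_request(api_key, """
#     SELECT o.id, u.email
#     FROM kafka_orders o JOIN mongo_users u ON o.user_id = u.id
# """)
# urllib.request.urlopen(req)  # would actually send the query
```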
Also, Rockset provides a great UI for running your queries and several other features that are geared more towards developers.
Like Materialize, Rockset has also received a new round of funding and is currently hiring heavily. Those are all great signs in terms of progress.
Photo from Vectorized.
Vectorized is still on the newer side of streaming tools, as it just received $15.5 million in funding in January 2021.
The startup's entry into the crowded data management market is an open source stream-processing platform dubbed Redpanda. It aims to provide an alternative to the industry-standard Apache Kafka engine.
If you want to get a deeper explanation, you can hear from the founder of Vectorized, Alexander Gallego, as he discusses it in the Data Engineering Podcast.
In this podcast, he discusses how Redpanda was engineered as a drop-in replacement for Kafka. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces.
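"Drop-in replacement" here means your existing Kafka client code should work unchanged: because Redpanda implements the Kafka wire protocol, only the broker address you point at changes. A small sketch, with hypothetical broker addresses:

```python
def client_config(bootstrap_servers: str) -> dict:
    """Connection settings for a Kafka-protocol producer. The same config
    works for Kafka or Redpanda; only the broker address differs."""
    return {"bootstrap_servers": bootstrap_servers, "acks": "all"}

# With kafka-python, identical client code targets either system
# (broker addresses hypothetical):
# from kafka import KafkaProducer
# kafka_producer    = KafkaProducer(**client_config("kafka-broker:9092"))
# redpanda_producer = KafkaProducer(**client_config("redpanda-broker:9092"))
```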
It's a great listen if you want to hear about the driving factors behind this technology.
There are a lot of options when it comes to picking the right real-time solution. Here are a few others that your team might be interested in. The tools below will require a more technical understanding.
Photo from Apache Storm.
Storm is a popular distributed real-time computation system for big data that pairs a simple processing model with powerful abstractions. This framework, released as an open source project by Twitter, has been touted as the real-time Hadoop.
It can be used to process new data or update a database. Storm's distributed RPC feature waits for invocation messages and, when one is received, computes the corresponding query and returns the results.
This software was developed by Nathan Marz in 2011 to deliver high throughput across multiple nodes at sub-second latencies.
Storm offers latencies of just a few milliseconds for micro-batch processing, and that dependable low latency helps it stand out among real-time data-processing systems.
Apache Storm is built on the principle of "fail fast, auto restart." If a node fails, Storm can restart the process without disturbing the entire operation, which makes it fault-tolerant.
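The "fail fast, auto restart" idea can be shown with a toy supervisor in plain Python: rather than trying to patch up a failed task, crash immediately and start a fresh attempt. This is an illustration of the principle, not Storm's actual supervision code:

```python
def run_with_restart(task, max_restarts: int = 3):
    """Toy 'fail fast, auto restart' supervisor: if the task crashes,
    start a fresh run instead of trying to repair the failed one."""
    attempts = 0
    while True:
        try:
            return task(attempts)
        except Exception:
            attempts += 1
            if attempts > max_restarts:
                raise  # give up only after repeated failures

calls = []

def flaky(attempt):
    """Simulated task that fails on its first two runs."""
    calls.append(attempt)
    if attempt < 2:
        raise RuntimeError("node failed")  # fail fast
    return "ok"

result = run_with_restart(flaky)
```

In Storm itself, supervision happens at the level of workers and nodes across the cluster, but the recovery posture is the same: restart cleanly instead of limping along.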
The standard configuration of Storm makes it a great fit for production. The technology is user-friendly and robust, which has made it popular with small and medium enterprises as well as large organizations.
Photo from Apache Flink.
Apache Flink is another popular open source distributed data streaming engine that performs stateful computations over bounded and unbounded data streams. This framework is written in Scala and Java and is ideal for complex data stream computations.
With continuous stream processing, Flink processes data in the form of keyed or non-keyed windows.
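A keyed tumbling window, the simplest of these, partitions the stream by key and then groups each key's events into fixed-size time buckets. The plain-Python sketch below illustrates the concept only; real Flink code would use the DataStream API (Java/Scala or PyFlink):

```python
from collections import defaultdict

def keyed_tumbling_window(events, window_size):
    """Group (timestamp, key, value) events into fixed-size windows per key
    and sum the values: a plain-Python sketch of Flink's keyed windows."""
    windows = defaultdict(float)  # (key, window_start) -> running sum
    for ts, key, value in events:
        window_start = (ts // window_size) * window_size  # bucket the timestamp
        windows[(key, window_start)] += value
    return dict(windows)

events = [
    (0, "sensor-a", 1.0), (3, "sensor-a", 2.0),  # both land in window [0, 5)
    (6, "sensor-a", 4.0),                        # window [5, 10)
    (1, "sensor-b", 7.0),                        # window [0, 5), separate key
]
result = keyed_tumbling_window(events, window_size=5)
```

Flink additionally handles out-of-order events, watermarks, and distributed state, which a toy like this sidesteps entirely.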
This system is easy to install and can start working with just one command on the command-line interface.
Flink is most popular in the machine learning and data analytics fields, where it's paired with Gelly to create data-flow programming models. Flink supports timestamping, which makes it convenient to roll back or replay a job.
It uses savepoints to keep system operations correct across failures when a node crashes. The framework processes both streaming and batch data, so it's ideal for individual records and data batches alike.
Flink is also considered a great alternative to MapReduce, as it's designed to run stateful streaming at any scale. This framework is independent of Hadoop, but it can be integrated with Hadoop to store, write, or process data.
Here is the hard part: Which real-time analytics tool should you pick? It's difficult to provide a concrete answer without knowing your team's needs and goals.
But I will provide some perspective.
If you're a small company, you probably don't have the time or money to migrate your solution if one of these tools disappears for any reason. That is to say, if you pick a startup, you risk that solution folding and then having to migrate to another tool.
This could be very costly.
So if you do decide to pick a startup, I would try to lock in a good initial rate, at least until they have secured enough funding or been acquired by another company.
Larger companies can more easily take advantage of some of these startups because if the startup disappears, then they can have a few engineers quickly fix the problem.
At the end of the day, I am sure a few of the startups will make it. But you want to make sure you are ready in case they disappear.
Streaming data tools can provide a lot of benefits, depending on the use case. They make it possible to manage and process live data.
This can lead to better notifications and decision-making.
Also, the ability to stream and analyze data can allow machine learning models to provide much better outputs.
Although these systems are often much more difficult to implement compared to daily batch jobs/ETLs, there are many cases in which the ROI is worth it.
We hope this helped prime you for the different options you have for streaming tools.
Good luck with your development.
If you liked this article, then check out these videos and articles!