Life Beyond Kafka with Apache Pulsar

#bigdata #streaming #apache

During all my years as a Solution Architect, I have built many streaming architectures such as real-time data ETL, reactive microservices, log collection or even AI-driven services using Kafka as a core part of their architecture. Kafka is a proven stream-processing platform used for many years at companies like LinkedIn, Microsoft, and Netflix. In many cases Kafka works very well, supports large amounts of data, and has a good community. Because of that, Kafka is used for many streaming scenarios.

However, due to the design of Kafka, all of my projects using it have suffered from similar problems:

High latency
Poor scalability
Difficulty supporting a global architecture
High OpEx

Latency and throughput

Latency, or the delay before a transfer of data begins, can be a nightmare for anyone working with data-intensive applications. As IoT-enabled applications such as autonomous vehicles and industrial inspection become commonplace, the data generated by sensors will become too demanding for existing architectures. Maintaining low latency while keeping up with ever-growing throughput requirements becomes a big challenge. As a result, data takes longer to move from devices to data centers, and the user experience degrades sharply.

Apache Pulsar shows notable improvements in both latency and throughput in comparison to Kafka. Pulsar is approximately 2.5 times faster and has 40% lower latency than Kafka (*). Those differences are huge, and in critical systems they can mean the difference between success and failure.

Pulsar uses many techniques to improve performance. The most important one concerns tailing reads. When readers are only interested in the most recent data, they are served from an in-memory cache in the serving layer (the Pulsar brokers), and only catch-up readers have to be served from the storage layer (Apache BookKeeper). This approach is key to improving latency and throughput compared to systems such as Kafka.
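The split between tailing reads and catch-up reads can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyBroker`, its `cache_size` parameter and its methods are hypothetical names invented for this example, not part of the Pulsar API, and a plain list stands in for the storage layer.

```python
from collections import deque


class ToyBroker:
    """Hypothetical stand-in for a broker with a small in-memory tail cache."""

    def __init__(self, cache_size):
        self.storage = []                      # stand-in for the storage layer
        self.cache = deque(maxlen=cache_size)  # most recent (offset, message) pairs
        self.storage_reads = 0                 # how often the storage layer was hit

    def publish(self, message):
        offset = len(self.storage)
        self.storage.append(message)
        self.cache.append((offset, message))   # the oldest cached entry is evicted
        return offset

    def read(self, offset):
        if self.cache and offset >= self.cache[0][0]:
            # Tailing read: the entry is recent enough to still be in memory.
            return self.cache[offset - self.cache[0][0]][1]
        # Catch-up read: fall back to the storage layer.
        self.storage_reads += 1
        return self.storage[offset]


broker = ToyBroker(cache_size=3)
for i in range(10):
    broker.publish(f"msg-{i}")

tail = broker.read(9)      # served from the in-memory cache
catch_up = broker.read(2)  # served from the storage layer
```

In this model only the catch-up reader touches storage, so readers following the tail of the stream never pay for a storage read.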

If you want to dig deeper into the matter, Chris Bartholomew recently wrote a very good article benchmarking latency that compares Apache Pulsar and Kafka.

Scalability issues

Imagine you have thousands or millions of devices sending data to your data lake. This data must be managed with speed, security and reliability. In addition, for legal reasons you must partition data by country, device and city. These requirements seem reasonable, and in 2019 stream-processing platforms must be able to deal with them. But how well do they cope? Kafka is not known to work well when there are thousands of topics and partitions, even if the data volume is not massive. You can see how complicated it can be to solve performance challenges in these scenarios.

Fortunately, Pulsar is designed to serve over 1M topics in a cluster. The key to scaling the number of topics is how data is stored. In Kafka, data for a topic is stored in dedicated files and directories. As a result, Kafka has trouble scaling because I/O is scattered across the disk as these files are periodically flushed from the page cache to disk. In contrast, Pulsar stores data in bookies (BookKeeper servers), where messages from different topics are aggregated, sorted, stored in large files and then indexed. With this design, Pulsar is able to scale to millions of topics.
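The storage idea above can be sketched with a toy model: every topic appends to one shared log, and a per-topic index records where its messages live. `ToyBookie` and the topic names are hypothetical, and the sketch leaves out the sorting step and everything else a real BookKeeper server does; it only shows why adding topics does not add files.

```python
class ToyBookie:
    """Hypothetical stand-in for a bookie: one shared log plus a per-topic index."""

    def __init__(self):
        self.log = []    # single append-only "file" shared by all topics
        self.index = {}  # topic -> positions of its messages in the shared log

    def append(self, topic, message):
        position = len(self.log)
        self.log.append((topic, message))  # sequential write, whatever the topic
        self.index.setdefault(topic, []).append(position)

    def read_topic(self, topic):
        # The index turns a per-topic read into lookups in the shared log.
        return [self.log[p][1] for p in self.index.get(topic, [])]


bookie = ToyBookie()
for n in range(3):
    for topic in ("es/madrid/device-1", "fr/paris/device-7"):
        bookie.append(topic, f"{topic}#{n}")

paris = bookie.read_topic("fr/paris/device-7")
```

Writes from both topics interleave in the same log, so the number of topics grows the index rather than the number of files being flushed.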

Global architectures

Another common error in many projects I have participated in is the limited scope of their initial design. When you begin to design the architecture, you are often focused on the ROI for the first year and on local impact. However, when future expansion to new countries becomes mandatory, you are often forced to expand that same infrastructure to new regions without a global architecture design.

Kafka brokers are designed to work together in a network in a single region or even availability zone. So, there is no easy way to work with a multi-datacenter architecture. In contrast, geo-replication is an out-of-the-box feature in Pulsar. Global clusters can be configured at the namespace level to replicate data among any number of clusters. Additionally, Pulsar’s multi-tenancy feature makes it possible to stand up one cluster for an enterprise while still providing isolation of data storage.
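To make namespace-level replication concrete, here is a toy model. All names (`ToyCluster`, the cluster names, the `acme/...` namespaces and the `publish` helper) are made up for illustration and do not reflect the Pulsar API; replication is shown as a synchronous copy only to keep the sketch short.

```python
class ToyCluster:
    """Hypothetical stand-in for one Pulsar cluster in one region."""

    def __init__(self, name):
        self.name = name
        self.topics = {}  # full topic name -> list of messages

    def store(self, topic, message):
        self.topics.setdefault(topic, []).append(message)


clusters = {name: ToyCluster(name) for name in ("eu-west", "us-east", "ap-south")}

# Replication is configured per namespace, not per topic or per cluster.
namespace_replication = {
    "acme/global-orders": ["eu-west", "us-east", "ap-south"],
    "acme/eu-only": ["eu-west"],
}


def publish(local_cluster, namespace, topic, message):
    """Store a message in every cluster the namespace replicates to."""
    targets = namespace_replication[namespace]
    if local_cluster not in targets:
        raise ValueError(f"{namespace} is not available in {local_cluster}")
    for name in targets:
        clusters[name].store(f"{namespace}/{topic}", message)


publish("eu-west", "acme/global-orders", "created", "order-1")
publish("eu-west", "acme/eu-only", "audit", "login")
```

The producer only talks to its local cluster; which regions end up with a copy is decided by the namespace configuration, so data that must stay in one region simply lives in a namespace with a single cluster.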

OpEx

In Agile projects, it is desirable to begin with fewer features and add new ones incrementally, so that the project is not overwhelmed by too many services that must be coded, tested and maintained. Infrastructure follows a similar pattern. First, we have a small Kafka cluster that is enough for our current volume of data. In the following months, more and more customers arrive, and the cluster can cope by adding new partitions. However, there comes a point when a new server must be added to the cluster, and then not only do I have to change the configuration but I also have to re-balance the existing topics. This is one example of how operational expenditure keeps growing with a Kafka-based architecture.

Happily for us, Pulsar’s layered architecture and stateless brokers help make zero downtime possible in these cases. When a new broker is added to the cluster, it is immediately available for writes and reads and does not spend any time re-balancing data across the cluster. On the data storage side (bookies), when a new bookie is added to the cluster, re-balancing of data based on the replication configuration takes place behind the scenes, without any impact on the cluster. Finally, Pulsar can be easily deployed in Kubernetes clusters, either in managed clusters on Google Kubernetes Engine or Amazon Web Services or in custom clusters. Easy installation and easy maintenance, both of which Pulsar delivers, are exactly what we are looking for.
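Why a stateless broker can start serving right away is easier to see in a toy model where ownership of a topic is just an entry in a map and the data lives elsewhere. Everything here (`brokers`, `ownership`, `storage`, `least_loaded`, `publish`) is a hypothetical simplification; it is not how Pulsar's load manager is implemented.

```python
brokers = ["broker-1"]
ownership = {}  # topic -> broker that currently serves it
storage = {}    # topic -> messages, kept in a separate storage layer


def least_loaded(ownership, brokers):
    """Pick the broker that owns the fewest topics."""
    counts = {b: 0 for b in brokers}
    for owner in ownership.values():
        counts[owner] += 1
    return min(brokers, key=lambda b: counts[b])


def publish(topic, message):
    """Assign an owner on first use, then write to the shared storage layer."""
    if topic not in ownership:
        ownership[topic] = least_loaded(ownership, brokers)
    storage.setdefault(topic, []).append(message)
    return ownership[topic]


publish("t1", "a")
publish("t2", "b")
before = {t: list(m) for t, m in storage.items()}

brokers.append("broker-2")  # new broker joins; no data is copied
owner = publish("t3", "c")  # the new topic lands on the new broker
```

Adding `broker-2` changes nothing in `storage`: existing topics keep their data where it was, and the new broker takes work as soon as it joins.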

Final thoughts

Apache Pulsar is a powerful stream-processing platform that has learned from the weaknesses of previous systems. Its layered architecture is complemented by a number of great out-of-the-box features, including geo-replication, multi-tenancy, zero rebalancing downtime, unified queuing and streaming, TLS-based authentication/authorization, proxy and durability. Compared to other platforms, Pulsar can give you the tools you need to deliver successful projects.

Ready to Pulsar!

(*) Benchmark performed by OpenMessaging Benchmark, a Linux Foundation project.