This post was originally posted on my personal blog.
Apache Kafka is a distributed data streaming platform made for publishing, subscribing to, storing, and processing streams of events or records in real time. It is designed to take in data from multiple sources, store that data in a reliable way, and allow that data to be consumed from multiple systems. It is also designed to handle trillions of events per day. It was originally developed at LinkedIn and is now an open source Apache project.
Apache Kafka is an alternative to traditional message queue systems, such as ActiveMQ or RabbitMQ. A message queue is a form of asynchronous service-to-service communication. It allows a service to send messages to a queue, where another service can then read those messages. Services that write to a queue are typically called producers. Services that subscribe and read from a queue are called consumers.
This communication is called asynchronous because once a service sends the message, it can continue doing other work instead of waiting for a response from another service. In a nutshell, message queues allow a number of systems to pull a message, or a batch of messages, from the end of the queue. Typically, after a message has been read, it is removed from the queue.
The simplest use of a message queue is a task list, where a consumer reads a task off the queue and processes it. More consumers can be added to increase concurrency and speed up task processing, but this model does not allow multiple actions to happen based on the same message. This can be generalized as a list of commands where each command is processed by only one consumer.
To improve upon this, the publish/subscribe model was born (a.k.a. pub/sub). In the pub/sub model, multiple consumers can subscribe to the same queue and each consumer can read the same message independently.
For example, imagine a queue that provides the latest stock price of a given stock. There could be many systems that would be interested in consuming the latest stock price. Those systems can subscribe to the queue and each system will read the latest stock price, even if another independent system has already read it.
This can be generalized as a list of events where each consumer can process every event.
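The difference between the two models can be sketched with a small in-memory toy. This is not any real broker's API; the class names TaskQueue and PubSubChannel, and the sample messages, are made up purely for illustration.

```python
from collections import deque


class TaskQueue:
    """Toy work queue: each message is delivered to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Reading removes the message, so no other consumer will see it.
        return self._messages.popleft() if self._messages else None


class PubSubChannel:
    """Toy pub/sub channel: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = deque()

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def receive(self, name):
        inbox = self._inboxes[name]
        return inbox.popleft() if inbox else None


queue = TaskQueue()
queue.send("resize-image-42")
first_worker = queue.receive()   # gets the task
second_worker = queue.receive()  # nothing left to read

prices = PubSubChannel()
prices.subscribe("dashboard")
prices.subscribe("alerts")
prices.publish({"symbol": "ACME", "price": 101.5})
dashboard_msg = prices.receive("dashboard")  # both subscribers see the price
alerts_msg = prices.receive("alerts")
```

In the queue, the second worker comes up empty because the first one already took the task. In the pub/sub channel, both subscribers receive the same stock price.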
Apache Kafka, as stated above, is an alternative messaging system that encompasses the concepts of message queues, pub/sub, and even databases. A producer can publish a record to a topic, rather than a queue. Consumers can then subscribe to and read messages from that topic. Unlike most message queues, messages from a topic are not deleted once they are consumed; rather, Kafka persists them to disk. This allows you to replay messages and allows a multitude of consumers to process differing logic for each record, or like the example above, each event.
There are many benefits provided by Apache Kafka that most message queue systems were not built to provide.
Reliability: Kafka is distributed, partitioned, replicated, and fault tolerant. We'll explore what this means later on.
Scalability: Kafka scales easily to multiple nodes and allows for zero-downtime deployments and upgrades.
Durability: Kafka's distributed commit log allows messages to be persisted on disk.
Performance: Kafka's high throughput for publishing and subscribing allows for highly performant distributed systems.
As described above, Kafka provides a unique range of benefits over traditional message queues or pub/sub systems. Let's dig deeper into the internals of Kafka and how it works.
The architecture of Kafka is organized into a few key components. As a distributed system, Kafka runs as a cluster. Each instance of Kafka within a cluster is called a broker. All records within Kafka are stored in topics. Topics are split into partitions of data; more on that later. Lastly, producers write to topics and consumers read from topics.
At the heart of Apache Kafka lies a distributed, immutable commit log, which is quite similar to the git log we all know and love. Each record published to a topic is committed to the end of a log and assigned a unique, sequential log-entry number. This is also often called a "write-ahead log". Essentially, we get an ordered list of events that tell us two things: what happened and when it happened. In distributed systems, for many reasons, this is typically the heart of the problem.
As a side effect of Kafka topics being based around a commit log, we get durability. Data is persisted to disk and is available for consumers to read as many times as they would like to. If desired, Kafka can then be used as a source of truth, much like a database.
For example, imagine a users topic. Each time a new user registers within an application, an event is sent to Kafka. From here, one service can then read from the users topic and persist it in a database. Another service might read the users topic and send a welcome email. This allows us to decouple services from one another and often helps implement microservices and event-driven architectures.
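To make the idea concrete, here is a minimal in-memory sketch of an append-only log with two independent readers. It is a toy model, not Kafka's client API: the Topic and Reader classes and the sample user records are assumptions made up for this example.

```python
class Topic:
    """Toy append-only log; records are never removed when they are read."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


class Reader:
    """Each reader remembers its own position, independent of other readers."""

    def __init__(self, topic):
        self._topic = topic
        self.offset = 0

    def poll(self):
        records = self._topic.read_from(self.offset)
        self.offset += len(records)
        return records


users = Topic()
users.append({"id": 1, "email": "ada@example.com"})
users.append({"id": 2, "email": "alan@example.com"})

db_writer = Reader(users)  # would persist each user in a database
emailer = Reader(users)    # would send each user a welcome email

saved = db_writer.poll()
welcomed = emailer.poll()

# Replaying is just a matter of rewinding the reader's offset.
emailer.offset = 0
replayed = emailer.poll()
```

Both readers see every registration, and neither one affects what the other can read. Because the log keeps the records, the emailer can rewind and process them again.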
As described above, Kafka stores data within topics. Topics are then split into partitions. A partition is an ordered, immutable log of records that is continually appended to. Each record in a partition is assigned a sequential id number, called the offset, that uniquely identifies the record within the partition. A topic is made up of one or more partitions.
Splitting topics into multiple partitions provides multiple benefits:
Logs can scale larger than the size of one server; each partition must fit within the size of one server but a topic with multiple partitions can spread across many servers
Consumption of topics can be parallelized by having a consumer for each partition of a topic, which we will explain later on
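A small sketch may help show how offsets relate to partitions. This is a toy model under my own naming (PartitionedTopic is hypothetical, and the caller picks the partition by hand); it only illustrates that offsets are counted per partition, not per topic.

```python
class PartitionedTopic:
    """Toy topic: a list of independent append-only logs, one per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        log = self.partitions[partition]
        log.append(record)
        # A record is identified by its partition plus its offset in that partition.
        return partition, len(log) - 1


topic = PartitionedTopic(num_partitions=3)
a = topic.append(0, "signup:ada")   # (0, 0)
b = topic.append(1, "signup:alan")  # (1, 0)
c = topic.append(0, "login:ada")    # (0, 1)
```

Notice that the first record in partition 1 also gets offset 0. An offset is only meaningful together with its partition.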
A Kafka cluster persists all published records, whether or not they have been consumed, for a configurable retention period. Kafka's performance is effectively constant with respect to the size of the data on disk, so storing data for a long time is not a problem. The retention period can be set based on a length of time or the size of the topic.
For example, if the retention policy is set to five days, then a record can be consumed for up to five days since being published. After those five days have passed, Kafka will discard the record to free up disk space.
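The five-day example can be sketched as a simple filter over timestamped records. This is only an illustration of the idea: the function name apply_retention and the sample data are made up, and a real broker cleans up whole log segments rather than individual records.

```python
from datetime import datetime, timedelta


def apply_retention(log, now, retention):
    """Return only the records that are still within the retention window."""
    cutoff = now - retention
    return [(ts, value) for ts, value in log if ts >= cutoff]


now = datetime(2024, 1, 10, 12, 0)
log = [
    (datetime(2024, 1, 3, 9, 0), "old event"),     # about 7 days old
    (datetime(2024, 1, 8, 9, 0), "recent event"),  # about 2 days old
]
kept = apply_retention(log, now, retention=timedelta(days=5))
```

Only the record published within the last five days survives; the older one is discarded regardless of whether anyone consumed it.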
Kafka can also persist data indefinitely based on the key of a message. This is very similar to a database table, where the latest record for each key is stored. This is called log compaction, and leads to what is called a compacted topic. Older records whose key has since been written again with a newer value will eventually be garbage collected and removed from the topic.
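The effect of compaction can be sketched in a few lines. This is a simplified model with a made-up compact function and sample records; it shows the end result (one record per key) rather than how or when a broker actually performs the cleanup.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest_index = {key: i for i, (key, _) in enumerate(log)}
    return [record for i, record in enumerate(log) if latest_index[record[0]] == i]


log = [
    ("user-1", "ada@old.example"),
    ("user-2", "alan@example.com"),
    ("user-1", "ada@new.example"),
]
compacted = compact(log)
```

After compaction, the outdated address for user-1 is gone, while the latest record for every key remains in its original position in the log.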
Each broker holds a set of partitions where each partition is either a leader or a replica for a given topic. All writes to and reads from a topic happen through the leader. The leader coordinates updates to replicas when new records are appended to a topic. If a leader fails, a replica takes over as a new leader. Additionally, a replica is said to be in-sync if all data has been replicated from the leader. By default, only in-sync replicas can become a leader if the leader fails. Out-of-sync replicas can be a sign of broker failure or problems within Kafka.
By having multiple replicas of a topic, we help ensure data is not lost if a broker fails. For a cluster with n brokers and topics with a replication factor of n, Kafka will tolerate up to n-1 server failures before data loss occurs.
For example, let's say you have a cluster with 3 brokers. Imagine a users topic with a replication factor of 3. If one broker is lost, the topic still has 2 in-sync replicas and no data loss occurs. Even further, if another broker is lost, the topic is down to 1 replica and there is still no data loss. Impressive!
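The three-broker scenario can be walked through with a toy model. Everything here is an assumption for illustration: ReplicatedPartition is a hypothetical class, it treats every replica as in-sync, and it promotes the next replica in list order, which glosses over how a real cluster elects leaders.

```python
class ReplicatedPartition:
    """Toy model: one leader plus in-sync replicas, one copy per broker."""

    def __init__(self, brokers):
        self.in_sync = list(brokers)  # first entry acts as the leader

    @property
    def leader(self):
        return self.in_sync[0] if self.in_sync else None

    @property
    def data_available(self):
        return len(self.in_sync) > 0

    def fail(self, broker):
        # Dropping the failed broker promotes the next in-sync replica, if any.
        self.in_sync.remove(broker)


users = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])

users.fail("broker-1")
leader_after_one = users.leader         # a replica has taken over

users.fail("broker-2")
leader_after_two = users.leader         # the last remaining copy
still_available = users.data_available  # two failures tolerated

users.fail("broker-3")
lost = not users.data_available         # the third failure loses the data
```

With a replication factor of 3, the data survives two broker failures and is only lost on the third, matching the n-1 rule.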
Load of the cluster is managed by distributing the number of partition leaders across multiple brokers within the cluster. This allows Kafka to handle high amounts of reads and writes without putting all the strain on one broker – unless you only have
Producers publish to topics of their choosing. The producer is responsible for choosing which partition within the topic each record is written to. This can be done in a round-robin fashion to balance load, or according to a semantic partition function (such as one based on a key within the record).
For example, the default partition strategy for the Java client uses a hash of the record's key to choose the partition. This preserves message order for messages with the same key. If the record's key is null, then the Java client will spread records across the available partitions without regard to their content (the exact strategy depends on the client version). This can be useful for easily partitioning high-volume data where order does not matter.
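Key-based partitioning can be sketched as a stable hash taken modulo the number of partitions. Note the assumptions: choose_partition is a made-up helper, and CRC32 is used here only as a convenient stand-in for a stable hash. The Java client uses a different hash function (murmur2), so the partition numbers below will not match what a real producer picks.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a record key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("user-1", "signup"), ("user-2", "signup"), ("user-1", "login")]

partitions = {p: [] for p in range(3)}
for key, value in events:
    partitions[choose_partition(key, 3)].append((key, value))

# All records for a given key land in one partition, in the order they were sent.
user_1_partition = partitions[choose_partition("user-1", 3)]
user_1_events = [event for event in user_1_partition if event[0] == "user-1"]
```

Because the same key always hashes to the same partition, the signup and login events for user-1 stay in order relative to each other, even though the topic as a whole has no global order.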
Consumers in Kafka are organized into consumer groups. A consumer group is a set of consumer instances that consume data from partitions in a topic.
Within a consumer group, each partition is read by exactly one consumer at a time, which allows us to scale the number of consumers up to the number of partitions to increase consumption throughput. A single consumer may be assigned several partitions, but no two consumers in the same group read from the same partition. The group as a whole then consumes all messages from the entire topic.
For example, imagine a topic with 6 partitions. If you have 6 consumers in a consumer group, each consumer will read from 1 partition. If you have 12 consumers, six of them will be idle while the other six each consume from 1 partition. If you have 3 consumers, each consumer will read from 2 partitions. If you had 1 consumer, it would read from all of the partitions.
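The arithmetic in this example can be checked with a simple round-robin assignment. This is a sketch under stated assumptions: the assign and group helpers are made up, and Kafka's built-in assignment strategies are more involved than this, but the resulting partition counts per consumer are the same for these cases.

```python
def assign(num_partitions, consumers):
    """Spread partitions over consumers round-robin; extra consumers stay idle."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        consumer = consumers[partition % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


def group(size):
    """Build a list of hypothetical consumer names for a group of the given size."""
    return [f"consumer-{i}" for i in range(size)]


six = assign(6, group(6))        # one partition each
twelve = assign(6, group(12))    # six consumers get nothing
three = assign(6, group(3))      # two partitions each
one = assign(6, group(1))        # a single consumer reads everything

idle = [consumer for consumer, parts in twelve.items() if not parts]
```

Adding consumers beyond the number of partitions does not increase throughput, since the extra consumers have nothing to read.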
Each consumer group reads from a topic independent of any other consumer group. This allows for many systems (each having their own consumer group) to read every message in the topic, unlike consuming messages from a traditional message queue.
It's important to note that ordering within a topic is only guaranteed for each partition. Thus, if you care about the order of records, it's important to partition based on something that preserves ordering (such as a primary key) or to only use one partition.
This post was only a simple introduction to the key concepts of Kafka. We'll dig deeper into the internals of Kafka, the guarantees it makes, real-world use cases, and in-depth tutorials on how to use Kafka in further posts.
Overall, Kafka is quickly becoming the backbone of many organizations' data pipelines. It allows for massive throughput of messages while maintaining stability. It enables decoupling of producers and consumers for a flexible and adaptive architecture. Lastly, it provides reliability, consistency, and durability guarantees that many traditional message queue systems do not. I hope you enjoyed learning about how Kafka can be a useful tool when building large-scale data platforms!