As mentioned in the previous article, Apache Kafka is a messaging system, specifically a publish-subscribe messaging system in which publishers send messages and subscribers receive them.
Publishers create the data and send it to a specific location. They are called producers.
On the other end, subscribers can listen in and retrieve the messages. These are called consumers.
In the middle lie the topics. A topic is basically a grouping of messages, and it is where the producers send their data. Each topic has a specific name, which is defined when the topic is created.
The main rule here is that producers must know the name of the topic they want to send their messages to, and they must be authorized to do so. The same goes for subscribers: they must also be authorized to retrieve messages from the topic.
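To make the rule concrete, here is a minimal in-memory sketch of a publish-subscribe system with named topics and per-topic authorization. It is only an illustration and not Kafka's API: the class `ToyPubSub`, its methods, and the service names are all made up for this example.

```python
class ToyPubSub:
    """In-memory stand-in for a publish-subscribe system (not Kafka's API)."""

    def __init__(self):
        self.topics = {}     # topic name -> list of messages
        self.writers = {}    # topic name -> producers allowed to publish
        self.readers = {}    # topic name -> consumers allowed to fetch

    def create_topic(self, name):
        # The topic name is fixed at creation time.
        self.topics[name] = []
        self.writers[name] = set()
        self.readers[name] = set()

    def allow_write(self, topic, producer):
        self.writers[topic].add(producer)

    def allow_read(self, topic, consumer):
        self.readers[topic].add(consumer)

    def publish(self, producer, topic, message):
        # A producer must name an existing topic and be authorized for it.
        if topic not in self.topics:
            raise KeyError(f"unknown topic: {topic}")
        if producer not in self.writers[topic]:
            raise PermissionError(f"{producer} may not write to {topic}")
        self.topics[topic].append(message)

    def fetch(self, consumer, topic):
        # A consumer must be authorized as well.
        if topic not in self.topics:
            raise KeyError(f"unknown topic: {topic}")
        if consumer not in self.readers[topic]:
            raise PermissionError(f"{consumer} may not read from {topic}")
        return list(self.topics[topic])


bus = ToyPubSub()
bus.create_topic("orders")
bus.allow_write("orders", "checkout-service")
bus.allow_read("orders", "billing-service")

bus.publish("checkout-service", "orders", "order-1")
print(bus.fetch("billing-service", "orders"))  # ['order-1']
```

Note that the producer and the consumer never talk to each other directly. They only share the topic name.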
Kafka keeps and maintains all its topics in the broker. This is an executable or daemon service running on a machine, which can be either physical or virtual.
The broker has access to the file system of the machine. It uses the file system to store the messages and organize them by topic. Brokers receive the messages, assign offsets to them, and commit them to storage on disk.
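The receive, assign-offset, and commit-to-disk cycle can be sketched with a toy append-only log. This is a simplified illustration under my own assumptions: `ToyBroker`, the one-file-per-topic layout, and the JSON line format are invented for the example, and Kafka's real on-disk format is different.

```python
import json
import os
import tempfile


class ToyBroker:
    """Toy broker that stores each topic as an append-only file."""

    def __init__(self, broker_id, data_dir):
        self.broker_id = broker_id
        self.data_dir = data_dir
        self.next_offset = {}  # topic name -> next offset to assign

    def _log_path(self, topic):
        return os.path.join(self.data_dir, f"{topic}.log")

    def append(self, topic, message):
        # Assign the next sequential offset, then commit the record to disk.
        offset = self.next_offset.get(topic, 0)
        record = {"offset": offset, "value": message}
        with open(self._log_path(topic), "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        self.next_offset[topic] = offset + 1
        return offset

    def read(self, topic, from_offset=0):
        # Return the values stored at from_offset and later.
        with open(self._log_path(topic), encoding="utf-8") as log:
            records = [json.loads(line) for line in log]
        return [r["value"] for r in records if r["offset"] >= from_offset]


with tempfile.TemporaryDirectory() as data_dir:
    broker = ToyBroker(broker_id=0, data_dir=data_dir)
    first = broker.append("orders", "order-1")
    second = broker.append("orders", "order-2")
    print(first, second)                         # 0 1
    print(broker.read("orders", from_offset=1))  # ['order-2']
```

The offset acts as a message's position within the topic's log, which is what later lets a reader ask for "everything from offset N onward".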
As with any executable, a machine can run more than one broker, but each broker must be uniquely identifiable so that they don't conflict with one another. To do this, each broker has its own broker id.
Brokers retain all published messages, whether or not they have been consumed, until a configurable retention limit is reached. This way of handling topics and messages is what gives Kafka its main edge when it comes to achieving high throughput.
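One consequence of retaining messages is that reading does not remove anything, so several consumers can read the same topic at their own pace. The sketch below shows the idea with made-up names (`RetainedLog`, `ToyConsumer`). It leaves out retention limits entirely and is not how Kafka's client API looks.

```python
class RetainedLog:
    """Messages stay in the log after being read."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)

    def read_from(self, offset):
        # Reading returns a copy of the tail; nothing is deleted.
        return self.messages[offset:]


class ToyConsumer:
    """Each consumer remembers its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        batch = self.log.read_from(self.position)
        self.position += len(batch)
        return batch


log = RetainedLog()
log.append("m0")
log.append("m1")

fast = ToyConsumer(log)
slow = ToyConsumer(log)

print(fast.poll())        # ['m0', 'm1']
log.append("m2")
print(fast.poll())        # ['m2']
print(slow.poll())        # ['m0', 'm1', 'm2']
print(len(log.messages))  # 3
```

The second consumer still sees every message even though the first one has already read them.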
To achieve high throughput, a system must be able to distribute its load and process it efficiently across multiple nodes in parallel. Kafka achieves this by scaling out its brokers to accommodate the load, and it does so without affecting existing producers and consumers.
Recall that a machine can have one or more Kafka brokers running on it. A Kafka Cluster is a grouping of multiple Kafka brokers on a single machine or brokers on different machines.
Additionally, a cluster can be accessed through any of the brokers within it. This means that when you connect to one broker, you also get access to all the other brokers in that cluster.
The cluster size is the number of brokers within the cluster, regardless of whether these brokers are running on the same machine or on separate machines.
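The cluster ideas above (unique broker ids, reaching the whole cluster through any one broker, and cluster size) can be modeled in a few lines. This is a hypothetical sketch: `BrokerInfo`, `ToyCluster`, `metadata_from`, and the host names are invented for illustration and do not correspond to Kafka classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerInfo:
    broker_id: int
    host: str
    port: int


class ToyCluster:
    """A group of brokers, possibly sharing machines."""

    def __init__(self, brokers):
        ids = [b.broker_id for b in brokers]
        if len(ids) != len(set(ids)):
            raise ValueError("broker ids must be unique within a cluster")
        self.brokers = {b.broker_id: b for b in brokers}

    @property
    def size(self):
        # Cluster size counts brokers, not machines.
        return len(self.brokers)

    def metadata_from(self, broker_id):
        # Connecting to any single broker reveals every broker in the cluster.
        if broker_id not in self.brokers:
            raise KeyError(f"no broker with id {broker_id}")
        return sorted(self.brokers.values(), key=lambda b: b.broker_id)


cluster = ToyCluster([
    BrokerInfo(0, "machine-a", 9092),
    BrokerInfo(1, "machine-a", 9093),  # second broker on the same machine
    BrokerInfo(2, "machine-b", 9092),
])

print(cluster.size)                                      # 3
print([b.broker_id for b in cluster.metadata_from(1)])   # [0, 1, 2]
```

Two brokers share `machine-a` here, yet the cluster size is still 3, since size counts brokers and not machines.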
Knowing the cluster size is important, since growing the cluster is the mechanism that allows Kafka to scale to thousands of brokers. To scale out reliably, however, the cluster needs metadata about its brokers. This is where ZooKeeper comes in.
The succeeding notes will dive deeper into Kafka as a distributed system and how Apache ZooKeeper fits into the equation. If you'd like to know more, please proceed to the next note in the series.
Similarly, you can check out the following resources:
Getting Started with Apache Kafka by Ryan Plant
Apache Kafka Series - Learn Apache Kafka for Beginners v2 by Stephane Maarek
Apache Kafka A-Z with Hands on Learning by Learnkart Technology Private Limited
The Complete Apache Kafka Practical Guide by Bogdan Stashchuk
If you've enjoyed this short and concise article, I'll be glad to connect with you on Twitter! You can also hit the Follow button below to stay updated when there's awesome new content! 😃