In this blog, I'm going to cover the problems of using point-to-point data pipelines to connect servers across a network, and how Kafka's messaging system resolves them.
In a real-world scenario, we have different systems and services that need to communicate with each other, and data pipelines are what establish the connections between two servers or two systems.
Let's take the example of an e-commerce website. It has multiple servers on the front end: application servers for hosting the application, chat servers for the customer chat facility, separate servers for payments, and so on. Similarly, an organization can have multiple servers at the back end which receive messages from the different front-end servers based on their requirements: database servers that store the records, security systems for user authentication and authorization, real-time monitoring servers, and so on.

These data pipelines become complex as the number of systems grows, and adding a new system or server requires more pipelines, which makes the data flow even more complex. Managing the pipelines also becomes difficult because each pipeline has a different set of requirements; for example, pipelines that handle transactions must be more fault-tolerant and robust.

This complexity is what the "messaging system" was created to address. A messaging system reduces the complexity of the data pipelines and makes communication between systems simpler and more manageable. With a messaging system, you can quickly establish remote connections and send your data across the network.
Let's see how Kafka resolves this problem. Kafka decouples the data pipelines and removes the complexity. The applications that produce messages to Kafka are called "producers", and the applications that consume those messages from Kafka are called "consumers".
In the above diagram, Web-Client, Application-1, Application-2, etc. produce messages to Kafka; these are the producers. The database servers, security systems, real-time monitoring servers, etc. consume those messages; these are the consumers. Producers send messages to Kafka, Kafka stores them, and any consumer that wants those messages can subscribe and receive them.
In this workflow, multiple consumers can consume messages from the same topic by subscribing to it, and adding a new service or system, or removing an existing one, is very easy.
Apache Kafka is a distributed publish-subscribe messaging system. Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each record goes to only one of them. In publish-subscribe, each record is broadcast to all consumers, so multiple consumers can receive it.
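The difference between the two models can be sketched in plain Python (no Kafka involved; the consumer and record names are purely illustrative):

```python
from collections import deque

# Queuing: one shared queue, each record goes to exactly one consumer.
queue = deque(["r1", "r2", "r3", "r4"])
consumers = {"c1": [], "c2": []}
names = list(consumers)
i = 0
while queue:
    record = queue.popleft()
    consumers[names[i % len(names)]].append(record)  # only one consumer gets it
    i += 1

# Publish-subscribe: every subscriber receives every record.
subscribers = {"s1": [], "s2": []}
for record in ["r1", "r2", "r3", "r4"]:
    for inbox in subscribers.values():
        inbox.append(record)  # broadcast to all subscribers

print(consumers)    # each record appears in exactly one consumer's list
print(subscribers)  # each record appears in every subscriber's list
```

Kafka's consumer groups (covered below) give you both behaviors at once: consumers within one group share the records like a queue, while separate groups each get the full stream like publish-subscribe.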
Kafka is fast, scalable, and fault-tolerant because a Kafka cluster is distributed, with multiple machines running in parallel. It was originally developed at LinkedIn and later became part of the Apache Software Foundation.
1. Topic
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber. This means that a topic can have zero, one, or many consumers that subscribe to the data written to it.
2. Partitions
Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting its data across multiple brokers. This means each partition can be placed on a separate machine, allowing multiple consumers to read from the topic in parallel.

For example, if you have a sales topic with 3 partitions (partition-0, partition-1, and partition-2), then up to 3 consumers within one consumer group can read its data in parallel.
3. Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic.
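A common strategy is to hash the record's key, so that all records with the same key land in the same partition (preserving per-key ordering). Here is a minimal sketch of the idea in plain Python; note that Kafka's real default partitioner uses the murmur2 hash, while `zlib.crc32` is used here only to keep the example self-contained:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Sketch of key-based partitioning: hash the key and take the
    # remainder, so the same key always maps to the same partition.
    # (Kafka's default partitioner uses murmur2, not crc32.)
    return zlib.crc32(key) % num_partitions

# Records keyed by the same user always land in the same partition,
# so that user's events stay in order.
assert choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
```

Records without a key are instead spread across partitions (older Kafka versions round-robin them; newer ones use a sticky strategy), which balances load but gives no per-key ordering.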
4. Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
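Within a group, each partition is assigned to exactly one consumer instance at a time, which is what makes each record go to only one instance per group. The following is a simplified round-robin sketch of that assignment (Kafka's actual assignors, such as the range and round-robin strategies, are more involved; the names here are illustrative):

```python
def assign_partitions(partitions, consumers):
    # Round-robin sketch: deal the topic's partitions out to the
    # group's members one at a time. Each partition ends up owned
    # by exactly one consumer in the group.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2], ["consumer-a", "consumer-b"]))
# -> {'consumer-a': [0, 2], 'consumer-b': [1]}
```

This also shows why having more consumers in a group than partitions leaves some consumers idle: with 3 partitions, a fourth group member would get an empty assignment.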
5. Brokers
A Kafka cluster is a set of servers, each of which is called a broker.
6. Zookeeper
ZooKeeper is another Apache open-source project. It stores the metadata of the Kafka cluster, such as broker information and topic details, and essentially manages the whole cluster.
These are the basic Kafka commands for producing and consuming messages.
Go to the Kafka home directory and run the following commands.
- Create new topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic mytopic
- List existing topics:
bin/kafka-topics.sh --zookeeper localhost:2181 --list
- Describe a topic:
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic mytopic
- Delete a topic:
bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic mytopic
- Purge a topic (temporarily lower its retention so the old messages get deleted, then remove the override to restore the default):
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --config retention.ms=1000
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic mytopic --delete-config retention.ms
- Consume messages with the console consumer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --from-beginning
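- Produce messages with the console producer (assuming a broker on localhost:9092, matching the consumer command above; type a message and press Enter to send it):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic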
- List the consumer groups known to Kafka:
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
- View the details of a consumer group:
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group <group name>
In this article, I've covered the problems of using point-to-point data pipelines and how Kafka's messaging system resolves them. I've also explained some important Kafka commands and terminology.
Hope this is helpful.