Apache Kafka is in its core a distributed, scalable and fault tolerant log system exposed as a Topic abstraction and implemented with a high-performance and language agnostic TCP protocol, which enables it to be used in different ways:
- Stream Processing System
- Message System
- Storage System
Apache Kafka documentation is really concise and to the point, I really recommend you to read it as your initial reference to working with Kafka.
Kafka has abstractions over its distributed logging, a Topic, also provides client APIs to interact with it. Let's know a bit more about how the topics work behind the scenes to provide high availability and scalability and also about the client APIs that enables us to produce, consume and manipulate the existing data, but first, we're going to take a look at a high-level overview of Kafka with its brokers and clients.
Kafka Brokers are the core where the topics are managed and they also provide some basic support to coordinate the Kafka Clients which will be explained in more details in another post.
Adding more servers to a Kafka Cluster is somewhat trivial all it's needed is to start up a new instance, assign it a unique broker id and once they're started, assign some of the partitions to the new broker. From there Kafka will take care to copy the data to the appropriate broker and integrate the broker to the cluster managing the existing partitions and replicas automatically.
Kafka provides a Topic abstraction over it's distributed log system which can be partitioned 1 to N times and its partitions also can replicate 1 to N times for high availability and durability.
A topic can have zero to many consumers that subscribe to receive the data contained in it and each topic can be partitioned multiple times, when a topic is partitioned the messages will be stored and routed to each partition based if you provide or not a key when sending a message to the topic.
Producers enable us to publish messages to topics. The producer is also responsible to route our message to the appropriate partition if we provide a key or does it in a Round robin fashion by default if we don't provide the key with the message.
The Java implementation of a Kafka Producer is Thread safe so it can be reused to send messages to different partitions and even topics and per some benchmarks reusing the topic is actually recommended and achieves better performance in many cases.
Messages are always appended to the log in the same order as they're sent by the producer and each message will automatically get an "offset" number. A consumer reads the messages in the same order as they're stored in the log.
When you use a Kafka Producer to send a message to a specific topic and don't provide a key, the producer will use round robin and distribute the messages homogeneously between all partition.
If you provide a key for each message the producer will hash the key to pick the partitions, the result hash value must be in the interval 0-NP(zero to number of partitions), which will give you ordering guarantee over the specified key as they will all be in the same ordered partition.
Consumer read messages from 1 to N topics and its partitions.
The consumer has a lot of control over how those messages will be processed and can parallelize and distribute the load based on its configurations, this can be done with a configuration of a consumer group if you give the same consumer group to multiple instances consuming from the same topic they will share the load, dividing the available partitions on that specific topic between the existent group members which will then work as a point to point mechanism where each instance will read from a specific partition.
On the other hand, if you give each client instance it's own group id every instance will process the messages from all existing partitions for that specific topic.
Multiple consumers can read the data from a specific topic and each consumer will control its own offset.
It's a library that abstracts a consumer/producer client where the data is read and published back to a Kafka topic. In this process, you can do data transformation, joins, and many other interesting operations. The most notable and interesting difference between Kafka streams and other streaming platforms like Apache Flink, Apache Spark and other streaming platforms is that Kafka streams being a library it runs directly, co-located with your application, in the same JVM, which gives you a lot of flexibility and control as a developer and enables some nice possibilities.
There's a very nice article comparing Kafka Streams and Flink published at the Confluent Blog if you want to know more details about the differences I recommend you to read it.
The Kafka streams API is divided into two, a high-level Streams DSL which provides you stateless and stateful operations over the stream created from a topic and is, in most cases enough, the Streams API also provides a low-level Processor APIs which gives you more control with the tradeoff of extra complexity. The recommendation is to use it only in very specific cases where you need more control over your streams and can't get it done with the Streams DSL.
We will see Kafka Streams in more detail in future posts.
Kafka connectors enable you to "synchronize" data between Kafka and external resources in a reliable and performant way enabling you to move large volumes of data between your systems while taking advantage of the resilience and scalability of Kafka and leveraging its automated control of offsets and common framework to integrate different data sources in your organization.
Having standard, automated and reliable integration is something very important for a big organization and empower teams to do real-time processing and reactions over changing data from many different systems.
We will see Kafka Connectors in more details in future posts.
We've just finished a very quick crash course on Kafka. I hope this article is useful to give you a very quick high-level overview of Kafka, understanding the basics of how it works and it's APIs.
For a more detailed explanation of some of the concepts covered here, I strongly suggest that you take a look at this article from my friend Tim: Head First Kafka where those concepts are covered in a very friendly way.