DEV Community

Marcos Maia
Marcos Maia

Posted on • Originally published at Medium

The basics of producing data to Kafka explained using a conversation…

This is a repost from a friend's work and has been published with his authorization. Link to the original article and his twitter account are available at the end of this post.

Learning a new technology or programming language can sometimes be overwhelming especially when you are new to software development. Looking back at the days I clearly remember reading the book Design Patterns: Elements of Reusable Object-Oriented Software by the Gang of Four as part of my software development study. Although the book translated was in Dutch (my mother language) most of the patterns didn't stick in my head at all.

After some time I discovered the book 'Head First Design Patterns'. I saw sold! Once I started to read the book I couldn't stop. The combination of simplicity, funny drawings and the right to the point explanation made the patterns stick in my brain and helped me to pass my exam! Inspired by 'fireside chats' sections in the book you will learn the basics of producing data to Apache Kafka in my blog post today! Happy Reading.

In today's Fireside Chat:

Bill 🤓 = Enthusiastic fresh software developer who developed some applications and has just heard about Kafka.

Jay 😎 = Senior software developer, he knows his way around in Apache Kafka and runs applications leveraging Kafka in productions for many years now.

🤓 : Hi I heard my fellow developers talking about Apache Kafka recently and I would like to learn more about it so I might can use Kafka in one of my next projects. They all say Kafka is cool can you give me a crash course?

😎 : Sure man Apache Kafka is great you are going to love it! But remember don't use a (new) technology because it's cool it should fit the use case.

🤓 : OK so Kafka is just another database like MySQL or PostgreSQL right?

😎 : Not like a 'traditional database' Kafka is a distributed append-only log. 

🤓 : A distributed what?

😎 : No worries we will get there. To make it more clear let me explain a use- case we are working on. In this project we have to deal with a continuous stream of price updates on stocks traded at stock exchanges around the world. We receive between 2.000 up to 15.000 price updates on stocks.

🤓 : per day?

😎 : no, per second! So we don't want to write that amount of data to a traditional database. Since we also have other requirements in the project to use this data in real time we decided to produce this data to a Kafka topic. 

Each individual price update on a stock is a record. Each record is a key/value pair to be sent to specified Kafka topic. 

Kafka is organized in topics you can create a topic for each type of record you want to store. Think of a topic as a 'feed of data'. To keep it simple in this example every record will contain the symbol of the stock and the price in US Dollar. Let me show you a picture:

Topic Basics

Each price update on a stock we receive will be produced to a topic with the name 'stocks' and will be appended to the log. In this example, we have an Apple Stock (APPL) with the price of $198.78 and later on a price update for the Netflix (NFLX) stock came in with a price of $ 369.21.

🤓 : But I can imagine you receive multiple price updates during the day for the same stocks so then you probably need to update something for that stock on the topic right?

😎 : No as I told you before Kafka is not a traditional database. You can't change history previously stored records are immutable so you can only append new 'records' to your topic! Remember Kafka is a distributed append only log. So when we receive new prices they will be appended like this:

Topic Basics 2

Note: In between the price updates for the Apple and Netflix stock we received updates on other stocks as well for simplicity we will focus on the Apple and Netflix stocks.

🤓 : OK got ya… But if you receive up to 15.000 price updates a second can you store all that data on a single machine?

😎 : Good question! Let me tell you a little bit more about the distributed nature of Kafka. The first thing you have to know is that we call a Kafka machine a Kafka broker. Multiple brokers work together in a Kafka cluster.

Kafka Brokers Cluster

The second important part to understand is Kafka breaks your topic log up into partitions. Records produced to the 'stocks' topic (in this example 3 partitions) will be divided over the partitions like this:

Topic Partitions routing

When you combine the two you understand Kafka will spread the partitions over multiple brokers:

Brokers and Partitions

🤓 : I noticed the records in the topic now ended up in different partitions is that what you want?

😎 : It depends whether or not you care about the order the records will be consumed later on. In this case we didn't provide a routing key for the records to they are spread in a round-robin fashion over the partitions.

In case you provide a routing key, Kafka makes sure the records with the same key will end up in the same partition.

So remember the order of your records in your topic is only guaranteed per partition!

Partitions and routing

What do you think happens when one of the Kafka brokers will go down? This doesn't have to be a crash of a broker it can also be because of maintenance.

🤓 : I’m still a bit confused why a topic is split into multiple partitions, why is that so important?

😎 : It’s important because replication in Kafka is implemented at the partition level. Let me explain this by taking a quick look at the picture of partitions and the brokers again. As you can see again 3 partitions are spread out over the 3 brokers in the cluster. It means the total ‘data set’ for that topic doesn’t live in one single broker.

Alt Text

What do you think happens when one of the Kafka brokers will go down? This doesn’t have to be a crash of a broker it can also be because of maintenance.

🤓 : It will probably mean we will lose one-third of the data!

Node down

😎 : You are a quick learner Bill! That’s correct we will lose data and that’s unacceptable. To avoid losing data we can configure a Kafka topic to make one or more copies of each partition. Kafka will make sure each replica (that’s how we call a copy) will end up on a different broker.

Node down replicas

So now we can survive in case one of our brokers is going down without losing data. Because all our partitions 0, 1 and 2 are available.

Node down replicas 1

🤓 : Thanks Jay I learned a lot today. Let’s have another chat soon!

😎 : You are welcome Bill. Next time we will chat about the basics of consuming data from a Kafka topic see you next time.

References

Discussion (0)