So what is Kafka?
It is a distributed event streaming platform
OK, so what does it do? :)
An event is a thing that happens. Any action in a computer or a service results in an event, which in a traditional system is usually recorded in logs. This can be a file-based log or rows saved into a database.
But these are old-school techniques and they don't scale nicely. Ever seen a petabyte-scale database? Internet scale requires new tech. Enter Kafka, a distributed event streaming platform designed with all of these scaling problems in mind!
To understand this better, let's take an example.
Imagine that you are taking an Uber ride. What usually happens?
1) You open the app and look for nearby rides, and this kicks off a series of internet calls. Between your app and the Uber servers a lot of API calls are initiated. For example, the user app API first logs an event stating that you opened the app, with its date, time, location, etc. (yes! everything you do on the internet, every moment/event, is tracked; welcome to the internet), and then your login, which is another event.
2) Then you start seeing cars nearby. This involves a maps API, a driver API, and an algorithm that calculates the distance between you and your driver(s), plus many different permutations and combinations: which Uber vehicle type you are requesting, whether it is peak pricing, whether you want a shared pool, how many users are nearby, etc. Each and every one of these actions generates events that get stored somewhere (hint: Kafka).
3) Then you get matched to a driver, who will reach you in x minutes. Once you get in the car, another API tracks map traffic and your driver's speed for an arrival estimate; again, second-by-second events are being generated.
4) Finally your ride completes, various APIs like the payment and review APIs get invoked, and the ride comes to an end. By this time your ride alone would have created more than 100 events (not counting the second-by-second map API calls).
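To make this concrete, here is a rough sketch of what one such event might look like as a JSON payload. The field names and values are made up for illustration; this is not Uber's actual event schema:

```python
import json
from datetime import datetime, timezone

# A hypothetical "app opened" event; every field here is illustrative.
event = {
    "event_type": "app_opened",
    "user_id": "rider-12345",
    "timestamp": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc).isoformat(),
    "location": {"lat": 37.7749, "lon": -122.4194},
}

# Events are typically serialized (e.g. to JSON) before being published.
payload = json.dumps(event)
print(payload)
```

Every tap in the app would emit a small record like this one, and millions of them add up fast.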
But why can't we use RDBMS databases?
Now let's "assume" that in the United States 10,000 people book an Uber ride every minute, each ride creates 100 events, and every ride takes 10 minutes.
So over the next ten minutes you will have approximately 10 million events generated, and if you are thinking of storing them in a database (an RDBMS), good luck :) .
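The back-of-the-envelope math above can be checked in a few lines (these numbers are the article's assumptions, not real Uber figures):

```python
rides_per_minute = 10_000   # assumed US-wide booking rate
events_per_ride = 100       # assumed events generated per ride
window_minutes = 10         # the ten-minute window we're looking at

# Roughly rides_per_minute * window_minutes rides run in the window,
# each emitting events_per_ride events.
total_events = rides_per_minute * window_minutes * events_per_ride
print(total_events)  # 10000000, i.e. ~10 million events
```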
Even though this can theoretically still be stored in a database, the constant growth of these events will eventually reach a point where you can't scale the RDBMS anymore!
Currently, Uber handles trillions of such events, processing trillions of messages and multiple petabytes of data per day. Imagine putting those in a database :D
So what's the solution?
Enter Kafka, with its topics, producers, and consumers. And what are those?
1) A Kafka producer is something that publishes events; here, the rider app, the driver app, etc.
2) A Kafka consumer is something that consumes these rider and driver events and performs computations, like the cost of your ride, your estimated arrival time, etc.!
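As a toy illustration of the producer/consumer pattern, here is an in-memory sketch. This is not the real Kafka client API (with an actual broker you would use a client library such as kafka-python or confluent-kafka), and unlike real Kafka, this toy removes events on read, whereas Kafka retains the log:

```python
from collections import deque

class TinyTopic:
    """A toy stand-in for a Kafka topic: an append-only queue of events."""
    def __init__(self):
        self.log = deque()

    def produce(self, event):
        # What a producer does: append an event to the topic.
        self.log.append(event)

    def consume(self):
        # What a consumer does: read events in the order they arrived.
        while self.log:
            yield self.log.popleft()

rides = TinyTopic()

# Producers (rider app, driver app) publish events:
rides.produce({"type": "ride_requested", "rider": "alice"})
rides.produce({"type": "driver_assigned", "driver": "bob"})

# A consumer reads them and performs a computation (here, just collecting types):
processed = [e["type"] for e in rides.consume()]
print(processed)  # ['ride_requested', 'driver_assigned']
```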
Check out this Uber architecture and try to map the concepts so far.
Uber has one of the largest deployments of Apache Kafka in the world.
Events are organized and durably stored in topics.
Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder. An example topic name could be "payments". Topics in Kafka are always multi-producer and multi-consumer: a topic can have any number of producers writing to it and any number of consumers reading from it at the same time.
Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers.
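Which partition an event lands in is typically decided by a key: events with the same key always go to the same partition, which is what preserves per-key ordering. A common scheme, and roughly what Kafka's default partitioner does for keyed messages, is hashing the key modulo the partition count. A sketch (using MD5 here for simplicity; Kafka's Java client actually uses murmur2):

```python
import hashlib

NUM_PARTITIONS = 3  # assumed partition count for the example

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the key and map it onto a partition. All events with the
    # same key (e.g. the same ride id) land in the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("ride-42")
p2 = partition_for("ride-42")
print(p1 == p2)  # True: the same key always maps to the same partition
```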
And then there are several APIs:
The Admin API to manage and inspect topics, brokers, and other Kafka objects.
The Producer API to publish (write) a stream of events to one or more Kafka topics.
The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don't need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
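To give a feel for the kind of stateful operation the Streams API performs, here is a plain-Python sketch of a keyed aggregation (counting events per ride). No actual Kafka is involved; in Kafka Streams this shape of computation is expressed with grouping and counting operators over a topic:

```python
from collections import defaultdict

# A stream of input events, as a stream-processing app might read from a topic.
events = [
    {"ride_id": "r1", "type": "requested"},
    {"ride_id": "r1", "type": "driver_assigned"},
    {"ride_id": "r2", "type": "requested"},
    {"ride_id": "r1", "type": "completed"},
]

# Stateful aggregation: count events per key (per ride).
counts = defaultdict(int)
for event in events:
    counts[event["ride_id"]] += 1

print(dict(counts))  # {'r1': 3, 'r2': 1}
```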