Clusterception Part 1: Introduction

#apachekafka #kubernetes #azure #strimzi

This post is part of a series on running Kafka on Kubernetes on Azure. You can find links to other posts in the series here. All code is available in my GitHub.

In this first part, I'll introduce the central technologies used in the rest of the series.

Motivation

I usually work with Platform-as-a-Service (PaaS) by choice. In Azure this means running applications in, for example, Azure Web Apps or Azure Functions instead of Kubernetes.

I like having the cloud provider take care of hairy details like certificates and intra-cluster communications. I like using Azure AD for all authentication and authorization, both for users and services, instead of setting up certificate authorities and worrying about rotating keys. I like sensible defaults instead of a laundry list of possible configurations.

I will need to look at many of these hairy details during this series. As such, I will not be exactly in my comfort zone.

So why learn about all of this stuff? Well, firstly, both Kafka and Kubernetes are wildly popular technologies used in many organizations, big and small. So from a pure market value point of view, there are worse things you could spend your time learning.

Secondly, even though many PaaS services hide the details from us, getting to know what happens behind the curtain is useful. It will enable better decision-making with a better idea of the tradeoffs, help debug weird errors, and give an understanding of what it would take to run these technologies outside of a cloud environment.

In summary, I am excited about what's to come!

The main characters

Let's introduce the two main cluster types that I will be discussing. Now, I'm taking a very simplified, user-centric view of both of these. So don't get me wrong, I greatly appreciate both of these as feats of engineering, but I'll avoid details in this post.

Kafka

Apache Kafka is, in its own words, a "distributed event streaming platform". On a very high level, you have a bunch of topics hosted on brokers, to which producers send messages and from which consumers read messages. From Kafka's point of view, messages are just bytes, so they can be almost anything - it's up to the producers and consumers to assign meaning to the byte stream. These core services allow you to build elaborate systems that pass and process messages between applications.
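To make the producer/topic/consumer picture concrete, here is a minimal in-memory sketch. It is not Kafka and uses no Kafka client library: the ToyBroker class and the "orders" topic are hypothetical names made up for illustration, and the sketch ignores partitions, replication, and consumer groups. It only shows that a topic behaves like an append-only log of bytes, and that the meaning of those bytes is agreed between producer and consumer.

```python
from __future__ import annotations

import json
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: each topic is an append-only list of bytes."""

    def __init__(self) -> None:
        self._topics: dict[str, list[bytes]] = defaultdict(list)

    def produce(self, topic: str, message: bytes) -> int:
        """Append a message to a topic and return its offset in that topic."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def consume(self, topic: str, offset: int = 0) -> list[bytes]:
        """Return all messages in a topic from the given offset onwards."""
        return self._topics[topic][offset:]


broker = ToyBroker()

# The producer decides how to encode its message; the broker only sees bytes.
broker.produce("orders", json.dumps({"id": 1}).encode("utf-8"))
broker.produce("orders", b"\x00\x01 raw bytes are fine too")

# The consumer has to know that the first message is UTF-8 encoded JSON.
raw_messages = broker.consume("orders")
first_order = json.loads(raw_messages[0].decode("utf-8"))
```

Because the broker never inspects the payload, nothing stops a producer from writing something the consumer cannot decode, which is the gap that tools like Schema Registry aim to close.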

Kafka is, by design, relatively simple in terms of its services. However, there is a Kafka ecosystem of other services that integrate with Kafka and offer crucial extensions to functionality. Examples include Schema Registry for defining message structure between producers and consumers and Kafka Connect for configuration-based integrations between Kafka and other systems. I will be looking at these in later parts of this blog series.

Kubernetes

Kubernetes, on the other hand, is "an open-source system for automating deployment, scaling, and management of containerized applications". What Kubernetes tries to solve is how to distribute available compute capacity to applications, how to make sure the applications keep running during software and hardware breakages, and how to expose the applications inside and outside of the cluster in a structured way.

In a high-level workflow, you put one or more containers that need to work together into a pod, organize one or more pods into a deployment that defines, for example, the resource allocation, and then expose the deployment as a service. Again, very simplified - there are loads more core concepts in Kubernetes and a seemingly endless number of extensions and abstractions you can install in your cluster. I will discuss examples later on in this series.
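The pod, deployment, and service relationship can be sketched as plain data. The example below builds the two manifests as Python dictionaries and prints them as JSON. The application name "hello-web", the nginx image tag, the replica count, and the resource requests are all placeholder assumptions, not something used later in this series.

```python
import json

# Hypothetical label that ties the pods, the deployment, and the service together.
APP_LABELS = {"app": "hello-web"}

# The pod template: one or more containers that need to work together.
pod_template = {
    "metadata": {"labels": APP_LABELS},
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "nginx:1.25",  # placeholder image
                "ports": [{"containerPort": 80}],
                "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
            }
        ]
    },
}

# The deployment: how many copies of the pod to run and how to find them.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "hello-web"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": APP_LABELS},
        "template": pod_template,
    },
}

# The service: a stable address that routes traffic to pods with matching labels.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "hello-web"},
    "spec": {
        "selector": APP_LABELS,
        "ports": [{"port": 80, "targetPort": 80}],
    },
}

manifest = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": [deployment, service]},
    indent=2,
)
print(manifest)
```

Manifests are usually written in YAML, but kubectl also accepts JSON, so the printed output could be saved to a file and applied with kubectl apply -f. The detail worth noticing is that the service never refers to the deployment by name: both simply select pods carrying the same labels.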

So why run Kafka on Kubernetes?

Every major cloud provider has a managed Kubernetes offering available. However, managed Kafka is rarer; at the time of writing, out of the big players, only AWS has a managed Kafka offering. Therefore, Kafka on Kubernetes allows a broader selection of cloud service providers.

There are also good implementations available to get started quickly. I will be using Strimzi during this series.

Why not go with Azure Event Hubs for Kafka?

Based on the documentation, Azure Event Hubs offers transparent support for Kafka workloads, plus a schema registry to boot. So in principle, I could use Event Hubs and forget about running Kafka on Kubernetes altogether.
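As a rough illustration of what that support looks like from the client side, the sketch below builds the connection settings a Kafka client would need to talk to an Event Hubs namespace. It reflects my reading of the Event Hubs documentation rather than anything tested in this series: the property names follow the librdkafka style used by clients such as confluent-kafka, the helper function is hypothetical, and the namespace and connection string are placeholders.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build Kafka client settings for an Event Hubs namespace (hypothetical helper)."""
    return {
        # Event Hubs exposes its Kafka endpoint on port 9093 of the namespace host.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The username is the literal string "$ConnectionString";
        # the password is the namespace connection string itself.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values; a real connection string comes from the Azure portal or CLI.
config = event_hubs_kafka_config("my-namespace", "<connection-string>")
```

The application code stays the same and only the connection settings change, which is what makes Event Hubs attractive when you do not need to run the brokers yourself.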

However, there are two reasons why I'm going with Kubernetes at this stage. Firstly, if you have a hybrid scenario where your solution needs to run on actual Kafka, you'll need to know eventually about many things that you can forget about when using Event Hubs. So better to eat the frog early and develop against something as close to the runtime environment as possible.

Secondly, through this series, I will look at several Kafka ecosystem components that need to run somewhere. So I'll need a platform for the other components, and Kubernetes is a sensible choice, especially for hybrid scenarios.

If, however, you are migrating an on-premises Kafka cluster completely to Azure, then Event Hubs and, for example, Container Apps can make an architecture that's easier to manage. That's something I might revisit in a later post. :)

Hopefully, you found this short introduction interesting! Do join me for part 2 in this series, where I'll set up the initial Azure infrastructure (coming soon).
