
Lorna Jane Mitchell

Originally published at aiven.io

Tips for Designing Apache Kafka Message Payloads

Event-driven systems are increasingly our future, and that's one reason why so many developers are adding Apache Kafka to their tech stacks. Getting the different components in your system talking nicely to one another relies on a rather mundane but crucial detail: a good data structure in the message payloads. This article picks out some of the best advice we have for getting your Apache Kafka data payloads well designed from the very beginning of your project.

Use all the features of Apache Kafka records

Kafka records support headers as well as keys and the value that carries the main body of the payload. The most scalable systems use all of these features appropriately.

[Diagram: a Kafka record, with the header, key and value shown as boxes inside the payload]

Use the header for metadata about the payload, such as the OpenTelemetry trace IDs. It can also be useful to duplicate some of the fields from the payload itself, if they are used for routing or filtering the data. In secure systems, intermediate components may not have access to the whole payload, so putting the data in the header can expose just the appropriate fields there. Also consider that, for larger payloads, the overhead of deserializing can be non-trivial. Being able to access just a couple of fields while keeping the system moving can help performance, too.
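To make that concrete, here's a minimal sketch of attaching metadata as record headers from a producer. It uses the kafka-python client, and the broker address, topic name and trace ID are placeholder values; any client with header support follows the same pattern.

# A sketch of setting record headers alongside the payload value.
# Broker, topic, and header values here are illustrative placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

payload = {"type": "sensor_reading", "factory_id": 44891, "value": 21.4}

producer.send(
    "machine-events",
    value=payload,
    # Headers are (str, bytes) pairs: small, routable fields that intermediate
    # components can read without deserialising the whole value.
    headers=[
        ("trace_id", b"4bf92f3577b34da6a3ce929d0e0e4736"),
        ("type", b"sensor_reading"),
    ],
)
producer.flush()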

The keys in Apache Kafka typically do get more attention than the headers, but we should still make sure we are using them as a force for good. When a producer sends data to Kafka, it specifies which topic it should be sent to. The key usually defines which partition is used. If the key isn't set, then the data will be spread evenly across the partitions using a round-robin approach. For a lot of unrelated events in a stream, this makes good use of your resources.

If the key you're using doesn't vary much, your events can get bunched into a small number of partitions (rather than spread out). When this happens, try adding more fields to give more granular partition routing. Keep in mind that the contents of each partition will be processed in order, so it still makes sense to keep logical groupings of data.

For example, consider a collection of imaginary factories where all the machines can send events. Mostly they send sensor_reading events, but they can also send alarm events, which are like a paper jam in the printer but on a factory scale! Using a key like this will give us a LOT of data on one partition:

{
    "type": "sensor_reading"
}

So we could add another field to the key for these readings, maybe to group them by factory location:

{
    "type": "sensor_reading",
    "factory_id": 44891
}

Combining the type and factory in the key ensures that records of the same event type and the same factory will be processed in the order they were received. When it comes to designing the payloads, thinking about these aspects early on in the implementation process can help avoid performance bottlenecks later.
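Here's a sketch of producing with that composite key, again using the kafka-python client with placeholder broker and topic names. Records whose serialised key bytes are identical are hashed to the same partition, so all sensor_reading events from a given factory stay in order.

# Producing with a composite key so related records share a partition.
# Broker address, topic name, and field values are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # sort_keys keeps the serialised key bytes stable, so the same logical key
    # always hashes to the same partition.
    key_serializer=lambda k: json.dumps(k, sort_keys=True).encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

key = {"type": "sensor_reading", "factory_id": 44891}
value = {"machine": "press_07", "sensor": "temperature", "value": 68.2, "units": "C"}

producer.send("machine-events", key=key, value=value)
producer.flush()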

Data structures: nested data or simple layout?

No matter how certain I am that this payload will only ever contain a collection of things, I always use an object structure rather than making the data an array at the top level. Sometimes, it just leaves a rather lonely fieldname with a collection to take care of. But when things change and I do need to add an extra field, this "one weird trick" makes me very grateful.

Make no mistake, it's not foresight. It's the scars of the first API I ever shipped having to move to v1.1 within a week of launch for precisely this reason. Learn from my mistakes!

In general, it's also helpful to group related fields together; once you get to 30 fields in a payload, sorted alphabetically, you will wish you had done something differently! Here's an example showing what I mean:

{
    "stores_request_id": 10004352789,
    "parent_order": {
        "order_ref": 777289,
        "agent": "Mr Thing (1185)"
    },
    "bom": [
        {"part": "hinge_cup_sg7", "quantity": 18},
        {"part": "worktop_kit_sm", "quantity": 1},
        {"part": "softcls_norm2", "quantity": 9}
    ]
}

Using the parent_order object to keep the order ref, its responsible person, and any other related data together makes for an easily navigable structure, more so than having those fields scattered across the payload. It also avoids having to artificially group the fields using a prefix. Don't be afraid to introduce extra levels of data nesting to keep your data logically organised.

How much data to include is another tricky subject. With most Kafka platforms limiting payloads to 1MB, it's important to choose carefully. 1MB goes a long way for text-based data, and even further if a binary format such as Avro or Protobuf is used (more on those in a moment). As a general rule of thumb, if you are trying to send a file in a Kafka payload, you are probably doing it wrong!

These design tradeoffs are nothing new, and I rely mostly on the prior art in the API/webhooks space to inform my decisions. For example, hypermedia is the practice of including links to resources rather than the whole resource. Publishing messages that will cause every subscriber to make follow-on calls is a good way to create load problems for your server, but hypermedia can be a useful middle ground, especially where the linked resources are cacheable.
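As an illustration, here's a hypothetical payload that links to a large resource instead of embedding it; the event type, fields and URL are all made up for this example.

# A hypothetical "document scanned" event that links to the file rather than
# shipping megabytes of content through Kafka.
payload = {
    "type": "document_scanned",
    "document_id": "doc-20481",
    "links": {
        # Consumers that need the content fetch it (and can cache it).
        "content": "https://files.example.com/documents/doc-20481.pdf",
    },
    "content_bytes": 4718592,  # size hint so consumers can decide whether to fetch
}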

Data Formats: JSON, Avro ... these are not real words

Wading through the jargon of data formats is a mission by itself, but I'd like to give some special mentions to my favourites!

JSON: Keep it simple

JSON formats are very easy to understand, write, read and debug. They can use a JSON Schema to ensure they fulfil an expected data structure, but you can equally well go freeform for prototyping and iterating quickly. For small data payloads, I often start here and never travel any further. However, JSON is fairly verbose for the amount of data it transmits, and it also has a rather relaxed relationship with data types. In applications where either or both of these issues cause a problem, I move on from JSON and choose something a bit more advanced.
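If you do want the safety net of a JSON Schema, validating before producing is straightforward. Here's a small sketch using the jsonschema Python package; the schema and payload are illustrative only.

# Checking a payload against a JSON Schema before it is produced.
from jsonschema import validate

schema = {
    "type": "object",
    "required": ["machine", "sensor", "value", "units"],
    "properties": {
        "machine": {"type": "string"},
        "sensor": {"type": "string"},
        "value": {"type": "number"},
        "units": {"type": "string"},
    },
}

payload = {"machine": "press_07", "sensor": "temperature", "value": 68.2, "units": "C"}

validate(instance=payload, schema=schema)  # raises ValidationError if the shape is wrong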

Avro: Small and schema-driven

Apache Avro is a serialisation system that keeps the data tidy and small, which is ideal for Kafka records. The data structure is described with a schema (example below) and messages can only be created if they conform to the requirements of the schema. The producer takes the data and the schema, produces a message that goes to the Kafka broker, and registers the schema with a schema registry. The consumers do the same in reverse: take the message, ask the schema registry for the schema, and assemble the full data structure. Avro has a strong respect for data types, requires that all payloads conform to the schema, and, since details such as field names are encoded in the schema rather than repeated in every payload, the overall payload size is reduced.

Here's an example Avro schema:

{
    "namespace": "io.aiven.example",
    "type": "record",
    "name": "MachineSensor",
    "fields": [
        {"name": "machine", "type": "string", "doc": "The machine whose sensor this is"},
        {"name": "sensor", "type": "string", "doc": "Which sensor was read"},
        {"name": "value", "type": "float", "doc": "Sensor reading"},
        {"name": "units", "type": "string", "doc": "Measurement units"}
    ]
}
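To show how the schema, the registry and the producer fit together, here's a sketch using the older Avro helper in the confluent-kafka Python client (newer client versions expose the same idea through a serializer API). The broker, schema registry URL and topic name are placeholders.

# Producing an Avro-encoded record; the schema is registered with the
# schema registry and only the encoded fields travel in the payload.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

schema_str = """
{
  "namespace": "io.aiven.example",
  "type": "record",
  "name": "MachineSensor",
  "fields": [
    {"name": "machine", "type": "string"},
    {"name": "sensor", "type": "string"},
    {"name": "value", "type": "float"},
    {"name": "units", "type": "string"}
  ]
}
"""

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=avro.loads(schema_str),
)

producer.produce(
    topic="machine-sensors",
    value={"machine": "press_07", "sensor": "temperature", "value": 68.2, "units": "C"},
)
producer.flush()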

There are other alternatives, notably Protocol Buffers, known as Protobuf. It achieves similar goals, but relies on generating code to use in your own application, which makes it available on fewer tech stacks. If it's available for yours, it's worth a look.

A note on timestamps

Kafka attaches its own timestamp to every record when it is produced (or when it is appended to the log, depending on the topic configuration). However, it can also be useful to include your own timestamps for some situations, such as when the data is gathered at a different time from when it is published, or when a retry implementation is needed. Also, since Apache Kafka allows additional consumers to reprocess records later, a timestamp can give a handy insight into progress through an existing data set.

If I could make rules, I'd make rules about timestamp formats! The only acceptable formats are the two below (both shown in the sketch that follows):

  • Seconds since the epoch: 1615910306
  • ISO 8601 format with timezone information: 2021-05-11T10:58:26Z. I should not have to know where on the planet, or on which day of the year, this payload was created.
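Here's a sketch of producing both formats with the Python standard library; the field names are illustrative.

# Two acceptable ways to stamp a payload.
import time
from datetime import datetime, timezone

payload = {
    "type": "sensor_reading",
    # Seconds since the epoch: compact and unambiguous.
    "collected_at": int(time.time()),
    # ISO 8601 with an explicit timezone (UTC here): readable and unambiguous.
    "published_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
}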

Design with intention

With the size limitations on the payloads supported by Apache Kafka, it's important to only include fields that can justify their own inclusion. When the consumers of the data are known, it's easier to plan for their context and likely use cases. When they're not, that's a more difficult assignment but the tips shared here will hopefully set you on a road to success.

