Logs are everywhere in software development. Without them there’d be no relational databases, no Git version control, and no analytics platforms to speak of.
For many developers that’s the end of our relationship with logs. We use tools that rely on logs. We turn to logs to understand why something went wrong. And up until fairly recently, you could probably go an entire career without thinking much more deeply about them than that.
Commit logs, in particular, have come out of the wings and into the spotlight. At the heart of tools such as Apache Kafka, commit logs are essential to handling increased volumes of data, bringing coherence to distributed systems, and providing a common source of truth for microservices architectures.
Understanding how commit logs work will help you to build more resilient systems using tools such as Kafka.
The good news is that commit logs are about as straightforward as it gets. They are a sequence of records, each with its own unique identifier. Want to add a record? No problem, it gets appended at the end of the log. Want to change an existing record? That you can’t do; once written, records are immutable.
What about reading? That always happens from left to right. While there’s no query, as such, you can use offsets to specify the start and end points of your read.
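That whole contract -- append-only writes, immutable records, sequential offsets, left-to-right reads -- can be sketched in a few lines. This is an illustrative in-memory model, not any particular product’s API; the class and method names are my own.

```python
class CommitLog:
    """A minimal in-memory commit log: records are immutable once
    written, each gets a sequential offset, and reads walk the log
    from left to right between offsets."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log; returns its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start=0, end=None):
        """Read records from offset `start` up to (not including) `end`."""
        return self._records[start:end]


log = CommitLog()
log.append({"event": "page_view", "path": "/"})
log.append({"event": "click", "target": "buy-button"})
print(log.read(1))  # everything from offset 1 onward
```

Note what’s missing: no update, no delete, no query language. That absence is the point, and it’s what makes the structure so cheap to write to.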
And that’s it. We can all go home. Well, perhaps not just yet. As is often the case, the raw “what” is far less interesting than the “why”.
Commit logs have been around for a long while and that’s because they solve a core problem in software development. They provide a source of truth for what has happened in a system and in what order.
Let’s think about a relational database such as PostgreSQL. In Postgres, the commit log is called the write-ahead log (WAL). Each write to a Postgres database must first be recorded in the write-ahead log before the data is changed in either a table or an index. The first benefit is that this speeds up database writes.
Writing to a commit log is relatively fast, even on disk. Writing to more complex data structures, such as a relational table, is necessarily slower. So long as the transaction is recorded in the write-ahead log, the changes to indexes and tables can happen in memory, with the pages flushed to disk later on. That way, the slow part of writing to the database happens asynchronously. Even if the plug is pulled from the server before the table and index changes are written to disk, the database can be rebuilt by replaying the story of what happened, and in what order, as recorded in the write-ahead log.
The bigger benefit, perhaps, is that the database can be recreated from scratch simply by replaying the write-ahead log -- whether that’s as part of disaster recovery or to stream live changes to a read-only replica.
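The mechanics are worth seeing in miniature. Here’s a toy sketch of write-ahead logging -- every change lands in the log before the “slow” table structure is touched, so the table can always be rebuilt by replaying the log. This is a conceptual model, not how Postgres is actually implemented; the names are my own.

```python
class SimpleWAL:
    """Toy write-ahead logging: append each change to the log first,
    then apply it to the table. Recovery replays the log from the start."""

    def __init__(self):
        self.wal = []    # stands in for the on-disk write-ahead log
        self.table = {}  # stands in for the slower table/index structures

    def put(self, key, value):
        self.wal.append(("put", key, value))  # durable record first...
        self.table[key] = value               # ...then the "slow" write

    def recover(self):
        """Rebuild the table from scratch by replaying the log in order."""
        rebuilt = {}
        for op, key, value in self.wal:
            if op == "put":
                rebuilt[key] = value
        return rebuilt


db = SimpleWAL()
db.put("user:1", "Ada")
db.put("user:1", "Ada Lovelace")  # the later write wins on replay
assert db.recover() == db.table   # replaying the log reproduces the table
```

Because replay is deterministic, the same trick gives you replication for free: ship the log to another machine, replay it there, and you have a replica.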
Now, what if a commit log could serve that same purpose not just for a database but for an entire software architecture?
Let’s say you’re building an ecommerce application. You want to better understand how customers make their way around the site and so you record every click, every search term, every page view.
Regardless of how you later process that data, you first need to capture it. Perhaps the quickest solution to develop would be to store each event in your operational database. But that puts you on the wrong side of a trade-off. The operational database -- most likely a relational database -- has a relatively slow write time. That’s fine because it gives you benefits such as rich queries, transactionality, and mutability. But right now all you want to do is capture the data and worry about what to do with it later.
One option would be to put Redis in front of your operational database, so that it can soak up the volume and release slowly at a rate suited to the main database. However, that’s just offsetting the problem and leaves your relatively expensive operational database full of semi-structured data. The visitor data is coming through in huge volumes and there’s no point in increasing the cost and complexity of your operational database in order to handle “nice to have” data such as this.
A commit log is ideal, though. The data you need to store is made up of discrete, ordered events. And the simplicity of commit logs means that they can easily handle far larger volumes of data than a typical relational database.
In fact, this is a typical use case for Apache Kafka, which puts a commit log at the heart of its data model.
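To make the event-capture idea concrete, here’s a sketch of producing clickstream events to a partitioned, append-only topic. The topic here is just an in-memory stand-in, and the event shape and session key are assumptions of mine -- but the keying idea mirrors how Kafka messages are commonly keyed so that one visitor’s events land on the same partition, in order.

```python
import json
import time

# In-memory stand-in for a partitioned topic: each partition is an
# append-only list, so ordering is preserved within a partition.
topic = {0: [], 1: []}


def make_click_event(session_id, path):
    """Build a self-describing clickstream event, keyed by session."""
    return {
        "key": session_id,
        "value": json.dumps({"type": "page_view", "path": path, "ts": time.time()}),
    }


def produce(event, partitions=2):
    """Route by a deterministic hash of the key -- the same concept
    (though not the same algorithm) as Kafka's default partitioner."""
    partition = sum(event["key"].encode()) % partitions
    topic[partition].append(event["value"])


produce(make_click_event("sess-42", "/product/123"))
produce(make_click_event("sess-42", "/checkout"))
```

Writes here are nothing more than an append, which is why this shape of storage absorbs volumes that would swamp a relational database.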
Logs are fast, they are simple, they handle large volumes of data, and they are a source of truth. This makes them ideal for situations where multiple autonomous components form a single system. That could be a full-blown microservices architecture or a monolith where you’ve hived off one or two processes.
In these situations, a log does two jobs. One is that it provides a single place for all parts of the system to find changes. The other is that it imposes an order on those changes simply by the fact of how it works. Let’s go back to ecommerce to explore what this means.
Say we have a simple ecommerce application made up of five distinct services: catalog, payments, orders, customers, and shipping. In this situation, a customer order might take this path:
- The order system writes the customer’s order to the log.
- The catalog system reads the order from the log and checks that there is sufficient stock to fulfill it. It finds there is and so temporarily reduces the stock count accordingly and writes a new record to the log indicating that the order can be fulfilled.
- The payment system reads the record showing that the order can be fulfilled and attempts to take payment from the customer’s card. The payment is successful and the payment system writes back to the log that the order can progress.
- The shipping system reads the log and prepares the order for dispatch.
- The customer system reads the same entry and updates the customer’s record with details of their new order.
- A dispatcher in the warehouse marks the package as sent in the shipping system. The shipping system writes to the log that the package has shipped.
- The customer system sees that status change from the shipping system and updates the customer’s tracking information accordingly.
- The catalog confirms the earlier temporary reduction in stock level.
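The first few steps of that flow can be sketched as services that only ever talk to the log: each reads forward from its own offset, reacts to the records it cares about, and appends new records in turn. This is a deliberately simplified model -- a real system would use Kafka consumer groups, committed offsets, and separate topics -- and the record shapes are my own invention.

```python
# The shared log: the only channel between services.
log = []


def append(record):
    log.append(record)


def poll(offset):
    """Return records written since `offset`, plus the new offset."""
    return log[offset:], len(log)


# The order service writes; it never calls another service directly.
append({"type": "order_placed", "order_id": 1, "sku": "book-1", "qty": 2})

# The catalog service reacts to new orders by reserving stock.
catalog_offset = 0
records, catalog_offset = poll(catalog_offset)
for r in records:
    if r["type"] == "order_placed":
        append({"type": "stock_reserved", "order_id": r["order_id"]})

# The payment service reacts to reservations, and so on down the chain.
payment_offset = 0
records, payment_offset = poll(payment_offset)
for r in records:
    if r["type"] == "stock_reserved":
        append({"type": "payment_taken", "order_id": r["order_id"]})

print([r["type"] for r in log])
# ['order_placed', 'stock_reserved', 'payment_taken']
```

Notice that the log itself fixes the order of events: the payment service can’t see a reservation before the catalog service has written it.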
In reality, there would be many more steps. For example, there might be an entire microservice that kicks into life when a stock reduction message appears in the log and reduces the stock level in the operational database by the right amount. The point is, though, that each of these systems never interacts directly. Each one writes to and reads from the log. That has several benefits.
With the log as the source of truth, there’s no danger that one service might accidentally pre-empt or overrule another because everything happens in order. Another benefit is that, so long as the log is available, the system can theoretically continue even when some components are offline.
You could build your own log but you’re likely to get a better return on your time using something like Apache Kafka.
Kafka gives you the central benefit of a commit log -- the immutable, ordered record of events -- along with integration with common data sources, the ability to write data out to other systems such as Postgres, and tools to act on and transform the data it processes. There are SDKs for most common languages, and if you use a hosted service you don’t need to worry about administering your own Kafka cluster. And, believe me, that’s no small task; take a look at how Heroku manages its huge fleet of Kafka instances.