ChunTing Wu

Is there an Alternative to Debezium + Kafka?

I asked this question on Reddit a while back and received lots of valuable answers.

Therefore, I've looked into each answer and documented the results in this article.

TL;DR

No, Debezium dominates the market at the moment, despite some drawbacks.

Background Explanation

Why would we want to find an alternative to Debezium? The main reason is that we ran into a challenging scenario.

[Diagram: a typical Debezium pipeline, where changes from the data source are captured and fed into Kafka]

This is a typical scenario for Debezium, where any modifications to the data source are captured and fed into Kafka for downstream processing.

The advantage of this architecture is that it is simple and efficient, keeping all downstream processing as close to real time as possible.

If the source generates a large number of updates, Debezium can scale horizontally, but only until those updates are concentrated in a single table. That is where Debezium hits its limits.

Horizontal scaling means the updates originally handled by one process can be distributed across multiple processes. But once each table already has its own dedicated process, there is nothing left to split, and scaling out further is no longer feasible.
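To make the limitation concrete, here is a rough sketch (not our actual setup) of how tables are usually split across connectors through the Kafka Connect REST API. The endpoint, connector names, and config keys are illustrative and depend on the Debezium version; the point is that the split happens at the table/collection level, so a single hot collection cannot be divided any further.

import requests

KAFKA_CONNECT = "http://localhost:8083"  # assumed Kafka Connect REST endpoint

# Each connector watches a disjoint set of collections, so load can be spread
# across connectors, but one hot collection always stays with a single connector.
connectors = {
    "cdc-orders": ["test.orders"],
    "cdc-users-logs": ["test.users", "test.logs"],
}

for name, collections in connectors.items():
    config = {
        # Config keys vary by Debezium version; shown here for illustration only
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
        "topic.prefix": name,
        "collection.include.list": ",".join(collections),
        "tasks.max": "1",  # a single collection cannot be split into more tasks
    }
    resp = requests.post(f"{KAFKA_CONNECT}/connectors",
                         json={"name": name, "config": config})
    resp.raise_for_status()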

We are in exactly that situation: in our environment, even with the machine specification stretched, the CDC throughput for a single table is capped at 25 MB/s.

This is certainly not a common case; after all, 25 MB/s of changes on a single table is quite significant. However, if a data source happens to be doing a large-scale data migration, this limit can easily be breached.

To keep our downstream data pipeline real-time, all we can do is ask the upstream to be merciful when running migrations at this scale and to apply careful rate limiting.

However, this limitation greatly reduces the productivity of the upstream developers. On the one hand, they have to add an auditing step to their regular maintenance; on the other, they need to implement additional rate limiting for every maintenance task.
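For reference, the kind of rate limiting we ask for looks roughly like the sketch below. All the numbers and field names are made up; the idea is simply to update in batches and sleep long enough to stay under a byte-rate budget that leaves the CDC pipeline some headroom.

import time

import pymongo

BATCH_SIZE = 1_000
TARGET_BYTES_PER_SEC = 10 * 1024 * 1024   # hypothetical budget, below the 25 MB/s cap
AVG_DOC_SIZE = 16 * 1024                  # rough estimate of one updated document

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["test"]["test_new"]

def flush(ids):
    # Apply the migration to one batch, then sleep to respect the byte-rate budget
    collection.update_many({"_id": {"$in": ids}}, {"$set": {"migrated": True}})
    time.sleep(len(ids) * AVG_DOC_SIZE / TARGET_BYTES_PER_SEC)

batch = []
for doc in collection.find({}, {"_id": 1}):
    batch.append(doc["_id"])
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch = []
if batch:
    flush(batch)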

So let's find a solution.

Solution Overview

The following solutions were gathered from that Reddit thread.

  1. Estuary Flow
  2. Striim
  3. Fivetran HVR
  4. Proprietary CDC
  5. Conduit

The first three solutions are commercial services without an open-source offering, so they are not going to work for us. After all, we are trying to solve one part of our use case, not do a complete overhaul.

Although Estuary Flow claims to support local deployment, I couldn't find any information about it.

The fourth option was to develop a new tool ourselves, which I believe would be a fundamental solution to the problem. After all, Debezium is developed in Java, and we should be able to achieve better performance with Go, Rust, or even C/C++. However, the development cost was too high for us, and it would be difficult to start from scratch.
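For a sense of what "from scratch" means: a homegrown CDC tool would start from the database's native change feed and then handle offsets, retries, serialization, and delivery itself. Below is a minimal sketch of just that first step using MongoDB change streams; the connection string and collection names are placeholders.

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = client["test"]["test_new"]

# full_document="updateLookup" returns the whole document for update events
with collection.watch(full_document="updateLookup") as stream:
    for event in stream:
        # A real tool would serialize this event and produce it to Kafka,
        # track resume tokens for restarts, and so on; here we just print it.
        print(event["operationType"], event.get("documentKey"))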

The first four options didn’t meet our needs, but the fifth option caught my attention as a promising solution.

Conduit is an open-source data migration platform developed in Go that provides a variety of connectors for integrating with many data stores. In addition, we can develop our own converters for data format preprocessing.

Therefore, I started to test the performance of Conduit.

Experiment Environment

To keep things simple, I used Kafka Connect in place of Debezium. The two are essentially the same apart from the dispatcher, and behind the scenes they both use the same library.

[Diagram: Locust generates MongoDB changes; Conduit and Kafka Connect each write them to their own Kafka topic]

Locust is responsible for generating MongoDB changes, and Conduit and Kafka Connect each write those changes to a separate Kafka topic.

We can then observe the write rate of each Kafka topic to determine which tool performs better.
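The throughput comparison itself can be as simple as counting the bytes arriving on each topic. A rough sketch with kafka-python follows; the topic name and broker address are placeholders.

import time

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "conduit.test_new",               # swap in the Kafka Connect topic for the other run
    bootstrap_servers="localhost:9092",
    auto_offset_reset="latest",
)

window_start = time.time()
bytes_seen = 0
for msg in consumer:
    bytes_seen += len(msg.value or b"")
    elapsed = time.time() - window_start
    if elapsed >= 10:                 # report every ~10 seconds
        print(f"{bytes_seen / elapsed / 1024 / 1024:.2f} MB/s")
        window_start, bytes_seen = time.time(), 0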

The whole experiment environment is as follows.

https://gist.github.com/wirelessr/82a642685d40d78a49a4cdb1ff1cfa9f

I used two images I packaged myself, one for Conduit and one for Kafka Connect, each with the MongoDB connector installed.

It's easy to generate a large volume of changes by stuffing MongoDB with a bunch of fat documents and then simply changing the value of one field in all of them.
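A sketch of that seeding step is below; the collection name matches the Locust script that follows, while the document size and count are arbitrary, and Faker is just one convenient way to generate the padding.

import pymongo
from faker import Faker

fake = Faker()
client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["test"]["test_new"]

# Insert "fat" documents so that every later update produces a large change event
docs = [
    {
        "seq": 0,
        "name": fake.name(),
        "payload": fake.text(max_nb_chars=10_000),  # padding to fatten the document
    }
    for _ in range(10_000)
]
collection.insert_many(docs)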

The locust script used is as follows.

import time

from locust import User, task, between
from faker import Faker
import pymongo

fake = Faker()

class MongoDBUser(User):
    wait_time = between(1, 5)

    def __init__(self, environment):
        super().__init__(environment)

        self.env = environment

    def on_start(self):
        # self.host comes from Locust's --host option (the MongoDB connection string)
        self.client = pymongo.MongoClient(self.host)
        self.db = self.client["test"]
        self.collection = self.db["test_new"]

    @task
    def incr_seq(self):
        response = None
        exception = None
        start_perf_counter = time.perf_counter()
        response_length = 0

        try:
            # Bump a counter on every document, producing one change event per document
            response = self.collection.update_many(
                {},
                {"$inc": {"seq": 1}}
            )
            response_length = response.matched_count
        except Exception as e:
            exception = e

        # Report the operation to Locust's request statistics
        self.env.events.request.fire(
            request_type="mongo",
            name="incr seq",
            response_time=(time.perf_counter() - start_perf_counter) * 1000,
            response_length=response_length,
            response=response,
            context=None,
            exception=exception,
        )

Load Test Result

For this test, I used a local machine without fully stressing its CPU or memory, leaving some resources available to avoid errors from performance bottlenecks.

In other words, this test shows how well a single process can handle the load of a single table under normal conditions.

[Screenshot: Conduit load test result]

[Screenshot: Kafka Connect load test result]

As the results show, Kafka Connect's throughput significantly outperforms Conduit’s when system resources are sufficient.

I was a bit confused about this result, so I repeated the test a few times, but I got similar numbers.

Wrap Up

Back to the question in the title.

Is there an alternative to Debezium + Kafka?

Not at the moment—at least, not among open-source tools.

I've asked on Reddit, but maybe Dev.to will have a different answer, so feel free to offer your solutions.
