DEV Community

Andrea

Bidirectional topic mirroring

The tool of choice for disaster recovery in Kafka is Mirror Maker. It's often configured to keep a standby cluster stocked with the latest data, ready to take over if something happens to the main cluster.

The same tool is also used for geo-replication. In this case, selected topics are sent to another region so that the data stays consistent between the clusters and each region can serve an up-to-date state.

Normally, however, this last setup works in only one direction. Mirror Maker does support bidirectional replication, but it does so asymmetrically: it adds the name of the originating cluster to the topic name (with the default MirrorMaker 2 policy, as a prefix such as cluster-a.events). This avoids the infinite-mirror effect of a message being replicated back and forth. Consumers and producers have to be aware of this asymmetry, and the topology ends up more complex than necessary.
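To make the asymmetry concrete, here is a minimal shell sketch of the naming scheme. It assumes MirrorMaker 2's default replication policy, which names a mirrored topic after its source cluster alias followed by a dot and the original topic name. The cluster aliases and the events topic are illustrative, and both functions are hypothetical helpers, not part of Kafka.

```shell
# Hypothetical helper: the name a mirrored topic gets on the target cluster,
# assuming MirrorMaker 2's default "<source-alias>.<topic>" naming.
remote_topic() {
  printf '%s.%s\n' "$1" "$2"
}

# Hypothetical helper: the topics a consumer must read to see every record,
# given the remote cluster alias ($1) and the logical topic name ($2).
topics_to_consume() {
  printf '%s\n' "$2"        # the locally produced topic
  remote_topic "$1" "$2"    # the mirror of the remote cluster's topic
}
```

Under these assumptions, a consumer on cluster-b would call topics_to_consume cluster-a events and subscribe to both names it prints, while a consumer on cluster-a would need the mirror-image pair. That is the asymmetry clients have to carry around.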

In a few cases this has led to setups that rely on custom code to aggregate the events. Complexity still lurks, because these bidirectional flows need manual Mirror Maker configuration to be managed effectively. The main challenges are message duplication and potential message loss due to network issues or misconfiguration. These complexities are not only an operational burden; they also introduce a high risk of data inconsistency between clusters.

An alternative to Mirror Maker for bidirectional topic mirroring in Kafka is to use replication tools or services designed with bidirectionality in mind from the start. Tools like Confluent Replicator offer enhanced features over Mirror Maker, including improved fault tolerance, easier configuration, and better support for bidirectional replication. However, these are usually part of a cloud offering and come with a price tag.

Today I tried some alternatives with Apache Pulsar and have prepared a quick PoC you can try yourself.

Setting Up Bidirectional Mirroring with Pulsar

Quick Guide: Follow these streamlined steps to configure bidirectional topic mirroring using Apache Pulsar.

Following the setup outlined in the README of the pulsar-geo-replication GitHub project, let's walk through the commands needed to set up bidirectional topic mirroring with Apache Pulsar. This guide assumes you have already cloned the project and have Docker and Docker Compose installed on your system.

Start the Pulsar Clusters

Navigate to the project directory where the docker-compose.yml file is located. Initiate the clusters by running:

docker-compose up

This command starts two Pulsar clusters, named cluster-a and cluster-b, running in Docker containers.

Connect the Clusters

After the clusters are up, you need to establish a connection between them. Execute the following commands to link cluster-a to cluster-b and vice versa:

./pulsar-admin --admin-url http://localhost:8080 clusters create cluster-b --broker-url pulsar://broker-edge1:6650 --url http://broker-edge1:8080
./pulsar-admin --admin-url http://localhost:8081 clusters create cluster-a --broker-url pulsar://broker:6650 --url http://broker:8080

These commands register each cluster with the other, which lets them replicate data to each other.
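One way to confirm the link is to list the clusters each side knows about. The sketch below assumes the same admin URLs as the commands above. PULSAR_ADMIN is an assumed variable pointing at the pulsar-admin binary, and list_known_clusters is a hypothetical wrapper around the clusters list subcommand.

```shell
# Assumption: pulsar-admin sits in the current directory, as in the commands above.
PULSAR_ADMIN="${PULSAR_ADMIN:-./pulsar-admin}"

# Hypothetical wrapper: print the clusters registered on one side.
# $1 is the admin URL of the cluster to query.
list_known_clusters() {
  "$PULSAR_ADMIN" --admin-url "$1" clusters list
}
```

Calling list_known_clusters http://localhost:8080 and then list_known_clusters http://localhost:8081 should show both cluster-a and cluster-b on each side if the two create commands succeeded.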

Create Tenants and Namespaces

Next, set up tenants and namespaces in both clusters to facilitate the replication:

./pulsar-admin --admin-url http://localhost:8080 tenants create edge1 --allowed-clusters cluster-a,cluster-b
./pulsar-admin --admin-url http://localhost:8081 tenants create edge1 --allowed-clusters cluster-a,cluster-b
./pulsar-admin --admin-url http://localhost:8080 namespaces create edge1/replicated --clusters cluster-a,cluster-b
./pulsar-admin --admin-url http://localhost:8081 namespaces create edge1/replicated --clusters cluster-a,cluster-b

This configuration ensures that any topic created in the edge1/replicated namespace will automatically be replicated across both clusters.
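Before creating topics, it can be worth checking that the namespace really lists both clusters as replication targets. This is a small sketch using the namespaces get-clusters subcommand. PULSAR_ADMIN is an assumed variable and replication_clusters is a hypothetical wrapper.

```shell
# Assumption: pulsar-admin sits in the current directory, as in the commands above.
PULSAR_ADMIN="${PULSAR_ADMIN:-./pulsar-admin}"

# Hypothetical wrapper: print the replication clusters set on a namespace.
# $1 is the admin URL, $2 is the namespace in tenant/namespace form.
replication_clusters() {
  "$PULSAR_ADMIN" --admin-url "$1" namespaces get-clusters "$2"
}
```

For this PoC, replication_clusters http://localhost:8080 edge1/replicated and the same call against http://localhost:8081 should each report cluster-a and cluster-b.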

Topic Creation

Now, let's create a topic in the replicated namespace:

./pulsar-admin --admin-url http://localhost:8080 topics create persistent://edge1/replicated/events

This topic will serve as the conduit for messages meant to be mirrored between the clusters.

Testing the Setup

Open a consumer on each cluster to check that the messages come through, starting with cluster a:

./pulsar-client --url http://localhost:8080 --listener-name external consume --subscription-name "sub-a" persistent://edge1/replicated/events -n 0

The listener name is important because the Docker DNS is not reachable from the host machine. The Docker Compose file therefore configures an external listener, which the broker advertises to the Pulsar client. The client then uses that listener to connect to the broker after the initial connection.

Open a second terminal and start listening for messages on cluster b:

./pulsar-client --url http://localhost:8081 --listener-name external consume --subscription-name "sub-b" persistent://edge1/replicated/events -n 0

In a third terminal, produce a message:

./pulsar-client --url http://localhost:8080 --listener-name external produce persistent://edge1/replicated/events --messages "Hello world produced to cluster a"

You will see the same message in both clusters.
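The test so far only exercises the direction from cluster a to cluster b. To check that the mirroring is bidirectional, the same produce command can be pointed at cluster b. The sketch below wraps it in a function so the target cluster is a parameter. PULSAR_CLIENT is an assumed variable and produce_event is a hypothetical helper; the flags are the ones already used above.

```shell
# Assumption: pulsar-client sits in the current directory, as in the commands above.
PULSAR_CLIENT="${PULSAR_CLIENT:-./pulsar-client}"

# Hypothetical helper: produce one message to the replicated topic.
# $1 is the service URL of the target cluster, $2 is the message text.
produce_event() {
  "$PULSAR_CLIENT" --url "$1" --listener-name external \
    produce persistent://edge1/replicated/events --messages "$2"
}
```

With both consumers still running, produce_event http://localhost:8081 "Hello world produced to cluster b" should show up on sub-a as well as sub-b, and on the same topic name on both sides.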

Extra

It's also possible to avoid replicating on a per-message basis:

./pulsar-client --url http://localhost:8081 --listener-name external produce --disable-replication persistent://edge1/replicated/events --messages "Hello world produced to cluster b not replicated"

This message shows up only on cluster b and never appears on cluster a.

By following these steps, you've set up bidirectional topic mirroring with Apache Pulsar. Compared with the Mirror Maker topology described earlier, this approach is more straightforward: both clusters share the same topic name, and replication is configured once at the namespace level, which helps keep distributed systems available and their data in sync.
