Understanding Database Replication

#database #computerscience #tutorial #architecture

Imagine you are a software developer at a big e-commerce company and Black Friday is coming soon. You must prepare your database and your system for this event. Otherwise users will have a really bad experience with a very slow system, and in the worst case the system might go down!

You don't want this, so you must find some way to improve your system's performance and availability. But how? With database replication.

What is database replication?

Database replication is a technique that consists of keeping a copy of the same data on multiple machines that are connected via a network. In other words, you will have a database and many identical clones with the exact same data.

Replication may seem easy: just copy data from one point to another. But how do you copy the data? How do you automate this process? A big system receives new data almost constantly, so how do you handle changes to replicated data? To solve these problems, there are many replication algorithms, which are explained in this article.

When to use database replication?

When you need:

To keep data geographically close to your users, achieving lower latency
To ensure that the system will continue to work even if some parts fail, increasing availability
To scale the number of machines that can be accessed to read data

Leader/follower pattern

The leader/follower pattern is the basis of almost all replication algorithms. Think of your database as a collection of nodes, where each node is a machine or container.

If we want replication, all the nodes must have the exact same data. Each node is therefore a replica, because it holds a copy of the database. Every new record in the database must be copied to every replica to maintain consistency, and the most common approach for this is called leader-based replication.

Leader-based Replication

One replica among the others is elected as the leader. The application will always write to the leader first.
The rest of the replicas are known as followers. The followers receive each data change from the leader and then update their own data.
When an application wants to read from the database, it can query either the leader or any of the followers. Write operations, however, are allowed only on the leader.

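Summing up, a write request using this pattern looks like this: the client sends the write to the leader, the leader applies it to its own data, and then the leader forwards the change to every follower.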
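The write path can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not how a real database does it: the `Replica` and `Leader` classes and the `order:1` key are made up for the example, and the network between nodes is replaced by plain method calls.

```python
class Replica:
    """A node that holds a full copy of the data (a plain dict here)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader(Replica):
    """The only replica that accepts writes from the application."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, key, value):
        # 1. The leader applies the change to its own copy first.
        self.apply(key, value)
        # 2. Then it forwards the same change to every follower.
        for follower in self.followers:
            follower.apply(key, value)


followers = [Replica("follower-1"), Replica("follower-2")]
leader = Leader("leader", followers)
leader.write("order:1", "paid")

# Reads can now be served by the leader or by any follower.
print(leader.data["order:1"], followers[0].data["order:1"])
```

After the write, every replica holds the same record, which is exactly the property the pattern is meant to maintain.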

But what about read requests? They can be processed by any of the replicas, leader or follower. For read-heavy applications, leaving all the pressure on the leader isn't a good idea. A possible solution is to expose the followers to the client. When you do this, they become read replicas, which improves performance because read requests are now distributed among many nodes.

The image below shows how the architecture looks with read replicas. A bar was added between the replicas and the client. This bar is just an algorithm that chooses which replica will be queried. There are many algorithms for this, so you can research them later.

If your application is not read-heavy, you don't need to do this. You can keep both write and read operations on the leader replica.
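To make the replica-choosing algorithm more concrete, here is a sketch that uses round-robin, which is only one of many possible strategies. The `RoundRobinRouter` class and the replica names are hypothetical and exist only for this example.

```python
import itertools


class RoundRobinRouter:
    """Sends writes to the leader and spreads reads across the read replicas."""

    def __init__(self, leader, read_replicas):
        self.leader = leader
        # cycle() repeats the replicas forever: 1, 2, 3, 1, 2, 3, ...
        self._next_replica = itertools.cycle(read_replicas)

    def route(self, operation):
        if operation == "write":
            return self.leader
        return next(self._next_replica)


router = RoundRobinRouter("leader", ["replica-1", "replica-2", "replica-3"])
read_targets = [router.route("read") for _ in range(4)]
write_target = router.route("write")
print(read_targets, write_target)
```

Each read goes to the next replica in turn, so no single node takes all the read pressure, while writes still go only to the leader.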

Synchronous vs Asynchronous Replication

When replicating data, you can do it synchronously or asynchronously. You just have to understand the trade-offs and then make your choice.

Synchronous replication

With synchronous replication, data is written to several systems at once rather than one at a time. For example, when the client sends new data to the database, the leader writes it to all synchronous replicas and only confirms the write to the client after they have confirmed it too. In this way multiple replicas are updated together.

The advantage is that the followers are guaranteed to have an up-to-date copy of the data, which means consistency. If the leader fails for some reason, the followers already have the updated data, so the system can elect a new leader and continue operating normally.

The disadvantage is that if a synchronous follower doesn't respond, writes block: new writes won't be processed until the follower responds. This behavior is deliberate, because it maintains consistency, so the leader blocks all writes until the synchronous follower is available again.
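The blocking behavior can be illustrated with a small sketch. It is a simplification under a few assumptions: `Follower`, `SyncLeader` and `FollowerUnavailable` are made-up names, and instead of really waiting, the leader raises an error when a follower is offline, which stands in for "this write is stuck until the follower comes back".

```python
class FollowerUnavailable(Exception):
    """Raised here in place of blocking while a follower is unreachable."""


class Follower:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}


class SyncLeader:
    def __init__(self, followers):
        self.followers = followers
        self.data = {}

    def write(self, key, value):
        # The write cannot proceed unless every synchronous follower
        # is able to acknowledge it.
        offline = [f.name for f in self.followers if not f.online]
        if offline:
            raise FollowerUnavailable(f"waiting for {offline}")
        self.data[key] = value
        for follower in self.followers:
            follower.data[key] = value
        return "ok"  # confirmed only after all replicas have the data


follower = Follower("follower-1")
leader = SyncLeader([follower])
first = leader.write("order:1", "paid")

follower.online = False
try:
    leader.write("order:2", "paid")
    second = "ok"
except FollowerUnavailable:
    second = "blocked"
print(first, second)
```

The second write is never accepted while the follower is down, so the leader and the follower can never disagree about what has been written.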

Asynchronous replication

With asynchronous replication, the leader confirms a write as soon as it has stored the data itself, without waiting for the followers. The data is copied to the followers afterwards, over a period of time. For example, changes might be shipped in batches at a fixed interval of 5 minutes, 10 minutes, etc. The advantage is that the leader can keep processing writes even when a follower is slow or unavailable.

The disadvantage is that if the leader fails and is not recoverable, any writes made in this interval that have not yet been replicated to the followers are lost.
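Here is a minimal sketch of that risk. The names are assumptions made for the example: `AsyncLeader` keeps a queue of changes that have not been shipped yet, `ship_pending` plays the role of the periodic replication step, and followers are plain dicts.

```python
from collections import deque


class AsyncLeader:
    def __init__(self, followers):
        self.followers = followers  # each follower is a plain dict
        self.data = {}
        self.pending = deque()  # changes not yet sent to the followers

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return "ok"  # confirmed before any follower has the change

    def ship_pending(self):
        # Runs later, for example on a timer, and drains the queue.
        while self.pending:
            key, value = self.pending.popleft()
            for follower in self.followers:
                follower[key] = value


follower = {}
leader = AsyncLeader([follower])

leader.write("order:1", "paid")
leader.ship_pending()
leader.write("order:2", "paid")

# If the leader is lost right now, these keys exist nowhere else.
at_risk = set(leader.data) - set(follower)
print(at_risk)
```

Both writes were confirmed to the client, but only the first one has reached the follower. The second one is the kind of write that is lost if the leader fails before the next replication step.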

Semi-synchronous replication

As the name says, this configuration combines synchronous and asynchronous replication. It consists of some synchronous replicas and some asynchronous replicas, so the synchronous ones are guaranteed to be up to date while a write does not have to wait for every replica.

Leader failure

You have scaled and improved your e-commerce platform to handle a big event like Black Friday. Black Friday has begun, you have sold many products and everything is going well. Suddenly you receive an alarm telling you that the leader replica is down and is not recoverable. What are the consequences?

Leaders are the write replicas of the database, so if the leader is down, no new writes can be made. This means that your customers won't be able to buy any product in your store. To avoid this, you have to adopt a failover process. Failover is a process that consists of:

Promoting a follower to be the new leader (election)
Reconfiguring the clients to send their writes to the new leader
Reconfiguring the followers to start consuming data changes from the new leader

The failover process can be done manually or automatically, depending on your configuration.
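The three failover steps can be sketched as a single function. This is an illustration under assumptions, not a production algorithm: the `Node` class, the `log_position` counter and the rule "promote the follower that has applied the most changes" are choices made for this example, and real systems use more careful election protocols.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    log_position: int  # how many changes this node has applied so far
    role: str = "follower"
    replicates_from: Optional[str] = None


def fail_over(followers, clients):
    # 1. Election: promote the follower with the most up-to-date data.
    new_leader = max(followers, key=lambda node: node.log_position)
    new_leader.role = "leader"
    new_leader.replicates_from = None
    # 2. Reconfigure the clients to send their writes to the new leader.
    for client in clients:
        client["write_target"] = new_leader.name
    # 3. Reconfigure the other followers to consume from the new leader.
    for node in followers:
        if node is not new_leader:
            node.replicates_from = new_leader.name
    return new_leader


followers = [
    Node("follower-1", 10, replicates_from="old-leader"),
    Node("follower-2", 12, replicates_from="old-leader"),
    Node("follower-3", 11, replicates_from="old-leader"),
]
clients = [{"name": "checkout", "write_target": "old-leader"}]

new_leader = fail_over(followers, clients)
print(new_leader.name, clients[0]["write_target"])
```

Picking the follower with the highest `log_position` keeps the number of lost writes as small as possible when replication is asynchronous.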

Multi-Leader and Leaderless replication

Multi-leader replication

Multi-leader replication is an alternative to single-leader replication, which is the default. With single-leader replication there is just one leader, so a single node receives all of the system's writes. If this single leader fails, you can't write to the database until a new leader is elected.

Multi-leader replication solves this problem by allowing more than one node to accept writes. But how do you keep data in sync across multiple leaders? There are three main topologies for this:

Circular topology: a circle, one node writes to the next node
Star topology: like a tree, with a root that writes top-down to the other nodes (the leaves)
All-to-all topology: every node writes to every node
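One way to compare the three topologies is to list who sends changes to whom. The sketch below does that with hypothetical helper functions that return `(sender, receiver)` pairs. For the star, it assumes that the first node is the root and that the leaves also send their own writes up to the root so it can forward them, which is an assumption of this example.

```python
def circular(nodes):
    # Each node forwards changes to the next one, and the last to the first.
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]


def star(nodes):
    # The first node is the root; every other node only talks to the root.
    root, *leaves = nodes
    links = []
    for leaf in leaves:
        links.append((leaf, root))  # assumption: leaves send writes up
        links.append((root, leaf))  # the root forwards them top-down
    return links


def all_to_all(nodes):
    # Every node sends its changes directly to every other node.
    return [(a, b) for a in nodes for b in nodes if a != b]


leaders = ["A", "B", "C"]
print(circular(leaders))
print(star(leaders))
print(all_to_all(leaders))
```

The number of links hints at the trade-off: circular and star need few connections, but one broken node can interrupt the flow of changes, while all-to-all has more connections and more alternative paths.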

Leaderless replication

Leaderless replication goes against the default pattern of leader-based replication. Nodes in a leaderless setup are considered peers, and all of them accept writes and reads from the client. There is no leader that handles all write requests.

By adopting this strategy, your system gains availability, since the traffic is better distributed across many nodes. However, you might lose consistency, since the write policies are not as strict. It is important to analyze which strategy is better for your product.
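The consistency risk can be shown with a small sketch. Everything here is an assumption made for illustration: the `Peer` class, the version number that the client attaches to each write, and the idea of reading from several peers and keeping the newest version. Real leaderless databases use more elaborate mechanisms.

```python
class Peer:
    """A node that accepts both reads and writes. There is no leader."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        if not self.online:
            return False  # this peer misses the write
        self.store[key] = (version, value)
        return True

    def read(self, key):
        return self.store.get(key, (0, None))


def write_to_peers(peers, key, version, value):
    # Send the write to every peer and count how many accepted it.
    return sum(peer.write(key, version, value) for peer in peers)


def read_latest(peers, key):
    # Ask several peers and keep the answer with the highest version.
    version, value = max((peer.read(key) for peer in peers), key=lambda r: r[0])
    return value


peers = [Peer("a"), Peer("b"), Peer("c")]
write_to_peers(peers, "stock:42", 1, 10)

peers[2].online = False  # peer "c" goes down for a moment
acks = write_to_peers(peers, "stock:42", 2, 9)
peers[2].online = True

stale = peers[2].read("stock:42")[1]  # reading only from "c"
latest = read_latest(peers, "stock:42")  # reading from all peers
print(acks, stale, latest)
```

The write still succeeds while one peer is down, which is the availability gain. The price is that a client reading only from that peer sees an old value until the peer catches up.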

Summary

It's very important to take care of our data, and replication is a great way to improve the availability and performance of your database. There are many strategies to achieve this, and you can choose the one that best fits your product.

In this article we introduced the concept of replication and saw the leader/follower pattern, which is the basis of almost all replication algorithms. We also saw synchronous and asynchronous replication, as well as semi-synchronous replication, which combines the two. We then saw the failover process, which elects a new leader in case of failure. Finally, we saw multi-leader and leaderless replication, which are alternatives to the default single-leader replication.

Although database replication is a great strategy to improve the availability and performance of your database, it is not a silver bullet. You have to understand the trade-offs and then make your choice. Systems with very high throughput and extremely large datasets might not benefit from replication, and you might have to use database partitioning instead.