Amir Keshavarz

Posted on Aug 15, 2021

A Closer Look Into The World of Database Replication

#tutorial #computerscience #database #devops

We all have used or heard about database replication at some point but have you ever wondered how It works underneath?

That's exactly what we'll be doing here.

Replication is basically copying an object from one space into another. Replicating immutable objects like a movie file is as simple as copying its bytes.

Replication a database which is a mutable object comes with its own challenges. How do you copy an object that is always changing? You don't!

Logs

Logs are the most important part of a database replication process. The basic idea of a log is to record every action database does. We should be able to replay these logs to reconstruct a corrupt database object from scratch so It makes sense to keep them in order.

Let's imagine a key/value store. In this database, we have 3 different types of operation:

SET: Sets a value to a key
GET: Retrieves the value of a key
DELETE: Delete a key

These are the instruction set of our database. Now, let's use them.

I want to set the value of john to the key name, so my operation log would be:

{
  "operation": "SET",
  "key": "name",
  "value": "john"
}

Now let's change the value of that same key to eli:

{
  "operation": "SET",
  "key": "name",
  "value": "eli"
}

And at last, let's remove it:

{
  "operation": "DELETE",
  "key": "name"
}

As you see, logs are only the operations that mutate the database in some way. But something is missing from these logs. Order!

We have to somehow not only keep them in order but to be able to recover from a specific position in the logs.

2 obvious options are:

Incremental integer: Assign an incrementing number to each log
Timestamp: Record the timestamp when the log was executed

We go with the first option for now. We executed 3 operations so the logs will be:

[
  {
    "id": 1,
    "operation": "SET",
    "key": "name",
    "value": "john"
  },
  {
    "id": 2,
    "operation": "SET",
    "key": "name",
    "value": "eli"
  },
  {
    "id": 3,
    "operation": "DELETE",
    "key": "name"
  }
]

Now our logs are ordered and seekable from an id. We can recover the database at any point in time.

I think you get the idea now. When we talk about replication, we're talking about copying these logs into another space. If we're replication the database into another server, we continuously send these logs to the second server and tell it to replay these logs exactly. If the connection drops or something bad happens we always know where to start again because we already have a sortable id.

We now know how databases share their state with other instances. In the next section, we talk about 2 very basic types of replication.

Types

Synchronous

In synchronous replication, we commit changes to all of the instances before getting back to the client.

Imagine we have 3 instances in a database cluster and we want to use the synchronous replication type.

When we send a SET operation to instance A which is the master, It immediately sends the log to other replica instances. These replicas execute the log and let the master know that they've executed the logs. After making sure all replicas have executed the change, the master itself also executes the log and returns a response to let the user know that the change was done successfully.

The advantage of this type is the level of consistency that we get by executing the change on all instances before committing the change.

And the disadvantage is the overhead in latency which causes the whole process to be slow.

Asynchronous

Unlike the synchronous type, in this type, we can't guarantee consistency because we consider the change committed before it's executed on other instances.

As before, imagine we have 3 instances in a database cluster and we want to use an asynchronous replication type.

When we send a SET operation to instance A which is the master, It generates a log and executes it. After that, the asynchronous process of sending the logs to other instances begins but the master won't wait on these and it immediately gets back to the user with a successful response.

The advantage of this type is the low latency write operations since it only mutates the master before being considered committed.

The disadvantage of this type is the lack of consistency guarantee since we can't be sure the replication takes place at all. Master might be caught on fire before getting the chance to send log to other replicas :)

Consistency

We talked about how the replication process can be synchronous or asynchronous. But let's get back one step backward and discuss consistency from distributed systems' point of view.

In a distributed system like a replicated database cluster, we need some type of consistency because that's the whole purpose of replication!

Consistency is achieved when multiple nodes agree on a single state.

Let's show it by an example.

The state of node A is:

[
  {
    "key": "name",
    "value": "eli"
  }
]

The state of node B also is:

[
  {
    "key": "name",
    "value": "eli"
  }
]

As you can see, both nodes agree that the value of name is eli so they're consistent.

There are multiple levels of consistency and at the most basic level, it correlates to the replications types.

We'll discuss 2 types of consistency here.

Strong Consistency

In a distributed system that supports strong consistency, all nodes are always consistent at any given moment. It doesn't matter what node you write to or read from; the result is always guaranteed to be the same.

This consistency is achieved using consensus protocols like Paxos or Raft.

Weak Consistency

In many cases, we don't really need strong consistency or need some other property that can't coexist with strong consistency.

In this type of consistency, there's no guarantee that the state of your replica is up-to-date with the master.

Eventual Consistency

Eventual consistency is a type of weak consistency itself. As the name suggests, eventual consistency guarantees that even though replicas may not be up-to-date but will be consistent with the master eventually.

Basically, if you stop writing to the master, after some time, all nodes will be consistent and all replicas will answer the most recent values to you.

Sync Replication

We already talked about synchronous and asynchronous replication but in this section, we discuss in more detail and also about its real-world implementation.

Before committing any transaction, we need to make sure all nodes are on the same page. That's just consensus so we need one.

Synchronous replication requires a round-trip from master to most or all nodes for every single log and that's why synchronous replication can be really slow compared to asynchronous replication. Strong consistency has a huge cost!

The most famous consensus protocol is Paxos but It's not the easiest protocol to explain. There's a new shiny protocol called Raft that has gotten lots of traction in the past few years and lots of big projects like etcd are already using it. It's a lot easier to learn compared to Paxos.

Raft

Raft is a relatively new asymmetric consensus protocol designed to be really easy to understand.

Nodes in Raft are always in one of these states:

Follower
Candidate
Leader

One of the most important things in Raft is the fact that followers trust the leader and will redirect all write operations to the leader. The leader will coordinate all changes.

Raft slices the time into terms. Think of them exactly as election terms. At the beginning of the term, nodes become candidates, vote for themself and ask others to vote for them. After some time, a leader is elected and Raft begins to exchange messages between nodes.

An election timeout is specified to make sure there's always a leader present. This timeout is pushed back by ping messages by the leader but If the leader steps back or has failed, another election will take place and all of those steps are repeated until another leader is elected.

Real-world

The etcd key/value store which powers Kubernetes is a good example here. It uses Raft to replicate logs synchronously to achieve strong consistency.

You can see the etcd implementation of Raft in here:
https://github.com/etcd-io/etcd/tree/main/raft

Asynchronous replication

Ok, let's talk a bit more about asynchronous replication. You now know how asynchronous replication works in theory. Some logs are sent asynchronously and there's no guarantee of consistency at any given moment.

We'll discuss Ordered Log Replication here which is really simple with a few challenges.

Ordered Log Replication

As the name suggests, this technique involves persisting an ordered list of logs and sending them over the wire, asynchronously.

Ordering means that writes are usually written to master to ensure the consistency of end result. All logs are tagged with a timestamp or an incremental number.

Immediately you notice a bottleneck here since all logs need to be ordered. Some databases use sharding in order to increase the throughput.

Each of these challenges can be a post themself and most of them are also present in the synchronous type of replication.

Real-world

MySQL replication is a really good example of this type of replication.

MySQL keeps an ordered log of every change in the master by using an incremental integer known as position.

Logs are kept in binlog-* files and are rotated regularly. Replicas (known as slaves in MySQL - fortunately, it's about to change) receive these logs and execute them.

The basic process of adding replicas to MySQL is a little bit too hard. It requires you to take a snapshot of the database and recording the master position and then restoring that snapshot in replica and starting the replication process from that exact position. Lots of other databases do that same thing underneath but I think MySQL could do these without requiring users to do these tasks manually.

Conclusion

In this post, we learned a bit more about database replication and its different types. We discussed different levels of consistency and how replication works in the real world.

I might write more about databases in the future. Let me know If you're interested :)

Thank you for your time.