Klim Markelov

Posted on Apr 18, 2021

Introduction to MySQL replication

#mysql #replication #distributedsystems

Hello, ultra devs! 👋🏻⚡️
Today I would like to talk about Replication and how it works in MySQL.

Let's start from the beginning. If you know what replication is, feel free to skip the next chapter.

What is replication?

Basically, a replication mechanism means storing some copy of data on multiple machines. "How is it different from backups?" you may wonder. Replication is a bit more than that. When a backup is just a snapshot of the data in a certain time, replication helps you to not only keep the copy of the data in real-time ensuring availability but also facilitate the overload on the database providing both reading and writing to the client and therefore increase throughput. Also, replication helps you to distribute your data storage across the globe decreasing the response time for clients from different parts of the world.

In this article, we will be talking about leader-based replication, and to continue our journey into this world, it's required to introduce several terms:

Leader (master) – part of the replication system eligible to write and read from the database.
Follower (replica) – part of the replication system eligible only to read.

Basically, leader is responsible for all inserts, updates, deletes, and once these changes go through the leader, it transfers these changes to all its followers that are responsible for reads and never writes.

Here is a simple example of single-leader replication with two followers:

How it works in MySQL?

Ok, now we know what replication is, but how it actually works in MySQL? How data got transferred from leader to followers and how MySQL keeps the consistency between them?

Imagine the situation that dev.to is powered by MySQL. You just wrote the article and clicked on the Publish button. Here is what happens:

Data comes to the leader and get saved in the database;
The leader saves data changes in the special file called binary log;
Follower copies changes in binary log (binlog) to its own file called relay log;
Follower replays these changes from relay log to its own data.

As you can see, to synchronize relay log with leader's binary log, MySQL starts a worker thread that is called I/O follower thread. It's basically an ordinary client connection to the leader that starts reading its binary log.

Digging a bit into details, we can ask a quite reasonable question: in which format do binary log and relay log store the data?

Replication types

Currently, MySQL supports two types of replication:

Statement-based replication;
Row-based replication.

Statement-based replication

So, as it's clear from the name, statement-based replication records the whole query that changed the state of the data in binlog. So, when a follower decides to synchronize its data with the leader, it copies the query and replays it by executing this query and applying changes to its own data.

This kind of replication is very easy to implement and has multiple advantages:

It still works when the schema is different on the leader and the follower;
It's easy to audit and debug;
It requires not that much disk space.

Of course, with great advantages comes great disadvantages:

Non-deterministic functions. With non-deterministic functions, it can come up with different data on leader and follower. By non-deterministic functions I mean function like the following: CURRENT_USER(), RAND(), IS_FREE_LOCK() and so on. Executing them first on leader and then on follower can lead to inconsistent data;
Performance penalty. Imagine if you execute the following query:

INSERT INTO post_statistics VALUES (
  SELECT status AS statistics_type, COUNT(id) AS posts_amount FROM posts GROUP BY status
)

without having an index on the status field, and after pressing 'Enter' you just went for a tea (a hypothetical situation, I know, you probably drink coffee). The query got executed on leader, consuming all available CPU, and then follower picked up the baton, copied the query to its relay log, and cheerfully ate all CPU as well;

Triggers and stored routies. Triggers and stored routines, as well as Non-deterministic functions, can cause a lot of problems with different side-effects on leader and follower.

So, Statement-based replication has its own advantages, but big disadvantages. Therefore not every database supports this type of replication, but in the case of MySQL, up until MySQL 5.0, this type of replication was the only one supported.

Row-based replication

Compare to Statement-based replication, Row-based replication stores the actual data changes in binary log, but not the query. So, when a follower replicates the data, it doesn't execute the query, but applies the changes to each record it was applied to on leader.

Let's consider the advantages of this approach:

Less CPU intensive. If we execute the query, described in Statement-based replication chapter, follower does not replay this query, but copy the value and apply the change to its own data record. So, the query gets executed once and doesn't consume all available CPU;
Helps to find data inconsistency. Since Row-based replication stores the changes only, when follower replays these changes and tries to apply to the data that exists on leader but doesn't exist on follower, it throws the error. Meanwhile statement-based replication proceeds with what it has and keeps the inconsistency hidden, complicating the ability to find the point of failure and fix it;
No non-deterministic behavior. Compare to statement-based replication, if you execute the query that has Non-deterministic functions, it ends up with the same result for both leader and follower.

Looks nice, that's what we were expecting from replication, right? But along with the advantages come disadvantages:

High disk space consumption. Yeah, we just talked about Less CPU consuming for this replication, but the disk space is also very important. Imagine if you have the following statement:

UPDATE posts SET status = 'draft' WHERE status = 'published'

Considering the fact, that the posts table has about 1.000.000 of Published posts, this query becomes quite expensive, since it requires storing 1.000.000 of changes in binary/relay log files;

Does not allow different schemas. Sometimes it might be useful when you have different schemas on leader and follower (I don't know about these cases, but they definitely exist). As it was described above, row-based replication throws an error in case of data inconsistency, caused by different schemas;
Statement is not included in the binary log. It can be not a problem at all until you try to debug or audit what's going on and what query caused damage to your database. Row-based replication makes it hard to analyze.

Now we are fluent in replication process language. Everything is clear. Hold on, on the picture of replication example, we can see one leader and two followers. Can we do things differently?🕵️‍♀️

MySQL supported topologies

Single-leader replication

This type of replication is the most common one. It is useful when you have a lot of reads but not that many writes. You can distribute users reads among followers load-balancing them and therefore providing better response time. With this replication topology, you can easily add one more follower to it. Also, this topology prevents a lot of problems that multiple leaders topology have (will be described in Leader-leader replication), since it has only one leader.

Leader-leader replication

As it is visible from the picture, this topology involves two leaders.

This topology is useful when you have different data centers in different locations and you need to provide fast writes to both regions.

But with this advantage comes a great cost. Suppose, we have a table post_statistics and you just realized that the number of posts with Published status is triple more than it's actually written in table and you decide to fix the situation. So, you connect to MySQL and execute the following query:

UPDATE posts_statistics SET posts_amount=posts_amount * 3 WHERE status='published'

Meanwhile, somebody from a different part of the world just published his/her first post (just like me) and triggered the following query to be executed:

UPDATE posts_statistics SET posts_amount=posts_amount + 1

Suppose, the original number of posts was 10k. Due to replication lag, databases ended up with two different numbers: 30.001 and 30.003. And no errors were thrown.

This is a big disadvantage of this topology, and in practice, it brings more problems than advantages. But if you ended up with this topology, it's better to add few more replicas to it 😁

Active-passive leader-leader replication

In Active-passive leader-leader replication topology one server takes the role of the leader and another one takes the role of the follower. But in comparison to the ordinary leader-follower topology, it allows you to swap easily the leader responsibility from one server to another.

It's useful in many cases. For example, if you execute ALTER TABLE that locks the whole table for reads and writes, you can stop the replication process, easily swap leader responsibility, execute ALTER TABLE on the passive server, then swap the responsibility back, restore replication process, and execute ALTER TABLE on the remaining server. It can help you to keep your service alive while executing that expensive query.

Other topologies

There are many other topologies that are supported by MySQL:

Replication Ring topology
Tree of pyramid topology

And many others. You can choose the best topology that fits your purposes or create your own. This is the list 👆🏻 of the most common topologies used in MySQL.

Summary

Replication is a mechanism of having a consistent copy of the data storage. It provides:

Data distribution;
Load balancing;
Backups;
High availability and failover.

Leader-based replication consists of leader and follower. Both of them have their own journal of changes: binary log and relay log.

There are two types of replication:

Statement-based replication. It's represented in queries itself.
Row-based replication. It's represented in direct data changes.

There are multiple topologies for replication:

Leader-follower topology;
Leader-leader topology;
Leader-leader active-passive topology;
Ring topology;
Tree of Pyramids topology. And many specialized topologies together with custom ones.

That's it! Thank you for your attention! I hope you liked this post 😌

Top comments (4)

Iago Angelim Costa Cavalcante • Jun 1

I'm thinking implement something like this, but I have a special case here, I need to bidrectional data flow, for example. When I change something on the web this changed data will be reflected on desktop app (its offline, and syncronize with cloud every 1min). Do you have some considerations for this? Any study I can read?