Frits Hoogland for YugabyteDB

Watching a YugabyteDB table: replica destruction and recovery

In the previous articles we have seen how the replication factor (RF) works, how tablets and their replicas work based on the replication factor, and what happens when a minority of replicas fails (the tablet remains serving) and when a majority fails (the tablet goes down).

But what happens when a minority fails and is destroyed 'forever'?

I started up a minimalistic YugabyteDB cluster with 3 tablet servers:

➜ yb_stats --print-tablet-servers
yb-1.local:9000      ALIVE Placement: local.local.local1
                     HB time: 0.9s, Uptime: 487, Ram 33.90 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166162432 (1.55%)
yb-2.local:9000      ALIVE Placement: local.local.local2
                     HB time: 0.9s, Uptime: 487, Ram 33.10 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166121472 (1.55%)
yb-3.local:9000      ALIVE Placement: local.local.local3
                     HB time: 0.5s, Uptime: 486, Ram 34.31 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166395904 (1.55%)

For the sake of the demo, we need actual user objects.
Let's create a simple table and fill it with data:

create table test (id int primary key, f1 text) split into 1 tablets;
insert into test select id, repeat('x',1000) from generate_series(1,1000000) id;
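As a quick sanity check, you can verify that all rows made it into the table:

select count(*) from test;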

Let's see what that looks like:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-1.local:9100(VOTER),)
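The same information can be cross-checked with yb-admin. A minimal sketch, assuming the masters listen on the default RPC port 7100:

# list the tablets of the test table and their leaders
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_tablets ysql.yugabyte test
# list the replicas of the single tablet, including member type (VOTER) and leader
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_tablet_servers acc8f56170dd48598f0ee09a8fa8ba6f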

One object (ysql.yugabyte.test), with one tablet, with 3 replicas. All three replicas are 'VOTER'; the replica on yb-2.local is the LEADER.

Exterminate!

Now let's simulate permanent tablet server 'destruction'. This is a combination of two steps. The first step is stopping the tablet server:

sudo systemctl stop yb-tserver

This is what is shown in Watching a YugabyteDB table: tablets and replicas.

The second step is to delete the tablet server data, which is stored at the locations indicated by the parameter fs_data_dirs:

  • fs_data_dirs/pg_data (intermediate postgres files for YSQL)
  • fs_data_dirs/yb-data/tserver (the tablet server data)

Obviously, this removes all data from the tablet server. In my cluster this is done in this way:

sudo rm -rf /mnt/d0/pg_data
sudo rm -rf /mnt/d0/yb-data/tserver
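If you are unsure where fs_data_dirs points on a node, the running tablet server reports its gflags on its web UI. A sketch, assuming the default tablet server web UI port 9000 (run this before stopping the server):

# show the fs_data_dirs setting of the tablet server
curl -s http://yb-3.local:9000/varz | grep fs_data_dirs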

Important!

But there is more that is important to know: if the tablet server is now started again, it will initialise itself because there is no prior data, and it will start as a completely new tablet server, despite having the exact same hostname and port number. This is visible on the tablet server screen in the master (master/tablet-servers) via the tablet server UUID: that has changed, making it a totally different server to the cluster.
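You can also see this on the command line by listing the tablet servers with their UUIDs before the destruction and after the restart. A minimal sketch, assuming the masters listen on the default RPC port 7100:

# list all tablet servers known to the masters, including their UUIDs;
# after the wipe-and-restart, yb-3.local shows up with a new UUID
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_all_tablet_servers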

The situation after termination

I stopped the tablet server on my node yb-3.local (my "third node"). After the tserver_unresponsive_timeout_ms timeout, the tablet server is considered DEAD, and thus the tablet is under-replicated:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING [UNDER REPLICATED]
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER[DEAD]), yb-1.local:9100(VOTER),)

And because the number of unavailable replicas is a minority, the tablet remains available; the dead replica will be removed from the tablet administration after follower_unavailable_considered_failed_sec.
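Both timeouts are gflags that can be inspected on the servers' web UIs. A sketch, assuming the default master web UI port 7000 and tablet server web UI port 9000; the defaults in the comments are the values at the time of writing:

# tserver_unresponsive_timeout_ms is a master flag (default: 60000 ms = 1 minute)
curl -s http://yb-1.local:7000/varz | grep tserver_unresponsive_timeout_ms
# follower_unavailable_considered_failed_sec is a tablet server flag (default: 900 s = 15 minutes)
curl -s http://yb-2.local:9000/varz | grep follower_unavailable_considered_failed_sec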

Adding the tablet server

Because we removed the data directories, the tablet server on the third node will be a completely new tablet server, which for the cluster means it has a different UUID. The cluster does not take the server name or IP address into account; it uses the UUID exclusively.

Start up the tablet server:

sudo systemctl start yb-tserver

If the database objects on the YugabyteDB cluster are now queried again, you will find that the tablet has its third replica added back:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-1.local:9100(VOTER), yb-3.local:9100(VOTER),)

This is great, but you should be aware that a lot of things happened automatically, which the YugabyteDB cluster handled for you.

The situation before starting the tablet server that we "exterminated" by removing its data was (for the user table) as follows:

  • The table uses a single tablet, so there was 1 user tablet in this test scenario.
  • The tablet uses replication factor 3, so for it to function we need at least a majority (2 replicas), which allows a leader to be elected; this was the case.
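As a worked example: the required majority is floor(RF / 2) + 1. With RF 3 that is floor(3 / 2) + 1 = 2, so one replica can be missing while the tablet keeps serving; with RF 5 it would be 3, tolerating two missing replicas.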

When the tablet server was started:

  • The cluster (load balancer) found a placement option for the third replica.
  • The third replica was placed on the tablet server using a process called 'remote bootstrapping'.
  • After the replica was bootstrapped, the data was transferred.
  • The newly added replica was synchronised and added to the tablet, making it a fully functioning replica again. You can watch this happen, as shown in the sketch below.
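To watch this happen, you can poll the replica list of the tablet while the new tablet server is bootstrapping. A minimal sketch, assuming the default master RPC port 7100 and the tablet UUID from the output above:

# poll the replica placement of the tablet every 5 seconds;
# during the remote bootstrap the new replica shows up as PRE_VOTER,
# and becomes a VOTER once it has caught up
watch -n 5 "yb-admin -master_addresses yb-1.local:7100 list_tablet_servers acc8f56170dd48598f0ee09a8fa8ba6f"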

Warning

On my test cluster, this only mostly happened. My test cluster is sized smaller than the minimal sizing rules found here: hardware requirements.
As a consequence, the remote bootstrapping put the replica in the status 'PRE_VOTER' and tried allocating a buffer for the data transfer, which it couldn't do because the memory area was too small, and therefore it stopped after bootstrapping in the PRE_VOTER state. To solve this issue, I had to add --remote_bootstrap_max_chunk_size=20000000 to the tablet server configuration file; this will be the subject of another blogpost.
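For illustration, this is how such a gflag can be added when the tablet server runs under systemd. The configuration file location is an assumption and depends on how your cluster was installed:

# append the flag to the tablet server gflags file
# (/etc/yugabyte/tserver.conf is an assumed path; adjust it to your installation)
echo '--remote_bootstrap_max_chunk_size=20000000' | sudo tee -a /etc/yugabyte/tserver.conf
# restart the tablet server to pick up the new flag
sudo systemctl restart yb-tserver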

Conclusion

This blogpost shows one of the key strengths of a YugabyteDB cluster: completely transparently to the database APIs, we simulated the failure of a node, which didn't interrupt the availability of the database, and once we introduced a "replacement" node, the database repaired itself.
