Frits Hoogland for YugabyteDB

Watching a YugabyteDB table: replica destruction and recovery

In the previous articles we have seen how the replication factor (RF) works, how tablets and their replicas work based on the replication factor, and what happens when a minority of replicas fails (the tablet remains serving) and when a majority fails (the tablet goes down).

But what happens when a minority fails and is destroyed 'forever'?

I started up a minimalistic YugabyteDB cluster with 3 tablet servers:

➜ yb_stats --print-tablet-servers
yb-1.local:9000      ALIVE Placement: local.local.local1
                     HB time: 0.9s, Uptime: 487, Ram 33.90 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166162432 (1.55%)
yb-2.local:9000      ALIVE Placement: local.local.local2
                     HB time: 0.9s, Uptime: 487, Ram 33.10 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166121472 (1.55%)
yb-3.local:9000      ALIVE Placement: local.local.local3
                     HB time: 0.5s, Uptime: 486, Ram 34.31 MB
                     SST files: nr: 0, size: 0 B, uncompressed: 0 B
                     ops read: 0, write: 0
                     tablets: active: 24, user (leader/total): 0/0, system (leader/total): 8/24
                     Path: /mnt/d0, total: 10724835328, used: 166395904 (1.55%)

For the sake of the demo, we need actual user objects.
Let's create a simple table and fill it with data:

create table test (id int primary key, f1 text) split into 1 tablets;
insert into test select id, repeat('x',1000) from generate_series(1,1000000) id;
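As a quick sanity check, you can verify that all rows made it into the table:

select count(*) from test;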

Let's see what that looks like:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-1.local:9100(VOTER),)
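The same information can be cross-checked with yb-admin. A minimal sketch, assuming the masters listen on the default RPC port 7100:

# list the tablets of the test table and their leaders
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_tablets ysql.yugabyte test
# list the replicas of the single tablet, including member type (VOTER) and leader
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_tablet_servers acc8f56170dd48598f0ee09a8fa8ba6f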

One object (ysql.yugabyte.test), with one tablet, with 3 replicas. All three replicas are 'VOTER'; the replica on yb-2.local is the LEADER.

Exterminate!

Now let's simulate permanent tablet server 'destruction'. This is a combination of two steps. The first step is stopping the tablet server:

sudo systemctl stop yb-tserver

This is what is shown in Watching a YugabyteDB table: tablets and replicas.

The second step is to delete the tablet server data, which is stored at the locations indicated by the parameter fs_data_dirs:

  • fs_data_dirs/pg_data (intermediate postgres files for YSQL)
  • fs_data_dirs/yb-data/tserver (the tablet server data)

Obviously, this removes all data from the tablet server. In my cluster this is done in this way:

sudo rm -rf /mnt/d0/pg_data
sudo rm -rf /mnt/d0/yb-data/tserver
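If you are unsure where fs_data_dirs points on a node, the running tablet server reports its gflags on its web UI. A sketch, assuming the default tablet server web UI port 9000 (run this before stopping the server):

# show the fs_data_dirs setting of the tablet server
curl -s http://yb-3.local:9000/varz | grep fs_data_dirs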

Important!

But there is more that is important to know: if the tablet server is now started again, it will initialise itself because there is no prior data, and it will start as a completely new tablet server, despite having the exact same hostname and port number. This is visible on the tablet server screen in the master (master/tablet-servers) via the tablet server UUID: that has changed, making it a totally different server to the cluster.
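You can also see this on the command line by listing the tablet servers with their UUIDs before the destruction and after the restart. A minimal sketch, assuming the masters listen on the default RPC port 7100:

# list all tablet servers known to the masters, including their UUIDs;
# after the wipe-and-restart, yb-3.local shows up with a new UUID
yb-admin -master_addresses yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 list_all_tablet_servers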

The situation after termination

I stopped the tablet server on my node yb-3.local (my "third node"). After the tserver_unresponsive_timeout_ms timeout, the tablet server is considered DEAD, and thus the tablet is under-replicated:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING [UNDER REPLICATED]
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER[DEAD]), yb-1.local:9100(VOTER),)

And because the number of unavailable replicas is a minority, the tablet remains available; the dead replica will be removed from the tablet administration after follower_unavailable_considered_failed_sec.
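Both timeouts are gflags that can be inspected on the servers' web UIs. A sketch, assuming the default master web UI port 7000 and tablet server web UI port 9000; the defaults in the comments are the values at the time of writing:

# tserver_unresponsive_timeout_ms is a master flag (default: 60000 ms = 1 minute)
curl -s http://yb-1.local:7000/varz | grep tserver_unresponsive_timeout_ms
# follower_unavailable_considered_failed_sec is a tablet server flag (default: 900 s = 15 minutes)
curl -s http://yb-2.local:9000/varz | grep follower_unavailable_considered_failed_sec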

Adding the tablet server

Because we removed the data directories, the tablet server on the third node will be a completely new tablet server, which for the cluster means it has a different UUID. The cluster does not take the server name or IP address into account; it uses the UUID exclusively.

Start up the tablet server:

sudo systemctl start yb-tserver

If the database objects on the YugabyteDB cluster are now queried again, you will find that the tablet has its third replica added back:

➜ yb_stats --print-entities
Keyspace:     ysql.postgres id: 000033e6000030008000000000000000
Keyspace:     ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace:     ysql.system_platform id: 000033e9000030008000000000000000
Object:       ysql.yugabyte.test, state: RUNNING, id: 000033e8000030008000000000004000
  Tablet:     ysql.yugabyte.test.acc8f56170dd48598f0ee09a8fa8ba6f state: RUNNING
    Replicas: (yb-2.local:9100(VOTER:LEADER), yb-1.local:9100(VOTER), yb-3.local:9100(VOTER),)

This is great, but you should be aware that a lot of things happened automatically, which the YugabyteDB cluster handled for you.

The situation before starting the tablet server that we "exterminated" by removing its data was (for the user table) as follows:

  • The table uses a single tablet, so there was 1 user tablet in this test scenario.
  • The tablet uses replication factor 3, so for it to function we need at least a majority (2 replicas), which allows a leader to be elected; this was the case.
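As a worked example: the required majority is floor(RF / 2) + 1. With RF 3 that is floor(3 / 2) + 1 = 2, so one replica can be missing while the tablet keeps serving; with RF 5 it would be 3, tolerating two missing replicas.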

When the tablet server was started:

  • The cluster (load balancer) found a placement option for the third replica.
  • The third replica was placed on the tablet server using a process called 'remote bootstrapping'.
  • After the replica was bootstrapped, the data was transferred.
  • The newly added replica was synchronised and added to the tablet, making it a fully functioning replica again. You can watch this happen, as shown in the sketch below.
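To watch this happen, you can poll the replica list of the tablet while the new tablet server is bootstrapping. A minimal sketch, assuming the default master RPC port 7100 and the tablet UUID from the output above:

# poll the replica placement of the tablet every 5 seconds;
# during the remote bootstrap the new replica shows up as PRE_VOTER,
# and becomes a VOTER once it has caught up
watch -n 5 "yb-admin -master_addresses yb-1.local:7100 list_tablet_servers acc8f56170dd48598f0ee09a8fa8ba6f"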

Warning

On my test cluster, this only mostly happened. My test cluster is sized smaller than the minimal sizing rules found here: hardware requirements.
As a consequence, the remote bootstrapping put the replica in the status 'PRE_VOTER' and tried allocating a buffer for the data transfer, which it couldn't do because the memory area was too small, and therefore it stopped after bootstrapping in the PRE_VOTER state. To solve this issue, I had to add --remote_bootstrap_max_chunk_size=20000000 to the tablet server configuration file; this will be the subject of another blogpost.
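For illustration, this is how such a gflag can be added when the tablet server runs under systemd. The configuration file location is an assumption and depends on how your cluster was installed:

# append the flag to the tablet server gflags file
# (/etc/yugabyte/tserver.conf is an assumed path; adjust it to your installation)
echo '--remote_bootstrap_max_chunk_size=20000000' | sudo tee -a /etc/yugabyte/tserver.conf
# restart the tablet server to pick up the new flag
sudo systemctl restart yb-tserver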

Conclusion

This blogpost shows one of the key strengths of a YugabyteDB cluster: completely transparently to the database APIs, we simulated the failure of a node, which didn't interrupt the availability of the database, and once we introduced a "replacement" node, the database repaired itself.
