In yesterday's post I raised some questions on understanding and testing how a database works when it is not open source and not available in a lab. Today, I wanted to verify that, in YugabyteDB, applying the committed transaction intents is done independently by each replica. YugabyteDB is open source, you can run it in a lab (I'll use Docker on my laptop here), even attach a debugger to stop at some points to simulate special conditions. There are also some configuration parameters to easily run into special conditions, because that's how its resilience to failures is tested. Those parameters are not for use in production, and their names start with TEST_.
How it works
In YugabyteDB, the SQL transactions are transformed, by the YSQL layer, based on PostgreSQL, into key-value intents (or provisional records) holding the locks and changes. Those are DocDB operations targeting a tablet (determined by hash or range of the primary key or index key). The operations are sent to the tablet's Leader (each tablet is a Raft group). The Leader writes them to its Write-Ahead Log (WAL), sends them to the Followers, waits for the quorum (a majority of the peers), then writes the change into the IntentsDB and acknowledges this write operation to the SQL layer.
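The write path above can be sketched as a toy model. This is my own illustrative simplification, not YugabyteDB code; all class and function names are assumptions:

```python
# Toy model of the quorum write path: a leader acknowledges a write only
# after a majority of Raft peers (itself included) have appended it to
# their WAL. Names here are illustrative, not actual YugabyteDB internals.

class TabletPeer:
    def __init__(self, name):
        self.name = name
        self.wal = []          # append-only write-ahead log
        self.intents_db = {}   # provisional records, keyed by (row_key, txn_id)

    def append_to_wal(self, record):
        self.wal.append(record)
        return True            # ack back to the leader

def leader_write(leader, followers, record):
    leader.append_to_wal(record)
    acks = 1                   # the leader's own WAL append counts
    for f in followers:
        if f.append_to_wal(record):
            acks += 1
    quorum = (1 + len(followers)) // 2 + 1   # majority of all peers
    if acks >= quorum:
        # only now is the intent made readable and acknowledged upward
        key, txn, value = record
        leader.intents_db[(key, txn)] = value
        return "acknowledged"
    return "retry"

peers = [TabletPeer(n) for n in ("yb0", "yb1", "yb2")]
print(leader_write(peers[0], peers[1:], ("row1", "txn1", "v1")))  # acknowledged
```

With 3 peers the quorum is 2, so the leader can acknowledge as soon as one follower has persisted the record, which is what makes the write path resilient to a single failure.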
Now, any transaction can read those intents:
- the same transaction that did those writes must see the new value
- the other transactions must see the locks, and check the transaction status table to determine whether the intents are committed, and therefore visible, or not
When the SQL layer sends a commit operation, this marker goes to the IntentsDB as the last operation for this transaction.
At this point, the job is done from a SQL point of view. All writes are recorded. What is in memory is protected by the WAL, and what didn't fit in the MemTable was flushed to SST files. Both are on persistent storage. The same has been done on the Leader and at least a majority of the Followers. Anyone can read, and the performance is not so bad because it is an LSM tree where the changes are logically sorted by the key (the primary key for the table, or the indexed columns for secondary indexes).
However, reading from the IntentsDB requires checking the transaction table for the transaction status and commit time. There is also a lot of information that is not needed anymore, like the locks that were released by the commit. This is why, once a transaction is committed, the intents are applied to the RegularDB.
Typically, queries have to read from both of those "databases" (each of them is a kind of RocksDB). The goal is that the IntentsDB stays small, holding only the ongoing transactions, and probably fits in the MemTable.
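The two-store read path and the apply-on-commit step can be sketched like this. Again a deliberate simplification under my own assumptions, with plain dicts standing in for the RocksDB instances:

```python
# Illustrative sketch: reads merge the IntentsDB and the RegularDB,
# resolving intent visibility through the transaction status table;
# after commit, intents are applied (moved) to the RegularDB.

intents_db = {("k1", "txn1"): "v1-new"}   # provisional records with locks
regular_db = {"k1": "v1-old"}             # committed, lock-free records
txn_status = {"txn1": "PENDING"}          # the transaction status table

def read(key, my_txn=None):
    for (k, txn), value in intents_db.items():
        if k != key:
            continue
        if txn == my_txn:                 # my own writes are visible to me
            return value
        if txn_status.get(txn) == "COMMITTED":
            return value                  # committed but not applied yet
    return regular_db.get(key)            # fall back to the RegularDB

def apply_intents(txn_id):
    """Move the committed intents to the RegularDB and drop the locks."""
    for (k, txn) in list(intents_db):
        if txn == txn_id:
            regular_db[k] = intents_db.pop((k, txn))

print(read("k1"))                 # v1-old: intent pending, not visible
print(read("k1", my_txn="txn1"))  # v1-new: own writes are visible
txn_status["txn1"] = "COMMITTED"
print(read("k1"))                 # v1-new: read through the intent
apply_intents("txn1")
print(read("k1"), intents_db)     # v1-new {}: now served by the RegularDB
```

Note how reads are correct both before and after the apply; the apply only makes them cheaper by removing the detour through the transaction status table.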
Note that the WAL (Write-Ahead Logging) happens above them, at the tablet level, on each tablet peer, and the RocksDB WAL itself is disabled in YugabyteDB. Note also that SST file compaction happens later, for each of them. Here, I'm detailing only the apply of the provisional records from the IntentsDB to the RegularDB.
Testing in a lab
With this explanation, it should be clear that applying the provisional records is done independently by each tablet peer. There's no need for additional replication because each replica already has all the intents. However, to validate my understanding, I like to see it 👀. I'll artificially delay this apply operation on the Leader tablet peer to show that the others can apply their provisional records without waiting for the leader.
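Before the lab, here is the same idea as a toy simulation (my own sketch, not YugabyteDB code): once the 'commit' record is replicated, each peer applies its own intents on its own schedule, so delaying one peer does not block the others:

```python
# Each replica applies its intents independently; an injected delay on
# one peer (like the TEST_ flag used below) only postpones its own apply.

class Replica:
    def __init__(self, name, apply_delay_ticks=0):
        self.name = name
        self.delay = apply_delay_ticks   # injected delay, in time units
        self.intents = {"k1": "v1"}      # replicated provisional records
        self.regular = {}

    def tick(self):
        """One unit of time: apply intents once the delay has elapsed."""
        if self.intents and self.delay == 0:
            self.regular.update(self.intents)
            self.intents.clear()
        elif self.delay > 0:
            self.delay -= 1

replicas = [Replica("yb0", apply_delay_ticks=60), Replica("yb1"), Replica("yb2")]
for r in replicas:                       # commit record received: first tick
    r.tick()
print({r.name: bool(r.regular) for r in replicas})
# yb1 and yb2 have applied; yb0 is still waiting out its delay
for _ in range(60):
    for r in replicas:
        r.tick()
print({r.name: bool(r.regular) for r in replicas})
# after the delay, yb0 has applied too, with no extra replication needed
```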
To validate my understanding I'm starting a 3-node cluster. I'll set TEST_inject_sleep_before_applying_intents_ms to 60 seconds for the first tserver, yb0, and leave the default of 0 in the others:
```sh
docker network create -d bridge yb

docker run -d --name yb0 --hostname yb0 \
 --net=yb -p9000:9000 -p7000:7000 -p5433:5433 \
 yugabytedb/yugabyte:188.8.131.52-b30 yugabyted start --daemon=false \
 --tserver_flags=TEST_inject_sleep_before_applying_intents_ms=60000 \
 --listen yb0

until docker exec -it yb0 yb-admin -init_master_addrs yb0 list_all_masters | grep --color=auto -C42 'LEADER' ; do sleep 1 ; done

docker run -d --name yb1 --hostname yb1 \
 --net=yb -p9001:9000 \
 yugabytedb/yugabyte:184.108.40.206-b30 yugabyted start --daemon=false \
 --tserver_flags=TEST_inject_sleep_before_applying_intents_ms=0 \
 --listen yb1 --join yb0

until docker exec -it yb0 yb-admin -init_master_addrs yb0 list_all_masters | grep --color=auto -C42 'FOLLOWER' ; do sleep 1 ; done

docker run -d --name yb2 --hostname yb2 \
 --net=yb -p9002:9000 \
 yugabytedb/yugabyte:220.127.116.11-b30 yugabyted start --daemon=false \
 --tserver_flags=TEST_inject_sleep_before_applying_intents_ms=0 \
 --listen yb2 --join yb0

until docker exec -it yb0 yb-admin -init_master_addrs yb0 get_universe_config | grep --color=auto -C42 '"numReplicas":3' ; do sleep 1 ; done
```
I start a SQL session and create a demo table:
```sh
docker exec -it yb0 ysqlsh -h yb0
```
```sql
create extension orafce;
create table demo (id bigserial primary key, value text);
```
On the Master webconsole (http://localhost:7000/table) I can see 3 tablets and remember which one has its leader on yb0, where I delayed the intents apply:
On the T-Server webconsoles (http://localhost:900[0-2]/varz) I'll look at the size of the MemTable for the IntentsDB and the RegularDB of this tablet. With my empty table, it is the same on the 3 servers, all empty:
I'm looking at the MemTable only because I'll insert only an amount of data that fits in memory and will never be flushed to SST files.
I start a transaction and insert 100MB of data (I use dbms_random.string from the orafce extension), leaving the transaction open:
```sql
begin transaction;
insert into demo(value)
 select dbms_random.string('P',1024*1024)
 from generate_series(1,100);
```
The rows have been distributed to the 3 tablets, so I have one third on each. Those are provisional records (not committed yet) and I can see 36MB in the IntentsDB, the same on the 3 servers:
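A quick back-of-the-envelope check (my own arithmetic, not measured internals) confirms the order of magnitude:

```python
# 100 rows of 1 MiB spread over 3 tablets is about one third each; the
# ~36MB observed per IntentsDB MemTable is that third plus per-intent
# overhead (DocDB keys, hybrid timestamps, locks).
rows, row_bytes, tablets = 100, 1024 * 1024, 3
per_tablet_mib = rows * row_bytes / tablets / (1024 * 1024)
print(f"{per_tablet_mib:.1f} MiB of raw values per tablet")  # 33.3 MiB
```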
Now I commit my transaction and check the size of the RegularDB on all servers. On yb0, nothing has changed, because this is where I delayed the apply with TEST_inject_sleep_before_applying_intents_ms:
The other two servers, yb1 and yb2, with the default configuration, have applied the provisional records from the IntentsDB to the RegularDB, and I can see the same size allocated:
After one minute I come back to yb0 and see that the intents have been applied there as well:
This validates that once the provisional records have been replicated to the followers, each replica can apply them independently when it receives the 'commit' record.
Thanks to the two-layer architecture, YugabyteDB combines the advantages of:
- physical replication (performance, reliability, no conflicts) because from the SQL point of view, DocDB is the storage
- logical replication (does not replicate corruption, reduces network transfer, per-tablet rebalancing, rolling upgrades...) because from a storage point of view, the key-value layer is logical.