Testing LSM-Tree merge for Size Amplification in YugabyteDB


In this series I'll show how to test the compaction algorithms in a small lab. This requires some tweaking of the RocksDB parameters to isolate one algorithm or the other.

Number of files

A few parameters control the number of SST files that:

trigger the compaction
will be merged
throttle writes
stop writes

For Size and Read amplification algorithms

By default, compaction is triggered only when there are at least 5 SST Files at level 0.

This is governed by the RocksDB level0_file_num_compaction_trigger parameter, which is set in YugabyteDB with the rocksdb_level0_file_num_compaction_trigger flag.

To test the compaction algorithms immediately, I set this threshold to the lowest: rocksdb_level0_file_num_compaction_trigger=2

For Read amplification algorithm only

By default, the read amplification algorithm of universal compaction is triggered only when there are at least 4 SST files to merge.

This can be set with rocksdb_universal_compaction_min_merge_width

To disable the Read Amplification algorithm and test only the Size Amplification algorithm, I set this threshold above the maximum number of files, for example rocksdb_universal_compaction_min_merge_width=50 (the startup command below uses 100, which has the same effect).

A higher value makes no difference because there cannot be more than 48 SST files. This limit is set with sst_files_hard_limit.

Testing Size Amplification algorithm

I'm tweaking the cluster configuration to get the behavior I want to investigate. In production, you should probably not touch these parameters. You will see that the settings I use here block all writes after a while.

Start the lab

I did my test on my laptop with docker, starting a shell with:

docker exec -it $(
docker run -d --rm yugabytedb/yugabyte:latest sleep infinity
) bash

Start YugabyteDB with my custom thresholds

In order to test the Size Amplification algorithm only, I start a cluster with the following:

yugabyted start --listen 0.0.0.0 \
--master_flags="\
enable_automatic_tablet_splitting=false\
" \
--tserver_flags="\
memstore_size_mb=5\
,rocksdb_level0_file_num_compaction_trigger=2\
,rocksdb_universal_compaction_min_merge_width=100\
"

In addition to the settings explained above, I set a very low memstore_size_mb, 5 MB instead of the default 128 MB, so that the MemTable is flushed quickly and I can generate many SST files in a small lab.

I have also disabled automatic tablet splitting because I want to see only one tablet (I'll create a range sharded table which starts with only one tablet).
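Before loading data, it can be worth checking that the tserver really picked up these thresholds. The following is a minimal sketch: compaction_flags is a hypothetical helper name, and I assume the tserver web UI listens on port 9000 and lists its flags as --name=value lines at /varz?raw. Adjust the URL if your setup differs.

```shell
# Keep only the flags discussed in this post from a "--name=value" listing read on stdin.
compaction_flags() {
  grep -E '^--(memstore_size_mb|rocksdb_level0_file_num_compaction_trigger|rocksdb_universal_compaction_min_merge_width|sst_files_hard_limit)='
}

# Assumed endpoint (not taken from this post): tserver web UI on localhost:9000.
curl -s --max-time 2 'http://localhost:9000/varz?raw' | compaction_flags \
  || echo "no flags found (is the tserver web UI reachable?)"
```

The filter is kept separate from the curl call so that it can also be applied to a saved copy of the flag listing.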

Insert data

To quickly fill a single-tablet table, I insert 10KB values in batches of 1000 rows:


ysqlsh <<'SQL'

create table if not exists demo
        (id bigint, primary key(id asc), num bigint, value text);

create extension if not exists orafce;

create sequence if not exists demo_sequence_insert;

prepare "inserts" as
 insert into demo
 select nextval('demo_sequence_insert') 
 , 0, dbms_random.string('P',10000::int)
 from generate_series(1,1000);

set statement_timeout=15000;
\timing on

execute "inserts" \; select 
 pg_size_pretty(pg_table_size('demo'::regclass))
 , to_hex('demo'::regclass::oid::int);

\watch 0.1

SQL

This displays the total size of WAL + SST files on each iteration. When the tablet reaches 48 SST files (sst_files_hard_limit), the insert hangs for 15 seconds (set statement_timeout=15000) and then fails.

It also displays the hexadecimal table OID to recognize the rocksdb data directory.

Here is my output after it stopped:

In this lab, started with yugabyted and the default base_dir, the tserver files are under /root/var/data/yb-data/tserver and the SST files are under data/rocksdb/, with one subdirectory per table and then one per tablet. I build the table directory name from the OID of the database (current_database()) and the OID of the table ('demo'::regclass::oid), then list all SST files (*.sst.sblock.0) for this table:

cd $(ysqlsh -tAc"
select format('%s/tserver/%s/table-0000%s00003000800000000000%s',replace(current_setting('data_directory'),'/pg_data','/yb-data'),'data/rocksdb',(select to_hex(oid::int) from pg_database where datname=current_database()),lpad(to_hex('demo'::regclass::oid::int),4,'0'))
") && du -h tablet-*/*.sst.sblock.0 | nl

The output in my example shows the SST files that were present when the writes stopped:

[root@5b2bd0ea1ea6 yugabyte]# cd $(ysqlsh -tAc"
> select format('%s/tserver/%s/table-0000%s00003000800000000000%s',replace(current_setting('data_directory'),'/pg_data','/yb-data'),'data/rocksdb',(select to_hex(oid::int) from pg_database where datname=current_database()),lpad(to_hex('demo'::regclass::oid::int),4,'0'))
> ") && du -h tablet-*/*.sst.sblock.0 | nl

     1  171M    tablet-573c707b7ea5476a8f57539425719c06/000081.sst.sblock.0
     2  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000082.sst.sblock.0
     3  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000084.sst.sblock.0
     4  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000086.sst.sblock.0
     5  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000088.sst.sblock.0
     6  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000090.sst.sblock.0
     7  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000092.sst.sblock.0
     8  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000094.sst.sblock.0
     9  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000096.sst.sblock.0
    10  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000098.sst.sblock.0
    11  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000099.sst.sblock.0
    12  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000101.sst.sblock.0
    13  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000103.sst.sblock.0
    14  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000105.sst.sblock.0
    15  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000107.sst.sblock.0
    16  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000109.sst.sblock.0
    17  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000111.sst.sblock.0
    18  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000112.sst.sblock.0
    19  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000114.sst.sblock.0
    20  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000116.sst.sblock.0
    21  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000118.sst.sblock.0
    22  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000120.sst.sblock.0
    23  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000122.sst.sblock.0
    24  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000124.sst.sblock.0
    25  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000126.sst.sblock.0
    26  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000128.sst.sblock.0
    27  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000130.sst.sblock.0
    28  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000131.sst.sblock.0
    29  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000133.sst.sblock.0
    30  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000135.sst.sblock.0
    31  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000137.sst.sblock.0
    32  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000138.sst.sblock.0
    33  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000140.sst.sblock.0
    34  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000142.sst.sblock.0
    35  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000144.sst.sblock.0
    36  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000146.sst.sblock.0
    37  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000148.sst.sblock.0
    38  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000150.sst.sblock.0
    39  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000152.sst.sblock.0
    40  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000154.sst.sblock.0
    41  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000156.sst.sblock.0
    42  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000158.sst.sblock.0
    43  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000160.sst.sblock.0
    44  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000162.sst.sblock.0
    45  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000164.sst.sblock.0
    46  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000166.sst.sblock.0
    47  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000168.sst.sblock.0
    48  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000170.sst.sblock.0
    49  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000172.sst.sblock.0
    50  4.8M    tablet-573c707b7ea5476a8f57539425719c06/000174.sst.sblock.0

The oldest file is number 000081, which is large (171 MB). The 49 files that follow are still at their initial size (4.8 MB, as flushed from the 5 MB MemTables), so this listing of 50 files ends slightly above the sst_files_hard_limit of 48.
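To compare the file count with the limit without reading the whole listing, a short summary helps. This is a minimal sketch: sst_summary is a hypothetical helper name, and it assumes the same *.sst.sblock.0 naming as in the listing above.

```shell
# sst_summary DIR: count the SST data files under DIR and show the largest one.
sst_summary() {
  local dir=${1:-.}
  local count
  # tr strips the padding that some wc implementations add
  count=$(find "$dir" -name '*.sst.sblock.0' | wc -l | tr -d ' ')
  echo "files: $count"
  # du -k prints "size<TAB>path" in KB; the numeric sort puts the largest last
  find "$dir" -name '*.sst.sblock.0' -exec du -k {} + | sort -n | tail -n 1
}

# From the table directory reached with the cd command above:
sst_summary .
```

The first line of output is the number to compare with sst_files_hard_limit, and the second is the file that holds the result of the previous merges.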

The Size Amplification algorithm merges all files when the newer files (the latest appends to the log structure) add up to at least two times (max_size_amplification_percent=200) the size of the oldest file, which is considered as already merged.
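The trigger condition can be written as a small shell test. This is a sketch under my own assumptions, not the RocksDB code: size_amp_triggered is a hypothetical name, sizes are whole megabytes, and I assume that reaching the percentage exactly is enough to trigger, which matches the 3-file merge observed in this lab.

```shell
# size_amp_triggered PERCENT OLDEST_MB NEWER_MB...
# Succeeds when the newer files together reach PERCENT of the oldest file.
size_amp_triggered() {
  local percent=$1 oldest=$2
  shift 2
  local newer=0 size
  for size in "$@"; do
    newer=$((newer + size))
  done
  # newer / oldest >= percent / 100, rewritten to stay in integer arithmetic
  [ $((newer * 100)) -ge $((percent * oldest)) ]
}

# One 5 MB file followed by two 5 MB flushes: the threshold is reached
size_amp_triggered 200 5 5 5 && echo "merge" || echo "wait"
# One 15 MB file followed by five 5 MB flushes: not yet
size_amp_triggered 200 15 5 5 5 5 5 && echo "merge" || echo "wait"
```

The first call prints merge and the second prints wait, because 25 MB of new files is still below 200% of 15 MB.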

In this example, with approximately 5 MB flushes and only inserts (no tombstones to remove), the first merge happened when there were 3 files (because 2 x 5 MB ≥ 200% x 5 MB), producing one 15 MB file. Another merge happened after 6 more files (as 6 x 5 MB ≥ 200% x 15 MB), producing one 45 MB file. The next one came after 18 more files (as 18 x 5 MB ≥ 200% x 45 MB), producing one 135 MB file. These sizes are theoretical, derived from the flush size. The real files carry additional structures such as indexes and bloom filters, which is how I got a 171M file in my example.

The next compaction should occur after 54 more files (as 54 x 5 MB ≥ 200% x 135 MB), but I never reached this point because of the 48-file limit.
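The whole sequence can be replayed with a small simulation. This is a simplified model, not what RocksDB runs: simulate_size_amp is a hypothetical name, every flush is exactly 5 MB, index and bloom filter overhead is ignored, a merge fires as soon as the newer files reach 200% of the oldest one, and writes stop when the file count reaches the hard limit.

```shell
# simulate_size_amp FLUSH_MB PERCENT HARD_LIMIT
# Adds one flushed file at a time and merges everything into a single file
# whenever the newer files reach PERCENT of the oldest one.
simulate_size_amp() {
  local flush=$1 percent=$2 hard_limit=$3
  local oldest=$flush newer=0 files=1   # state right after the first flush
  while [ "$files" -lt "$hard_limit" ]; do
    newer=$((newer + flush))
    files=$((files + 1))
    if [ $((newer * 100)) -ge $((percent * oldest)) ]; then
      oldest=$((oldest + newer))
      echo "merged $files files into one $oldest MB file"
      newer=0
      files=1
    fi
  done
  echo "writes blocked at $files files, oldest is $oldest MB"
}

simulate_size_amp 5 200 48
# merged 3 files into one 15 MB file
# merged 7 files into one 45 MB file
# merged 19 files into one 135 MB file
# writes blocked at 48 files, oldest is 135 MB
```

With these inputs the model stops with one 135 MB file and 47 small ones, well before the 54 flushes that the next merge would need.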

I've gathered the stats about SST File size and count in Grafana:

This shows the compactions detailed earlier: many small files are replaced by one larger file when the total size of the new ones (yellow) reaches two times the size of the first one (green).

Here is another representation, with the number of files on the y-axis and the size in color from white to red:

In the next post we will add the Read Amplification algorithm, which merges the smaller files together to reduce their number, because you don't want to be stuck with 48 small files.