Databases maintain indexes to find a value or a range of values without scanning the whole table. The index is physically organized and ordered by the values you search on, so that each value has a well-defined place where it can be found.
With a tree structure over the index entries, the database can find this place with a few random reads, keeping read amplification minimal.
There are two major index structures used in databases:
- B-Tree: The tree's small height, which increases logarithmically with the number of index entries, minimizes the read amplification required to find an index entry (with a few hundred entries per block, three levels are enough for millions of rows). However, maintaining this structure involves variable write amplification to keep it balanced.
- LSM-Tree: The write amplification is limited by a log structure that appends the new index entry in memory. Those entries are flushed and compacted in the background, typically resulting in a few sorted files in one or multiple levels. The read amplification comes from merging those sorted runs.
SQL databases perform frequent reading operations. Even a simple insert operation requires an index read to detect duplicate keys.
In the past, random reads were slow because they had to wait for the mechanical disk rotation (HDD). To achieve predictable performance on reads, databases designed at that time used B-trees. The higher levels of the B-tree could remain in cache, in shared memory buffers, and random reads were limited by the B-tree depth.
Today, with flash memory (SSD), random reads have low and predictable latency. It is now acceptable to do more reads to limit the write amplification.
Many modern databases use LSM trees, where writes are fast: they are appended to the first level, a memory table protected by a WAL (written sequentially). To limit memory usage, memtables are flushed to SST files, which must be compacted in the background to limit the read amplification. This is done by reading the LSM tree (random reads) and writing new SST files (sequential writes). LSM trees became popular, especially with RocksDB and derived databases, because they fit the performance profile of SSD storage, where random reads are fast but random writes should be avoided.
B-Tree write amplification with Oracle
Let's take an example to illustrate how writes to B-trees can be slow and unpredictable. I create the following table in Oracle 19c Autonomous Database:
create table demo (
id raw(16) default sys_guid() primary key
, ts timestamp default current_timestamp
, reads int
, writes int
);
I will insert one million rows into this table and count the number of reads (db block gets for the current version and consistent gets for the MVCC version) and writes (db block changes). Rather than inserting random rows and observing the statistics, I'm inserting the statistics themselves to analyze later.
Here is what I've run in SQL Developer, with a repeat command to insert one million rows that hold my session statistics about block reads and writes:
insert into demo (reads, writes)
select bg+cg , bc from (
select name,value from v$mystat
join v$statname using(statistic#)
) pivot (
sum(value) for name in ('db block changes' bc,'db block gets' bg
, 'consistent gets' cg
,'redo size' rs)
);
repeat 999999 0.001
commit;
Remember to commit before losing your transaction after the autonomous idle timeout, as SQL Developer disobeys auto-commit with repeat.
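To see what these counters look like before pivoting, the same views can be queried directly. This is just a sketch, using the same statistic names as the pivot above:
-- cumulative statistics for the current session
select name, value
  from v$mystat join v$statname using(statistic#)
 where name in ('db block changes','db block gets','consistent gets','redo size')
 order by name;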
I use this simple query to calculate the delta from the cumulative statistics, group on the number of reads and writes, and show the sum:
select count(*), reads+writes, reads, writes from (
select id
, reads-lag(reads)over(order by ts) reads
, writes-lag(writes)over(order by ts) writes
from demo
) group by reads, writes order by reads+writes asc
;
COUNT(*) READS+WRITES READS WRITES
___________ _______________ ________ _________
162,349 10 6 4
776,173 11 7 4
9 12 8 4
4,711 12 7 5
9 13 9 4
588 13 7 6
25,590 13 8 5
1 14 10 4
2,935 14 8 6
17,414 14 9 5
471 15 9 6
11 15 8 7
4,934 15 10 5
98 16 9 7
520 16 10 6
65 17 10 7
64 17 11 6
4 18 11 7
...
12 159 98 61
1 160 98 62
1 161 100 61
3 169 120 49
1 172 122 50
1 173 126 47
4 181 129 52
1 182 132 50
2 185 134 51
1 186 134 52
3 197 143 54
1 199 138 61
1 221 174 47
1 709 705 4
1 736 732 4
1
100 rows selected.
There are a few takeaways from this:
- The number of logical reads and writes is unpredictable (but you don't see that in general, as they happen on shared buffers in memory).
- The majority of inserts had to read and write eleven blocks. Some will be in the local memory cache on a monolithic system but would involve network latency on a distributed system.
- A few outliers had to read hundreds of blocks (not only the index blocks but also the related rollback segments).
The reason is that inserting a new entry into a B-Tree:
- has to find the leaf block where the new value belongs, going from the root, through the branches, to the leaf
- when there is not enough space in the leaf, has to split it into two blocks and update the branch above; there may be no space in the branch either, and then the same has to be done for it (the sketch below shows how to observe those splits)
- generates undo information for those changes to allow consistent reads
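Oracle counts those splits in the session statistics, so this explanation can be checked directly. A minimal sketch, using statistic names from 19c:
-- block splits performed by the current session since logon
select name, value
  from v$mystat join v$statname using(statistic#)
 where name in ('leaf node splits','leaf node 90-10 splits','branch node splits')
 order by name;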
The general case can also be verified with a single-row insert:
DEMO@adb_tp> set autotrace on
Autotrace Enabled
Shows the execution plan as well as statistics of the statement.
DEMO@adb_tp> insert into demo (reads, writes)
/*+ gather_plan_statistics */
values(null,null)
;
1 row inserted.
PLAN_TABLE_OUTPUT
_____________________________________________________________________________________
SQL_ID 0djxp9h81cby7, child number 0
-------------------------------------
insert into demo (reads, writes) /*+
gather_plan_statistics */ values(null,null)
----------------------------------------------------------------------------------
| Id | Operation | Name | Starts | A-Rows | A-Time | Buffers |
----------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 0 |00:00:00.01 | 6 |
| 1 | LOAD TABLE CONVENTIONAL | DEMO | 1 | 0 |00:00:00.01 | 6 |
----------------------------------------------------------------------------------
...
Statistics
-----------------------------------------------------------
5 db block changes
6 db block gets
6 db block gets from cache
5 db block gets from cache (fastpath)
49152 logical read bytes from cache
3 redo entries
664 redo size
6 session logical reads
160 undo change vector size
48 KB were read to insert a 30-byte row into a table with no secondary indexes.
In addition to read and write amplification, B-Tree index blocks are usually between 50% and 100% full because they split when they are full, leaving on average at least 25% of the space wasted on disk and in memory.
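This can be checked on the demo index with INDEX_STATS. A sketch only, because ANALYZE ... VALIDATE STRUCTURE locks the table while it runs, so it is not something to do on a busy system:
-- populate INDEX_STATS for this index (one row, visible to the current session)
analyze index SYS_C0036334 validate structure;
select height, lf_blks, br_blks, pct_used from index_stats;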
LSM-Tree with YugabyteDB
I used the same approach on YugabyteDB, which uses PostgreSQL for the SQL processing and RocksDB for the storage:
create extension if not exists pgcrypto;
create table demo (
id uuid default gen_random_uuid() primary key
, ts timestamptz default clock_timestamp()
, reads int
, writes int
)
-- YugabyteDB specific: one tablet to make it simpler
split into 1 tablets
;
The cloud-native database exposes the storage metrics on a JSON endpoint on port 9000. I use a quick and dirty BASH and AWK script to get the metrics in a format that can be imported by COPY:
copy demo(reads,writes) from program $PROGRAM$
bash <<'BASH'
exec 5<>/dev/tcp/10.0.0.61/9000
awk '
/"table_name"/{n=0;t=$NF;gsub(/[",]/,"",t)}
t="demo" && /"name":/{n=$2;gsub(/[",]/,"",n)}
t="demo" && /"value": [^0]/{s[n]=s[n]+$NF}
END{printf "%s\t%s\n",s["rocksdb_number_db_seek"],s["rows_inserted"]
}
' <&5 &
printf "GET /metrics?metrics=rows_inserted,rocksdb_number_db HTTP/1.0\r\n\r\n" >&5
exit 0
BASH
$PROGRAM$;
\watch c=999999 0.001
YugabyteDB doesn't work on blocks but on read and write operations for table rows and index entries, which are replicated through the Raft log and stored as RocksDB key-values. Those operations are logical, like the block reads and writes I measured in Oracle. They may operate in memory, be batched when sent through the network, and be fast, but they are the best representation of the database work on data structures. The reads are the RocksDB seek operations that find a key in the LSM-Tree. The writes are the RocksDB keys inserted. I do not count the RocksDB next operations, which are the equivalent of reading another row in a block.
I used the same query to list the number of reads and writes required for each insert, and aggregate the count.
select count(*), reads+writes, reads, writes from (
select id
, reads-lag(reads)over(order by ts) reads
, writes-lag(writes)over(order by ts) writes
from demo
) as delta_values
group by reads, writes
order by reads+writes asc;
count | ?column? | reads | writes
--------+----------+-------+--------
994447 | 3 | 2 | 1
3728 | 47 | 2 | 45
1823 | 91 | 2 | 89
1 | | |
(4 rows)
There is no write amplification. The majority of inserts did only two reads and one write. The table row value, which is the same as the primary key index entry, is appended to the LSM-Tree. Less than one percent wrote more than one RocksDB key-value, and that may not even come from my session, as I've read the global statistics.
As I did with Oracle, I can run a single insert and look at the statistics from explain analyze and my ybwr snapshot:
yugabyte=# explain (analyze, dist, costs off, summary off)
insert into demo (reads, writes)
values(null,null)
;
QUERY PLAN
----------------------------------------------------------
Insert on demo (actual time=0.076..0.076 rows=0 loops=1)
-> Result (actual time=0.017..0.017 rows=1 loops=1)
Storage Table Write Requests: 1
(3 rows)
yugabyte=# execute snap_table;
I-seek | I-next | I-prev | R-seek | R-next | R-prev | insert | dbname relname tablet (L)eader node | (I)ntents (R)regular
--------+--------+--------+--------+--------+--------+--------+------------------------------------------------------------
4 | 4 | | | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] 10.0.0.62
4 | 4 | | | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] 10.0.0.63
6 | 4 | | 2 | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] L 10.0.0.61
This matches exactly what I've seen above, with 2 reads and 1 write, but let's explain all of it. I'm running a Replication Factor 3 database, and that's why you see 1 write (RocksDB insert) on each node. When gathering the statistics, I gathered only from the node where the Raft leader was (10.0.0.61:9000). That's also why I created a single-tablet table on a 3-node cluster: to make it easier. Those numbers are for the RegularDB, the final destination for rows.
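The number of tablets can be verified from SQL. A sketch, assuming the yb_table_properties() function available in recent YugabyteDB releases:
select num_tablets, num_hash_key_columns
  from yb_table_properties('demo'::regclass);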
I didn't add the IntentsDB (the transaction intents) when counting the reads because, for small transactions, they stay together in memory, and they include the background cleanup of provisional records after commit, which I didn't count for the B-Tree either.
I didn't count the writes on the replicas. YugabyteDB waits for one of them to acknowledge. With Oracle, you would have two standby databases applying the same db block changes from the redo stream, and you would wait for the commit to be acknowledged.
Space amplification in Oracle Heap Table + B-Tree
I check the size of the table and index:
DEMO@adb_tp> info demo
...
Indexes
INDEX_NAME UNIQUENESS STATUS FUNCIDX_STATUS COLUMNS
____________________ _____________ _________ _________________ __________
DEMO.SYS_C0036334 UNIQUE VALID ID
DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='DEMO';
DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
50M
DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='SYS_C0036334';
DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
29M
The heap table allocated 50MB and the B-Tree index 29MB.
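For comparison, the raw data volume can be estimated from the optimizer statistics, assuming they have been gathered on the table. A quick sketch:
select num_rows, avg_row_len
     , dbms_xplan.format_size(num_rows * avg_row_len) as estimated_row_data
  from user_tables
 where table_name = 'DEMO';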
Space amplification in YugabyteDB LSM-Tree
The whole table is stored in the primary key in YugabyteDB:
yugabyte=# select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
178 MB
(1 row)
This size includes the WAL but I didn't count the redo logs for Oracle. To compare, I check the size of the SST files from the console:
Total: 177.90M
Consensus Metadata: 1.5K
WAL Files: 128.01M
SST Files: 49.89M
SST Files Uncompressed: 88.91M
The inserted data takes 50MB on disk (the compressed SST files).
B-Tree read amplification with Oracle
Let's query a single non-existing key:
DEMO@adb_tp> set autotrace on
DEMO@adb_tp> select * from demo
where id=hextoraw('00000000000000000000000000000000')
;
no rows selected
PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________________
SQL_ID a139wcm6m7vu4, child number 0
-------------------------------------
select * from demo where id=hextoraw('00000000000000000000000000000000')
Plan hash value: 2095372258
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 0 |00:00:00.01 | 3 |
| 1 | TABLE ACCESS BY INDEX ROWID| DEMO | 1 | 1 | 0 |00:00:00.01 | 3 |
|* 2 | INDEX UNIQUE SCAN | SYS_C0036334 | 1 | 1 | 0 |00:00:00.01 | 3 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID"=HEXTORAW('00000000000000000000000000000000'))
Statistics
-----------------------------------------------------------
1 DB time
7733 RM usage
3 Requests to/from client
3 consistent gets
3 consistent gets examination
3 consistent gets examination (fastpath)
3 consistent gets from cache
12 non-idle wait count
2 opened cursors cumulative
1 opened cursors current
1 pinned cursors current
28 process last non-idle time
3 session logical reads
3 user calls
This has read three blocks: 1 B-Tree root + 1 branch + 1 leaf.
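That matches the index height, which Oracle exposes as BLEVEL (the number of branch levels above the leaves, so BLEVEL + 1 blocks are read by an INDEX UNIQUE SCAN). A quick sketch, assuming fresh statistics:
select index_name, blevel, leaf_blocks
  from user_indexes
 where table_name = 'DEMO';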
LSM-Tree read amplification with YugabyteDB
Let's do the same with YugabyteDB:
yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from demo where id='00000000-0000-0000-0000-000000000000';
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (actual time=0.539..0.539 rows=0 loops=1)
Index Cond: (id = '00000000-0000-0000-0000-000000000000'::uuid)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.447 ms
Metric rocksdb_block_cache_hit: 1.000
Metric rocksdb_block_cache_filter_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 65477.000
Metric rocksdb_bloom_filter_useful: 1.000
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_block_cache_multi_touch_hit: 1.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 65477.000
Metric ql_read_latency: sum: 46.000, count: 1.000
(12 rows)
This was only 1 seek into the LSM tree (rocksdb_number_db_seek: 1.000).
To summarize
This measures the reads and writes on the index when inserting a row into a SQL table. Different database architectures have different units:
- To find where to read or write, a B-Tree reads a few blocks, from the root, through branches, to the leaf. In an LSM-Tree, the equivalent is a seek operation to find the key in the SST files and MemTables.
- To read from that point, B-Tree blocks are pinned and variable-size rows are read. In an LSM-Tree, the equivalent is a next operation. With MVCC databases, this also involves reading some past versions. Those are in other blocks in Oracle (rollback segments) but are adjacent in YugabyteDB (within the next keys).
- To write, a B-Tree inserts the new entry into the block and may have to make more space by splitting it and updating the branches. In an LSM-Tree, writes are simply appended to the MemTable.
Except with intensive workloads, you will not see huge differences in response time because each database has been optimized for its architecture. Databases using B-Trees, like Oracle, keep most of the blocks involved in the local shared memory. Databases using LSM-Trees, like YugabyteDB, have enhanced RocksDB to avoid reading from all SST files (see Enhancing RocksDB for Speed & Scale).
For Distributed SQL, LSM-Tree goes well with Raft replication because synchronizing a log is simpler than synchronizing shared buffers.
LSM-Trees have other advantages, as they write immutable files, flushed from the MemTable or merged by compaction. This reduces the risk of block corruption, since the files are never updated in place, and it allows easy snapshots without additional flashback logs.
If you are not familiar with LSM-Tree, @denismagda explains it when ordering pizzas:
Top comments (9)
Hi Frank. Nice article, appreciated.
Though I have a couple of questions:
Thanks, and that's a great question.
Yes, YugabyteDB detects a duplicate key on insert, which is why the read is performed before the write. On non-unique indexes, there is no read at all.
I'm not sure why I have two reads rather than one here. I should trace the internals to be sure.
Currently, one read goes to the memtable and the SSTables ("and", not "and then") because one SQL key may have multiple versions and therefore multiple RocksDB keys.
However, there are bloom filters to skip some SSTables, as well as min/max information for each column.
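If you want to check it on your own cluster, here is a sketch (the secondary index name is just an example): create a non-unique index and look at the distributed read and write requests reported by explain (analyze, dist) for an insert.
create index demo_reads on demo (reads);
explain (analyze, dist, costs off, summary off)
 insert into demo (reads, writes) values (null, null);
-- check the "Storage Index Write Requests" line and whether any
-- index read requests appear for the secondary index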
Thanks for your reply, much clearer now.
So does it mean that in some situations (e.g. insert on a unique index) there's not much performance gain in using an LSM tree over a B-Tree because of that extra read?
I mean, when using a B-tree in PostgreSQL, there's a read of the B-tree leaf page and then an update of that page.
And in an LSM tree, there's also that extra read.
Yes, it is hard to compare, but with LSM-Trees I tend to create more secondary indexes (non-unique).
And something I just learned is that even when inserting into a non-unique index, YugabyteDB still has to read the key to detect a serialization conflict.
I think that extra read defeats the whole point of LSM-tree write performance compared to a B-Tree.
Yes, and the high rate of data ingest increases the number of SST files to read from.
For bulk load, YugabyteDB offers the possibility to disable transactional writes, and there's the full power of LSM Tree, but no rollback and no transaction isolation.
A trade-off as always.
Thanks a lot, Franck.
You are my number one blog and twitter author I follow)