Databases maintain indexes to find a value or a range of values without scanning the whole table. The index is physically organized and ordered by the values you search on, so that each value has a well-defined place where it can be found.
With a tree structure over the index entries, the database can find this place with a few random reads, keeping read amplification minimal.
There are two major index structures used in databases:
- B-Tree: The tree's small height, which increases logarithmically with the number of index entries, minimizes the read amplification required to find an index entry (with a few hundred entries per block, three levels are enough for millions of rows). However, maintaining this structure involves variable write amplification to keep it balanced.
- LSM-Tree: The write amplification is limited by a log structure that appends the new index entry in memory. Those entries are flushed and compacted in the background, typically resulting in a few sorted files in one or multiple levels. The read amplification comes from merging those sorted runs.
SQL databases perform frequent reading operations. Even a simple insert operation requires an index read to detect duplicate keys.
In the past, random reads were slow because they had to wait for the mechanical disk rotation (HDD). To achieve predictable performance on reads, databases designed at that time used B-trees. The higher levels of the B-tree could remain in cache, in shared memory buffers, and random reads were limited by the B-tree depth.
Today, with flash memory (SSD), random reads have low and predictable latency. It is now acceptable to do more reads to limit the write amplification.
Many modern databases use LSM trees, where writes are fast: they are appended to the first level, a memory table protected by a WAL (written sequentially). To limit memory usage, memtables are flushed to SST files, which must be compacted in the background to limit the read amplification. This is done by reading the LSM tree (random reads) and writing new SST files (sequential writes). LSM trees became popular, especially with RocksDB and derived databases, because they fit the performance profile of SSD storage, where random reads are fast but random writes should be avoided.
B-Tree write amplification with Oracle
Let's take an example to illustrate how writes to B-trees can be slow and unpredictable. I create the following table in Oracle 19c Autonomous Database:
create table demo (
id raw(16) default sys_guid() primary key
, ts timestamp default current_timestamp
, reads int
, writes int
);
I will insert one million rows into this table and count the number of reads (db block gets for the current version and consistent gets for the MVCC version) and writes (db block changes). Rather than inserting random rows and observing the statistics, I'm inserting the statistics themselves to analyze later.
Here is what I've run in SQL Developer, with a repeat command to insert one million rows that hold my session statistics about block reads and writes:
insert into demo (reads, writes)
select bg+cg , bc from (
select name,value from v$mystat
join v$statname using(statistic#)
) pivot (
sum(value) for name in ('db block changes' bc,'db block gets' bg
, 'consistent gets' cg
,'redo size' rs)
);
repeat 999999 0.001
commit;
Remember to commit before losing your transaction after the autonomous idle timeout, as SQL Developer disobeys auto-commit with repeat.
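To see what these counters look like before pivoting, the same views can be queried directly. This is just a sketch, using the same statistic names as the pivot above:
-- cumulative statistics for the current session
select name, value
  from v$mystat join v$statname using(statistic#)
 where name in ('db block changes','db block gets','consistent gets','redo size')
 order by name;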
I use this simple query to calculate the delta from the cumulative statistics, group on the number of reads and writes, and show the sum:
select count(*), reads+writes, reads, writes from (
select id
, reads-lag(reads)over(order by ts) reads
, writes-lag(writes)over(order by ts) writes
from demo
) group by reads, writes order by reads+writes asc
;
COUNT(*) READS+WRITES READS WRITES
___________ _______________ ________ _________
162,349 10 6 4
776,173 11 7 4
9 12 8 4
4,711 12 7 5
9 13 9 4
588 13 7 6
25,590 13 8 5
1 14 10 4
2,935 14 8 6
17,414 14 9 5
471 15 9 6
11 15 8 7
4,934 15 10 5
98 16 9 7
520 16 10 6
65 17 10 7
64 17 11 6
4 18 11 7
...
12 159 98 61
1 160 98 62
1 161 100 61
3 169 120 49
1 172 122 50
1 173 126 47
4 181 129 52
1 182 132 50
2 185 134 51
1 186 134 52
3 197 143 54
1 199 138 61
1 221 174 47
1 709 705 4
1 736 732 4
1
100 rows selected.
There are a few takeaways from this:
- The number of logical reads and writes is unpredictable (but you don't see that in general, as they happen on shared buffers in memory).
- The majority of inserts had to read and write eleven blocks. Some will be in the local memory cache on a monolithic system but would involve network latency on a distributed system.
- A few outliers had to read hundreds of blocks (not only the index blocks but also the related rollback segments).
The reason is that inserting a new entry into a B-Tree:
- has to find the leaf block where the new value belongs, going from the root, through the branches, to the leaf
- when there is not enough space in the leaf, has to split it into two blocks and update the branch above; there may be no space in the branch either, and then the same has to be done for it (the sketch below shows how to observe those splits)
- generates undo information for those changes to allow consistent reads
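Oracle counts those splits in the session statistics, so this explanation can be checked directly. A minimal sketch, using statistic names from 19c:
-- block splits performed by the current session since logon
select name, value
  from v$mystat join v$statname using(statistic#)
 where name in ('leaf node splits','leaf node 90-10 splits','branch node splits')
 order by name;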
The general case can also be verified with a single-row insert:
DEMO@adb_tp> set autotrace on
Autotrace Enabled
Shows the execution plan as well as statistics of the statement.
DEMO@adb_tp> insert into demo (reads, writes)
/*+ gather_plan_statistics */
values(null,null)
;
1 row inserted.
PLAN_TABLE_OUTPUT
_____________________________________________________________________________________
SQL_ID 0djxp9h81cby7, child number 0
-------------------------------------
insert into demo (reads, writes) /*+
gather_plan_statistics */ values(null,null)
----------------------------------------------------------------------------------
| Id | Operation | Name | Starts | A-Rows | A-Time | Buffers |
----------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 0 |00:00:00.01 | 6 |
| 1 | LOAD TABLE CONVENTIONAL | DEMO | 1 | 0 |00:00:00.01 | 6 |
----------------------------------------------------------------------------------
...
Statistics
-----------------------------------------------------------
5 db block changes
6 db block gets
6 db block gets from cache
5 db block gets from cache (fastpath)
49152 logical read bytes from cache
3 redo entries
664 redo size
6 session logical reads
160 undo change vector size
48 KB were read to insert a 30-byte row into a table with no secondary indexes.
In addition to read and write amplification, B-Tree index blocks are usually between 50% and 100% full because they split when they are full, leaving on average at least 25% of the space wasted on disk and in memory.
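This can be checked on the demo index with INDEX_STATS. A sketch only, because ANALYZE ... VALIDATE STRUCTURE locks the table while it runs, so it is not something to do on a busy system:
-- populate INDEX_STATS for this index (one row, visible to the current session)
analyze index SYS_C0036334 validate structure;
select height, lf_blks, br_blks, pct_used from index_stats;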
LSM-Tree with YugabyteDB
I used the same approach on YugabyteDB, which uses PostgreSQL for the SQL processing and RocksDB for the storage:
create extension if not exists pgcrypto;
create table demo (
id uuid default gen_random_uuid() primary key
, ts timestamptz default clock_timestamp()
, reads int
, writes int
)
-- YugabyteDB specific: one tablet to make it simpler
split into 1 tablets
;
The cloud-native database exposes the storage metrics on a JSON endpoint on port 9000. I use a quick and dirty BASH and AWK script to get the metrics in a format that can be imported by COPY:
copy demo(reads,writes) from program $PROGRAM$
bash <<'BASH'
exec 5<>/dev/tcp/10.0.0.61/9000
awk '
/"table_name"/{n=0;t=$NF;gsub(/[",]/,"",t)}
t="demo" && /"name":/{n=$2;gsub(/[",]/,"",n)}
t="demo" && /"value": [^0]/{s[n]=s[n]+$NF}
END{printf "%s\t%s\n",s["rocksdb_number_db_seek"],s["rows_inserted"]
}
' <&5 &
printf "GET /metrics?metrics=rows_inserted,rocksdb_number_db HTTP/1.0\r\n\r\n" >&5
exit 0
BASH
$PROGRAM$;
\watch c=999999 0.001
YugabyteDB doesn't work on blocks but on read and write operations for table rows and index entries, which are replicated through the Raft log and stored as RocksDB key-values. Those operations are logical, like the block reads and writes I measured in Oracle. They may operate in memory, be batched when sent through the network, and be fast, but they are the best representation of the database work on data structures. The reads are the RocksDB seek operations that find a key in the LSM-Tree. The writes are the RocksDB keys inserted. I do not count the RocksDB next operations, which are the equivalent of reading another row in a block.
I used the same query to list the number of reads and writes required for each insert, and aggregate the count.
select count(*), reads+writes, reads, writes from (
select id
, reads-lag(reads)over(order by ts) reads
, writes-lag(writes)over(order by ts) writes
from demo
) as delta_values
group by reads, writes
order by reads+writes asc;
count | ?column? | reads | writes
--------+----------+-------+--------
994447 | 3 | 2 | 1
3728 | 47 | 2 | 45
1823 | 91 | 2 | 89
1 | | |
(4 rows)
There is no write amplification. The majority of inserts did only two reads and one write. The table row value, which is the same as the primary key index entry, is appended to the LSM-Tree. Less than one percent wrote more than one RocksDB key-value, and that may not even come from my session, as I've read the global statistics.
As I did with Oracle, I can run a single insert and look at the statistics from explain analyze and my ybwr snapshot:
yugabyte=# explain (analyze, dist, costs off, summary off)
insert into demo (reads, writes)
values(null,null)
;
QUERY PLAN
----------------------------------------------------------
Insert on demo (actual time=0.076..0.076 rows=0 loops=1)
-> Result (actual time=0.017..0.017 rows=1 loops=1)
Storage Table Write Requests: 1
(3 rows)
yugabyte=# execute snap_table;
I-seek | I-next | I-prev | R-seek | R-next | R-prev | insert | dbname relname tablet (L)eader node | (I)ntents (R)regular
--------+--------+--------+--------+--------+--------+--------+------------------------------------------------------------
4 | 4 | | | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] 10.0.0.62
4 | 4 | | | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] 10.0.0.63
6 | 4 | | 2 | | | 1 | yugabyte demo hash_split: [0x0000, 0xFFFF] L 10.0.0.61
This matches exactly what I've seen above, with 2 reads and 1 write, but let's explain all of it. I'm running a Replication Factor 3 database, and that's why you see 1 write (RocksDB insert) on each node. When gathering the statistics, I gathered only from the node where the Raft leader was (10.0.0.61:9000). That's also why I created a single-tablet table on a 3-node cluster: to make it easier. Those numbers are for the RegularDB, the final destination for rows.
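The number of tablets can be verified from SQL. A sketch, assuming the yb_table_properties() function available in recent YugabyteDB releases:
select num_tablets, num_hash_key_columns
  from yb_table_properties('demo'::regclass);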
I didn't add the IntentsDB (the transaction intents) when counting the reads because, for small transactions, they stay together in memory, and they include the background cleanup of provisional records after commit, which I didn't count for the B-Tree either.
I didn't count the writes on the replicas. YugabyteDB waits for one of them to acknowledge. With Oracle, you would have two standby databases applying the same db block changes from the redo stream, and you would wait for the commit to be acknowledged.
Space amplification in Oracle Heap Table + B-Tree
I check the size of the table and index:
DEMO@adb_tp> info demo
...
Indexes
INDEX_NAME UNIQUENESS STATUS FUNCIDX_STATUS COLUMNS
____________________ _____________ _________ _________________ __________
DEMO.SYS_C0036334 UNIQUE VALID ID
DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='DEMO';
DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
50M
DEMO@adb_tp> select dbms_xplan.format_size(bytes) from dba_segments where segment_name='SYS_C0036334';
DBMS_XPLAN.FORMAT_SIZE(BYTES)
________________________________
29M
The heap table allocated 50MB and the B-Tree index 29MB.
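For comparison, the raw data volume can be estimated from the optimizer statistics, assuming they have been gathered on the table. A quick sketch:
select num_rows, avg_row_len
     , dbms_xplan.format_size(num_rows * avg_row_len) as estimated_row_data
  from user_tables
 where table_name = 'DEMO';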
Space amplification in YugabyteDB LSM-Tree
The whole table is stored in the primary key in YugabyteDB:
yugabyte=# select pg_size_pretty(pg_table_size('demo'));
pg_size_pretty
----------------
178 MB
(1 row)
This size includes the WAL but I didn't count the redo logs for Oracle. To compare, I check the size of the SST files from the console:
Total: 177.90M
Consensus Metadata: 1.5K
WAL Files: 128.01M
SST Files: 49.89M
SST Files Uncompressed: 88.91M
The inserted data takes 50MB on disk (the compressed SST files).
B-Tree read amplification with Oracle
Let's query a single non-existing key:
DEMO@adb_tp> set autotrace on
DEMO@adb_tp> select * from demo
where id=hextoraw('00000000000000000000000000000000')
;
no rows selected
PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________________
SQL_ID a139wcm6m7vu4, child number 0
-------------------------------------
select * from demo where id=hextoraw('00000000000000000000000000000000')
Plan hash value: 2095372258
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 0 |00:00:00.01 | 3 |
| 1 | TABLE ACCESS BY INDEX ROWID| DEMO | 1 | 1 | 0 |00:00:00.01 | 3 |
|* 2 | INDEX UNIQUE SCAN | SYS_C0036334 | 1 | 1 | 0 |00:00:00.01 | 3 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ID"=HEXTORAW('00000000000000000000000000000000'))
Statistics
-----------------------------------------------------------
1 DB time
7733 RM usage
3 Requests to/from client
3 consistent gets
3 consistent gets examination
3 consistent gets examination (fastpath)
3 consistent gets from cache
12 non-idle wait count
2 opened cursors cumulative
1 opened cursors current
1 pinned cursors current
28 process last non-idle time
3 session logical reads
3 user calls
This has read three blocks: 1 B-Tree root + 1 branch + 1 leaf.
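That matches the index height, which Oracle exposes as BLEVEL (the number of branch levels above the leaves, so BLEVEL + 1 blocks are read by an INDEX UNIQUE SCAN). A quick sketch, assuming fresh statistics:
select index_name, blevel, leaf_blocks
  from user_indexes
 where table_name = 'DEMO';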
LSM-Tree read amplification with YugabyteDB
Let's do the same with YugabyteDB:
yugabyte=# explain (analyze, dist, debug, costs off, summary off) select * from demo where id='00000000-0000-0000-0000-000000000000';
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (actual time=0.539..0.539 rows=0 loops=1)
Index Cond: (id = '00000000-0000-0000-0000-000000000000'::uuid)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.447 ms
Metric rocksdb_block_cache_hit: 1.000
Metric rocksdb_block_cache_filter_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 65477.000
Metric rocksdb_bloom_filter_useful: 1.000
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_block_cache_multi_touch_hit: 1.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 65477.000
Metric ql_read_latency: sum: 46.000, count: 1.000
(12 rows)
This was only 1 seek into the LSM tree (rocksdb_number_db_seek: 1.000).
To summarize
This measures the reads and writes on the index when inserting a row into a SQL table. Different database architectures have different units:
- To find where to read or write, a B-Tree reads a few blocks, from the root, through branches, to the leaf. In an LSM-Tree, the equivalent is a seek operation to find the key in the SST files and MemTables.
- To read from that point, B-Tree blocks are pinned and variable-size rows are read. In an LSM-Tree, the equivalent is a next operation. With MVCC databases, this also involves reading some past versions. Those are in other blocks in Oracle (rollback segments) but are adjacent in YugabyteDB (within the next keys).
- To write, a B-Tree inserts the new entry into the block and may have to make more space by splitting it and updating the branches. In an LSM-Tree, writes are simply appended to the MemTable.
Except with intensive workloads, you will not see huge differences in response time because each database has been optimized for its architecture. Databases using B-Trees, like Oracle, keep most of the blocks involved in the local shared memory. Databases using LSM-Trees, like YugabyteDB, have enhanced RocksDB to avoid reading from all SST files (see Enhancing RocksDB for Speed & Scale).
For Distributed SQL, LSM-Tree goes well with Raft replication because synchronizing a log is simpler than synchronizing shared buffers.
LSM-Trees have other advantages, as they write immutable files, flushed from the MemTable or merged by compaction. This reduces the risk of block corruption, since the files are never updated in place, and it allows easy snapshots without additional flashback logs.
If you are not familiar with LSM-Tree, @denismagda explains it when ordering pizzas:
Top comments (9)
Hi Frank. Nice article, appreciated.
Though I have a couple of questions:
Thanks, and that's a great question.
Yes, YugabyteDB detects a duplicate key on insert, which is why the read is performed before the write. On non-unique indexes, there is no read at all.
I'm not sure why I have two reads rather than one here. I should trace the internals to be sure.
Currently, one read goes to the memtable and the SSTables ("and", not "and then") because one SQL key may have multiple versions and therefore multiple RocksDB keys.
However, there are bloom filters to skip some SSTables, as well as min/max information for each column.
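If you want to check it on your own cluster, here is a sketch (the secondary index name is just an example): create a non-unique index and look at the distributed read and write requests reported by explain (analyze, dist) for an insert.
create index demo_reads on demo (reads);
explain (analyze, dist, costs off, summary off)
 insert into demo (reads, writes) values (null, null);
-- check the "Storage Index Write Requests" line and whether any
-- index read requests appear for the secondary index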
Thanks for your reply, much clearer now.
So does it mean that in some situations (e.g. insert on a unique index) there's not much performance gain in using an LSM tree over a B-Tree because of that extra read?
I mean, when using a B-tree in PostgreSQL, there's a read of the B-tree leaf page and then an update of that page.
And in an LSM tree, there's also that extra read.
Yes, it is hard to compare, but with LSM-Trees I tend to create more secondary indexes (non-unique).
And something I just learned is that even when inserting into a non-unique index, YugabyteDB still has to read the key to detect a serialization conflict.
I think that extra read defeats the whole point of LSM-tree write performance compared to a B-Tree.
Yes, and the high rate of data ingest increases the number of SST files to read from.
For bulk load, YugabyteDB offers the possibility to disable transactional writes, and there's the full power of LSM Tree, but no rollback and no transaction isolation.
A trade-off as always.
Thanks a lot, Franck.
You are my number one blog and twitter author I follow)