In YugabyteDB, after the write operations for table rows (or index entries) are distributed to tablets (by hash or range of their key) and replicated to tablet peers (Raft leaders and followers), they are stored into a LSM-Tree (based on RocksDB code).
This is a Log-Structured Merge Tree which means that:
- Log-Structure: writes are appended as new versions, as a log, and are never updated in-place like in a B-Tree does
- Tree: this will be written to different levels. The first one being in memory (MemTable) which is the fastest to ensure persistence (it is protected by Write-Ahead Logging). This is then flushed to SST Files (Sorted Structure Tables) that are immutable.
- Merge: as those files are continuously growing in number, they are merged together by compaction into a new file, and the input ones can be removed.
This is fast for writes but there's some housekeeping needed, and that's the job of compaction. First, even if rows are continuously inserted and deleted, the size of the database will increase with the log of changes. the compaction will remove the deleted entries and the intermediate versions once they are not needed (i.e after
timestamp_history_retention_interval_sec) like a garbage collection. This is measured as Size Amplification: the actual size of files compared to the size of data if all was compacted.
Reads are fast in one Memtable or SST file because they are sorted structures, but a value can be in any of them. When there are multiple SST Files, some optimizations (bloom filters) help to reduce the number of files to seek into, but there's sill more work to do than needed. This is measured as Read Amplification and is the reason to merge multiple small files into a larger one.
The Merge and Compaction are actually the same process. If the insert and delete operations are in different files, because they happened between flushes of the MemTable, they cannot be removed before those files are merged, so that they are in the same sorted structure. The reason is that each SST file can cover the whole range of keys.
In RocksDB there are additional levels where SST Files are covering a smaller range of keys because keeping all at Level 0 (the first on-disk level after MemTable flushes) in a large database would not be efficient as a full compaction would have to write terabytes of data. In YugabyteDB the size of the LSM-Tree is limited thanks to automatic sharding and tablet splitting and all stays at Level 0.
The compaction used by YugabyteDB is based on the RocksDB Universal Compaction
In this series of blog posts, I'll give more details and also how to test it in a lab. Before starting, you may watch this great summary by John Meehan and Kannan Muthukkaruppan during the Yugabyte Friday Tech Talks:
- Size Amplification algorithm compares the size of all SST Files (which includes all intermediate changes) to the oldest one (which is supposed to be the target size when merging all changes). This reduces the size of the database and the number of SST Files.
- Read Amplification compares the small recent files to merge them as long as they are of similar size. This reduces the number of SST Files without waiting for the Size Amplification to do it.
You can also follow the Yugabyte blog where John will explain the compaction in YugabyteDB.