This article will be fairly different from my previous ones, for two reasons:
- We will discuss a rather specific & low-level topic
- I'm still in the middle of researching the details, so any input is welcome and please correct me if I'm wrong somewhere.
As some of you may know, I have a thing for databases. In particular databases which offer ACID transactions:
- Atomic: Either all operations in a transaction are applied, or none. No intermediate (inconsistent) states are ever persisted.
- Consistent: A transaction takes the database from a consistent state into another consistent state. What that means in detail depends on the database paradigm, but usually it involves conformance of the data to some predefined schema.
- Isolated: While the transaction is active, you do not see any changes performed by other transactions. You are isolated as if you were the only user on the database.
Durable: When a transaction
commit()call returns, the data is written to disk and thus durable.
Lately, I am looking into the Atomic and Durable side of things. Let's take a step back and ignore for the moment what kind of database we are building. If we want to have atomicity and durability, it means that we eventually need to atomically update a file in the file system. This file can contain anything, from B-Trees to unstructured binary blobs of data to lists of elements - we do not care. All we want to do is to update it atomically and in a durable fashion.
It turns out: this is a lot easier said than done.
Let's do a quick recap where we are going:
We have a single file containing arbitrary binary data (let's call it the data file). To this file we want to apply atomic and durable updates.
For simplicity, we assume that only one process and only one thread will access it concurrently at any point in time (i.e. we use a global lock).
We must assert that any series of changes to the file is either fully applied or fully rejected. This is true even in case of sudden system power loss.
It is tempting to believe that atomically modifying a file is as simple as:
- Get a
Filepointer/handle/object from the file system API
- Open it (e.g.
- Change the contents
- Close it
- Release the file pointer/handle/object
In practice, the operating system (or rather: the file system provided by the operating system) will interfere with our plans. When you change the contents of a file, they are not actually written to disk. All modern file systems have a write-behind buffer. So if you modify the file contents, what actually happens is that the altered version of the file ends up in a buffer in main memory. Should a sudden power outage occur while this data has not been written to disk yet, the changes will be lost.
There is a command on pretty much any widely used operating system out there which forces the operating system to flush the write-behind buffer to disk. This command is named slightly differently in different operating systems, but for the sake of this article we will simply refer to it as
fsync() (short for "file synchronization"). The contract of
fsync() is that when a call to
fsync() returns, then all file modifications have been written to the actual disk. Therefore, after an
fsync(), power loss is no longer an issue.
So we perform our modifications and just add an
fsync() afterwards? If only it were that easy... Sadly,
fsync() is not atomic itself.
On the lower levels of the operating system, all files are split into pages. Usually, all pages have the same, fixed length. How long a single page happens to be differs from one OS to another. For reference, in Linux by default one page is 4KB. So if our data file has 3GB in size, that would be quite a lot of pages. Now consider that an incoming modification requires us to touch more than one page. We call
fsync(), and while
fsync() just finished flushing the first 20 pages to disk, the remaining 80 are still in-memory. Now the power supply is cut off. The result is that our data file is now corrupted: it has been subject to partial changes, and atomicity is violated.
Now let's consider one more thing (as if the outlook wasn't dark enough already): we are dealing with physical hardware at this point. Updating a single page might not be instantaneous. There will be a time frame (even though a very small one) where the hardware has just begun writing the contents of a memory page to disk, but hasn't completed the page yet. Power is cut off. We have a half-written page update on disk, and another scenario for corruption that we need to take care of.
The core idea for tackling the above mentioned issues is to have a Write-Ahead-Log (WAL). The WAL is essentially just a file that lives on disk, often directly next to our data file. The WAL file keeps track of all modifications that need to be applied to our data file. For example, in case of a key-value store, the WAL might look like this:
Transaction 1 start Insert: "Hello" -> "World" Insert: "Foo" -> "Bar" Transaction 1 end Transaction 2 start Delete: "Foo" Transaction 2 end
We buffer every transaction into the WAL file before we apply it to the data file. Periodically, we flush the WAL file contents and apply them on the data file. The big advantage here is that we know in advance what is going to change. We can copy all parts of our data file which will be affected by a transaction, and keep a backup of these pages. In case that anything goes wrong (e.g. power loss), at the next startup of the system we can roll back the transaction by replacing the (potentially) corrupted pages with our backup, and things are fine again.
Unfortunately this isn't quite the answer to atomic file updates. While we solved a lot of problems by applying a WAL approach, one thing still remains: the WAL is stored in a file. A file might become corrupted. A file change may be residing in the write-behind buffer of the OS when power is cut off, causing us to lose the data.
Here's where I am stuck at the moment.
fsync() is a rather expensive operation (i.e. it takes a comparably long time to complete), therefore we cannot afford to perform an
fsync() on the data file for every transaction. Luckily, we can batch them in the WAL file, and flush periodically. However, as the WAL is a file too, how do we ensure that our changes to the WAL file are actually written to disk? We have to call
fsync() on the WAL file too. It was quite shocking for me to discover that major databases out there (e.g. PostGreSQL) actually call
fsync() on the WAL file periodically. That means: even if you successfully called
commit on a transaction, your data may still be lost in case of power loss. This is clearly violating the Durability aspect, but it seems to be a trade-off between performance and safety which database engineers were forced to make.
Still, there are a lot of open questions. For example, as the WAL is only a file, how do you protect yourself against the corruption of the WAL itself? I've seen approaches utilizing checksums: if the checksum for a commit doesn't match, then the commit is discarded entirely because the relevant section of the WAL file may have been corrupted. Needless to say, this again violates the Durability principle, as we are discarding data that has been written by a successfully committed transaction.
I find any information regarding this topic to be extremely hard to come by. Maybe I'm using the wrong search terms. Any pointers to more research material in this direction, or any advice (in case we have database engineers around here) would be highly appreciated.