Frits Hoogland for YugabyteDB

Log sync latency explained

With the YugabyteDB database, we guarantee that data changes, such as inserts, updates and deletes, are persisted. As with many databases and other programs that have this requirement, we do this by writing the changes to a file and calling fsync().
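
To make that concrete, here is a minimal sketch (in C, not YugabyteDB's actual code) of that buffered-write-plus-fsync pattern; the file name wal.log is just a placeholder:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* "wal.log" is just an example file name, not a YugabyteDB path. */
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); exit(1); }
    const char record[] = "a data change\n";
    /* write() only copies the bytes into the kernel page cache. */
    if (write(fd, record, sizeof(record) - 1) == -1) { perror("write"); exit(1); }
    /* Only a successful fsync() guarantees the bytes are on persistent media. */
    if (fsync(fd) == -1) { perror("fsync"); exit(1); }
    close(fd);
    return 0;
}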

However, there is something I have wondered about for a long time: if I measure the time spent in the fsync() call when doing batch-oriented work with YugabyteDB, the time spent waiting is often 20 ms or more.

That looks high. And you do have to wait for fsync(), because only when it finishes successfully is the data guaranteed to be stored on persistent media.

Let's follow the work that is done down into the Linux kernel, and see if we can find a reason!

The first step is to see what exactly we do in YugabyteDB. Tracing the normal write() system calls is problematic, because the Yugabyte database is a threaded application and uses write() in many places, which would produce lots and lots of trace data.

But for writing to the YugabyteDB WAL files, we use the writev() (vector write) system call. writev() allows multiple buffers to be submitted in a single synchronous call. Below we see vector writes submitting two buffers each.
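
As an illustration of what such a call looks like, here is a hedged sketch of a two-buffer vector write followed by an fsync(): a small fixed-size header plus a larger payload, like the pairs visible in the strace output further down. The header layout and file name are made up for the example; this is not YugabyteDB's WAL code.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); exit(1); }
    char header[12] = {0};                      /* e.g. a length and checksum */
    char payload[] = "serialized WAL entry payload";
    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = sizeof(header) },
        { .iov_base = payload, .iov_len = sizeof(payload) - 1 },
    };
    /* Both buffers go to the file in one system call; the return value is
     * the total number of bytes written, as shown by strace. */
    ssize_t written = writev(fd, iov, 2);
    if (written == -1) { perror("writev"); exit(1); }
    /* The data is only guaranteed durable after fsync() returns success. */
    if (fsync(fd) == -1) { perror("fsync"); exit(1); }
    close(fd);
    return 0;
}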

So to see the WAL writing and fsync activity, we can use the strace utility and show only the writev() and fsync() system calls. This is what that looks like:

$ strace -fp $(pgrep tserver) -e writev,fsync
...
[pid  1995] writev(121, [{iov_base="\20\0\0\0\235+\346\235:\0\222\357", iov_len=12}, {iov_base="\22\5\10\2\20\226U\30\260\261\277\311\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\320v\332\237\244)\304\177", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\227U\21\0\2006\373\232\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
[pid  1995] writev(121, [{iov_base="\20\0\0\0\337\21N\212\315\357C\300", iov_len=12}, {iov_base="\22\5\10\2\20\227U\30\225\231\350\341\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\327]?\33\377\36\3446", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\230U\21\0\320I\10\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
[pid  1995] writev(121, [{iov_base="\20\0\0\0\221\f\260\351\367h_\350", iov_len=12}, {iov_base="\22\5\10\2\20\230U\30\251\245\371\375\242\377\2042", iov_len=16}], 2) = 28
[pid  1995] writev(121, [{iov_base="g\314\30\0\26\234\252W\304g\262n", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\231U\21\0p\7\24\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid  1995] fsync(121)                  = 0
...

So, in general (not strictly!), during a batch operation such as a batch insert, there are two writev() calls: one small one of about 28 bytes and one big one of 1625203 bytes (the number after the equals sign is the total size of the writev() request).

This is where things start to get fuzzy, at least for me. The big write seems to be 1625203 bytes, which is a little more than 1.5 MB. The YugabyteDB threshold for performing an fsync is 1 second or 1 MB, yet this is approximately 50% more than that?

But still, this is reasonably well-known territory, right? We write data buffered and then perform fsync(); there is no reason that should take 20 ms... or is there?

The next layer to look at is the block IO layer, or "bio" layer. To look at this layer, there are some brilliant tools in the bcc tools package, which use Linux eBPF.

The tool to use here is biosnoop: a tool to 'snoop' on the BIO layer.

This is what that looks like for the same kind of requests that we looked at with strace:

TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)

15.492545   append [worker 1668   nvme1n1 W 107830376  4096      0.69
15.493980   append [worker 1668   nvme1n1 W 107830384  262144    2.08
15.495007   append [worker 1668   nvme1n1 W 107830896  262144    3.08
15.495918   append [worker 1668   nvme1n1 W 107831408  262144    3.95
15.497194   append [worker 1668   nvme1n1 W 107831920  262144    5.19
15.497705   append [worker 1668   nvme1n1 W 107832432  262144    5.67
15.499456   append [worker 1668   nvme1n1 W 107832944  262144    7.39
15.500215   append [worker 1668   nvme1n1 W 107833456  53248     8.14
15.507690   append [worker 1668   nvme1n1 W 104958050  1024      0.77
15.512486   append [worker 1668   nvme1n1 W 107833552  4096      0.59
15.514076   append [worker 1668   nvme1n1 W 107833560  262144    2.14
15.515213   append [worker 1668   nvme1n1 W 107834072  262144    3.24
15.515951   append [worker 1668   nvme1n1 W 107834584  262144    3.95
15.516999   append [worker 1668   nvme1n1 W 107835608  200704    4.90
15.517053   append [worker 1668   nvme1n1 W 107835096  262144    5.01
15.517995   append [worker 1668   nvme1n1 W 107836512  110592    5.85
15.518299   append [worker 1668   nvme1n1 W 107836000  262144    6.17
15.529624   append [worker 1668   nvme1n1 W 104958052  4096      0.62

This snippet from biosnoop shows the requests as the block IO layer sees them. Above we see the result of two rounds of the writev()/writev()/fsync() combination.

But there are many more requests here! If we look carefully, we see that the first request is 4096 bytes. That is probably the first, small writev(): bigger than the 28 bytes that were written, because the page cache is flushed in whole 4 kB pages.

But the important things to see are the requests that follow it, each with a size of 262144 bytes, alias 256 kB.

Each of the bigger 256 kB requests shows a latency of approximately 3 ms or higher. This is the average IO latency that I recognise, but you can also clearly see here that smaller requests get better latency, and thus bigger ones get worse latency.

But where does the 262144 / 256 kB come from?

The answer to that comes from the Linux kernel files that are part of the disk subsystem and drivers in /sys/block/<blockdevice>/queue. This is a meta-directory, containing information and settings about the block device.

The thing to look for is max_sectors_kb:

# cat max_sectors_kb
256

This means the maximum IO size is 256 kB! So the kernel broke the large request down into the biggest size it is set to handle, which here is 256 kB.

Let's see if we can increase the IO size: the maximum allowed size is specified in max_hw_sectors_kb:

# cat max_hw_sectors_kb
256

Nope. We can't increase the IO size; it's already at its maximum. :-(

In case you're wondering what I am running on: this is an Amazon EC2 c5 instance.
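
For completeness, the same queue limits can also be read programmatically instead of with cat; a small sketch, assuming the device name nvme1n1 from the biosnoop output above:

#include <stdio.h>

/* Read a single numeric attribute from /sys/block/<device>/queue/. */
static long read_queue_value(const char *device, const char *attribute)
{
    char path[256];
    long value = -1;
    snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", device, attribute);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

int main(void)
{
    printf("max_sectors_kb    = %ld\n", read_queue_value("nvme1n1", "max_sectors_kb"));
    printf("max_hw_sectors_kb = %ld\n", read_queue_value("nvme1n1", "max_hw_sectors_kb"));
    return 0;
}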

So, going back to the requests that we measured using biosnoop: these come from the large writev() request, but they are broken down into 256 kB chunks because of the maximum IO size, and then submitted to the block device.

Each of these requests takes approximately 3 ms or more, and there seems to be a tendency for successive large IO requests to get increasing latency.

Overall, this explains the fsync() latency that we see: the roughly 1.6 MB of buffered WAL data turns into about seven block IO requests of up to 256 kB each (1625203 bytes / 262144 bytes ≈ 6.2), which the fsync() call has to dispatch and wait for, and which together add up to the 20 ms or more that we see with batches, when we fsync after (more than) 1 MB.

Top comments (1)

Denis Magda

Excellent article, feels like I’m reading a detective story.

I bet you also ran the test on your machine (not AWS). How much did the batch latency improve after you increased the maximum allowed IO size?