Frits Hoogland for YugabyteDB

Posted on Dec 14, 2021

The anatomy of a linux read call

#linux #performance #postgres #yugabyte

Reading data is a fundamental property of many applications, especially for databases, because their function is to manage data, and use a disk to persist data and read the stored data.

A lot of people would take the database's latency measurement, or the latency measured by a tool and take that as the fundamental IO speed. This blogpost is about what that figure means, and to help you in diagnose and think about that latency.

Know your IO: buffered and unbuffered/direct IO

A fundamental property that MUST be known is whether the read call is done buffered (using page cache of linux) or unbuffered (reading directly into the executiong process' address space). The IO call doesn't determine whether the IO is performed buffered or unbuffered, the file descriptor flags determine that.

The file descriptor flags are set with open() or openat() calls, and can also be changed with the fcntl() call inside the execute.

The file descriptor flags bitmap can be seen in the fdinfo field in /proc/[PID]/fdinfo/[file descriptor #]:

[root@yb-1 ~]# ls /proc/self/fdinfo
0  1  2  3
[root@yb-1 ~]# cat /proc/self/fdinfo/0
pos:    0
flags:  02002
mnt_id: 26

I haven't been able to find a simple explanation of the flags. There are two methods that I use to understand what these flags are. One is to trace the call with which the flag is opened with strace, which shows something like:

openat(AT_FDCWD, "/dev/sdb", O_RDONLY|O_DIRECT) = 3

The other one is to use the lsof utility with '+fg':

lsof +fg -p $(pgrep -x dd)
...
dd      32863 root    0r   BLK    DIR,LG   8,16 0x83b00000      1503 /dev/sdb
...

You can see the 'DIR' for 'O_DIRECT'.

Now that the methods to determine buffered or unbuffered IO are known: by default linux performs everything buffered. When a file is opened unbuffered alias with O_DIRECT it's an explicit choice of the application to do so.

This blogpost assumes buffered IO starting from here.

Linux layers

We established we're doing buffered IO. Now, how to understand the latency of a read call? There are several ways of measuring read latency, and some measurements take their data from different points in the operating system. Most operating systems are setup in layered way to keep different activities grouped and keep things organized.

A fundamental grouping is the separation of userspace, where an application is running, and kernelspace, where (obviously) the kernel is running, and the hardware/device layers.

#	Layer	Components
1	Userspace	user application
2	Userspace	GNU C library (glibc)
-	---	---
3	Kernelspace	System Call Interface
4	Kernelspace	Subsystems: virtual filesystem, memory management, process management
5	Kernelspace	Architecture Dependent Code, device drivers
-	---	---
6	Hardware	Physical devices

This is a random grouping that I created for the sake of the argument of measuring latency, you can probably create many more or different of such tables.

The table is meant to be read from the top to the bottom, so from userspace to the hardware layer, which is typically how a call might might flow.

Not all calls flow to the bottom. If code is performing work on data in already allocated memory, it will not go further than layer #2. This work is done in the user space layer. Typically, all CPU time spent there is recorded as 'user' CPU time.

Anytime when during processing there is something needed from outside the current user space, such as getting some more memory or freeing memory, reading something from a system device, such as from a network stack, but also from a disk, the process needs to call the system, which is done via a system call. This includes performing a disk IO.

The system call performs a mode switch to kernel mode. In kernel mode devices can be accessed. All CPU time spent in kernel mode is recorded as 'system' CPU.

Measuring IO

When you look at IO figures and latencies, you have understand two things. The first thing to understand is whether the measurement or measurements you are looking at is an average or an absolute measurement. Averages hide peaks, and the more time the measurement is taken over, the more averaged the figure is.

The other important thing is the above described layer the measurement comes from.

iostat, sar, node exporter or anything else that takes figures from /proc come from level 5.
bcc (ebpf) tools xfsdist and xfsslower come from from level 3.
bcc (ebpf) tools biosnoop, biolatency, biotop comes from level 5.
perf trace, strace come from level 3.
application figures come from level 1.

Measurements at different layers do include different amounts of work and thus might yield different figures. Also some tools can add (significant) latency (!!), such as strace and perf. Be extremely cautious in production situations!

I hope if you made it this far that these layers can have significance. In a lot of situations, when you encounter high latencies using figures obtained from for example application measured IO times, it simply means your disk device encountered that latency, and the latency figures are roughly equal over the the different layers.

How does a latency figure look like

But let's first look at latency figures. I said you have to know if a figure is an average or an absolute measurement. The reason for that is that IO latencies are not/never a single figure. They fluctuate.

This might come as a shock to some people, because lots (and I mean: LOTS) of people talk about IO latency stating a single figure. IO latency fluctuates, even with modern NVMe SSD, because there are lots and lots of things happening at the same on a modern computer system, which can and sometimes will influence your latency.

Also storage latency in modern clouds fluctuates, and you should/have to assess whether the fluctuations are acceptable. Some clouds have outliers that are quite high, but because of the low frequency of them still have an acceptable average latency.

If the fluctuations that I just described surprises you, and you thought that because you are running in a professional cloud means this is all consistent, please perform a fio benchmark on your cloud to find out. I know I was very surprised to see HDD era latencies. I am not kidding here!

Know your environment

In most cases there is no need to perform a deep dive into the kernel internals, and your physical IO latency simply was high. This can reasonably easy be established by trying to link latencies from the application and tools that obtained them from layer 3, and comparing them with latencies obtained from layer 5.

In a lot of cases that I seen there is a lacking understanding of disk bandwidth. The first observation is that people have a hard time understanding or appropriately reporting IOPS (operations per second) and MBPS (megabyte per second). My advise here is: do not fight the figures (perform measurements to see if these can be reached), but work with the figures (carefully use the figures to see how well your application performs IO).

The second observation is that I see a lot of people ignoring the fact that with a little bit of work you can understand the IO limits of your system. Please put in the effort to go to the cloud vendor's documentation and work out what IO limits you actually got. Without them you are flying blind.

In all cases make your environment, which in the cloud is CPU and IO, predictable and consistent. Do not use bursting. Any assessment done with any form of bursting can and should be moved to the bin because that alters the limits during your run. Essentialy, bursting makes you measure two different systems to which the bursting made your virtual machine switch to.

When IO limits are hit, both for IOPS and MBPS, in the cloud and outside of the cloud with local devices, requests will queue. The queueing meaning the request has to wait before it gets served, and therefore appears to the application layer as increased latency. If you want to know more about this, please read up on https://en.wikipedia.org/wiki/Queueing_theory.

Finally diving into the read call

When you encounter an issue and it provides inconsistent figures between userspace (level 1 or 3) and kernel device level (level 5), you might want to look deeper into the internals of a read call. This is rare!

The call I am diving into is pread64(). The pread64() call is what Postgresql is using on linux to read data into its addressing space, as well as the YugabyteDB.

Most linux tools will do a read() call, so I prepared a small c program that performs a pread64 call:

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main() {
  int fd, sz;
  char *c = (char *) calloc(100, sizeof(char));

  fd = open("tt", O_RDONLY);
  if (fd < 0) { perror("tt"); exit(1); }

  sz = pread(fd, c, 10, 0);
  printf("called pread(% d, c, 10). returned that"
        " %d bytes were read.\n", fd, sz);
  c[sz] = '\0';
  printf("Those bytes are as follows: % s\n", c);
}

Save it in a file called 'pread.c'. You compile the program in the following way:

cc pread.c -o pread

And prepare the file it is going to read:

echo "x" > tt

You can validate it's doing pread64() call by running the program with strace:

# strace -e pread64 ./pread
pread64(3, "x\n", 10, 0)                = 2
called pread( 3, c, 10). returned that 2 bytes were read.
Those bytes are as follows: x

+++ exited with 0 +++

Now let's perform the actual deep dive. Using ftrace, it's possible obtain the exact functions executed inside the kernel.

ftrace is using the tracefs filesystem, which normally is mounted at /sys/kernel/debug/tracing. Inside tracefs, perform the following tasks:

For security, turn tracing off (should be off obviously):

echo 0 > tracing_on

Set the current tracer to 'function_graph':

echo function_graph > current_tracer

Enable tracing for the pread64 system call:

echo ksys_pread64 > set_graph_function

Trace annotation: add the program executable name and pid in the trace (funcgraph-proc), and add the name of the function to the end too (funcgraph-tail).

echo funcgraph-tail > trace_options
echo funcgraph-proc > trace_options

Turn tracing on:

echo 1 > tracing_on

Now go the other session and execute pread:

./pread
called pread( 3, c, 10). returned that 2 bytes were read.
Those bytes are as follows: x

Turn tracing off:

echo 0 > tracing_on

And look in the trace:

 1)  <...>-40173   |               |  ksys_pread64() {
 1)  <...>-40173   |               |    __fdget() {
 1)  <...>-40173   |   0.076 us    |      __fget_light();
 1)  <...>-40173   |   0.768 us    |    } /* __fdget */
 1)  <...>-40173   |               |    vfs_read() {
 1)  <...>-40173   |               |      rw_verify_area() {
 1)  <...>-40173   |               |        security_file_permission() {
 1)  <...>-40173   |               |          selinux_file_permission() {
 1)  <...>-40173   |               |            __inode_security_revalidate() {
 1)  <...>-40173   |               |              _cond_resched() {
 1)  <...>-40173   |   0.041 us    |                rcu_all_qs();
 1)  <...>-40173   |   0.362 us    |              } /* _cond_resched */
 1)  <...>-40173   |   0.696 us    |            } /* __inode_security_revalidate */
 1)  <...>-40173   |   0.042 us    |            avc_policy_seqno();
 1)  <...>-40173   |   1.390 us    |          } /* selinux_file_permission */
 1)  <...>-40173   |   0.042 us    |          bpf_lsm_file_permission();
 1)  <...>-40173   |   0.043 us    |          __fsnotify_parent();
 1)  <...>-40173   |   0.053 us    |          fsnotify();
 1)  <...>-40173   |   2.785 us    |        } /* security_file_permission */
 1)  <...>-40173   |   3.122 us    |      } /* rw_verify_area */
 1)  <...>-40173   |               |      __vfs_read() {
 1)  <...>-40173   |               |        new_sync_read() {
 1)  <...>-40173   |               |          xfs_file_read_iter [xfs]() {
 1)  <...>-40173   |               |            xfs_file_buffered_aio_read [xfs]() {
 1)  <...>-40173   |               |              xfs_ilock [xfs]() {
 1)  <...>-40173   |               |                down_read() {
 1)  <...>-40173   |               |                  _cond_resched() {
 1)  <...>-40173   |   0.040 us    |                    rcu_all_qs();
 1)  <...>-40173   |   0.362 us    |                  } /* _cond_resched */
 1)  <...>-40173   |   0.686 us    |                } /* down_read */
 1)  <...>-40173   |   1.042 us    |              } /* xfs_ilock [xfs] */
 1)  <...>-40173   |               |              generic_file_read_iter() {
 1)  <...>-40173   |               |                generic_file_buffered_read() {
 1)  <...>-40173   |               |                  _cond_resched() {
 1)  <...>-40173   |   0.041 us    |                    rcu_all_qs();
 1)  <...>-40173   |   0.383 us    |                  } /* _cond_resched */
 1)  <...>-40173   |               |                  pagecache_get_page() {
 1)  <...>-40173   |               |                    find_get_entry() {
 1)  <...>-40173   |   0.041 us    |                      PageHuge();
 1)  <...>-40173   |   1.151 us    |                    } /* find_get_entry */
 1)  <...>-40173   |   1.487 us    |                  } /* pagecache_get_page */
 1)  <...>-40173   |   0.042 us    |                  mark_page_accessed();
 1)  <...>-40173   |               |                  _cond_resched() {
 1)  <...>-40173   |   0.042 us    |                    rcu_all_qs();
 1)  <...>-40173   |   0.364 us    |                  } /* _cond_resched */
 1)  <...>-40173   |               |                  _cond_resched() {
 1)  <...>-40173   |   0.042 us    |                    rcu_all_qs();
 1)  <...>-40173   |   0.359 us    |                  } /* _cond_resched */
 1)  <...>-40173   |               |                  pagecache_get_page() {
 1)  <...>-40173   |               |                    find_get_entry() {
 1)  <...>-40173   |   0.041 us    |                      PageHuge();
 1)  <...>-40173   |   0.382 us    |                    } /* find_get_entry */
 1)  <...>-40173   |   0.708 us    |                  } /* pagecache_get_page */
 1)  <...>-40173   |               |                  touch_atime() {
 1)  <...>-40173   |               |                    atime_needs_update() {
 1)  <...>-40173   |               |                      current_time() {
 1)  <...>-40173   |   0.051 us    |                        ktime_get_coarse_real_ts64();
 1)  <...>-40173   |   0.044 us    |                        timestamp_truncate();
 1)  <...>-40173   |   0.707 us    |                      } /* current_time */
 1)  <...>-40173   |   1.076 us    |                    } /* atime_needs_update */
 1)  <...>-40173   |   1.399 us    |                  } /* touch_atime */
 1)  <...>-40173   |   7.264 us    |                } /* generic_file_buffered_read */
 1)  <...>-40173   |   7.627 us    |              } /* generic_file_read_iter */
 1)  <...>-40173   |               |              xfs_iunlock [xfs]() {
 1)  <...>-40173   |   0.041 us    |                up_read();
 1)  <...>-40173   |   0.389 us    |              } /* xfs_iunlock [xfs] */
 1)  <...>-40173   | + 10.019 us   |            } /* xfs_file_buffered_aio_read [xfs] */
 1)  <...>-40173   | + 10.454 us   |          } /* xfs_file_read_iter [xfs] */
 1)  <...>-40173   | + 10.863 us   |        } /* new_sync_read */
 1)  <...>-40173   | + 11.227 us   |      } /* __vfs_read */
 1)  <...>-40173   |   0.045 us    |      __fsnotify_parent();
 1)  <...>-40173   |   0.048 us    |      fsnotify();
 1)  <...>-40173   | + 15.714 us   |    } /* vfs_read */
 1)  <...>-40173   | + 17.359 us   |  } /* ksys_pread64 */

Depending on the amount of concurrency, you might need to find the correct pread64 call.

What do we see? At the bottom of the trace we see this pread64 that the this call took 17 microseconds. That is very fast. How can this be so fast?

There already is much going on however, these are some important functions:

security_file_permission: security related functions, normally no latency issues.
xfs_ilock: the inode is locked.
pagecache_get_page: lookup the page in the page cache.
xfs_iulock: the inode is unlocked.

That's right, there was no physical IO in this case! That also explains the low latency! Obviously, for this to happen the requested pages must be in the cache already, if they are not in the cache, they must be read from disk.

Join me for the next post about the anatomy of a read call that includes a physical read.

Top comments (2)

therealkevinc • Dec 15 '21 • Edited

17us for 2 bytes from a page cache buffer is a loooooooong time. I know you'll explain this though :)

Frits Hoogland • Dec 15 '21

Hi Kevin! Good to 'see' you!

There is not too much to explain here. For the sake of convenience I am doing these kinds of tests on a small virtual machine on my laptop, so the virtual machine is bound to the resources on the host. This indeed means that latencies in that scale might be different from running on an enterprise configuration.

The intention of this article is to show that there is an option to look into the kernel activity of doing a read call. Next week I'll be showing an actual physical read (for which I put the trace at the end of the blog, it's too long to have it in the text :-)), and explain the steps it's going through.