Frits Hoogland for YugabyteDB

The mirage of memory

Modern operating systems like Linux are very sophisticated. That sophistication means that things are sometimes different from what the system leads you to believe. Memory usage is one of those things.

Linux, like most modern operating systems, uses virtual memory. That means every process has its own address space, which is completely private to the process and its threads, if it is using threads.

Getting Linux memory statistics

Linux keeps track of the allocations for every memory segment in use, which is kind of obvious of course, but it also exposes them in the proc meta-filesystem in a file called 'maps'. These are the allocations of 'cat' when I use it to view its own maps file:

$ cat /proc/self/maps
55e0495ab000-55e0495b3000 r-xp 00000000 fd:00 401342                     /usr/bin/cat
55e0497b2000-55e0497b3000 r--p 00007000 fd:00 401342                     /usr/bin/cat
55e0497b3000-55e0497b4000 rw-p 00008000 fd:00 401342                     /usr/bin/cat
55e04a1c0000-55e04a1e1000 rw-p 00000000 00:00 0                          [heap]
7f23fbf4a000-7f23fc1c2000 r--p 00000000 fd:00 882                        /usr/lib/locale/en_US.utf8/LC_COLLATE
7f23fc1c2000-7f23fc37e000 r-xp 00000000 fd:00 33564883                   /usr/lib64/libc-2.28.so
7f23fc37e000-7f23fc57d000 ---p 001bc000 fd:00 33564883                   /usr/lib64/libc-2.28.so
7f23fc57d000-7f23fc581000 r--p 001bb000 fd:00 33564883                   /usr/lib64/libc-2.28.so
7f23fc581000-7f23fc583000 rw-p 001bf000 fd:00 33564883                   /usr/lib64/libc-2.28.so
7f23fc583000-7f23fc587000 rw-p 00000000 00:00 0
7f23fc587000-7f23fc5b3000 r-xp 00000000 fd:00 33564876                   /usr/lib64/ld-2.28.so
7f23fc72b000-7f23fc74d000 rw-p 00000000 00:00 0
7f23fc74d000-7f23fc7a0000 r--p 00000000 fd:00 883                        /usr/lib/locale/en_US.utf8/LC_CTYPE
7f23fc7a0000-7f23fc7a1000 r--p 00000000 fd:00 886                        /usr/lib/locale/en_US.utf8/LC_NUMERIC
7f23fc7a1000-7f23fc7a2000 r--p 00000000 fd:00 33564847                   /usr/lib/locale/en_US.utf8/LC_TIME
7f23fc7a2000-7f23fc7a3000 r--p 00000000 fd:00 33564845                   /usr/lib/locale/en_US.utf8/LC_MONETARY
7f23fc7a3000-7f23fc7aa000 r--s 00000000 fd:00 67614780                   /usr/lib64/gconv/gconv-modules.cache
7f23fc7aa000-7f23fc7ac000 rw-p 00000000 00:00 0
7f23fc7ac000-7f23fc7ad000 r--p 00000000 fd:00 33564857                   /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7f23fc7ad000-7f23fc7ae000 r--p 00000000 fd:00 100664709                  /usr/lib/locale/en_US.utf8/LC_PAPER
7f23fc7ae000-7f23fc7af000 r--p 00000000 fd:00 885                        /usr/lib/locale/en_US.utf8/LC_NAME
7f23fc7af000-7f23fc7b0000 r--p 00000000 fd:00 33564842                   /usr/lib/locale/en_US.utf8/LC_ADDRESS
7f23fc7b0000-7f23fc7b1000 r--p 00000000 fd:00 33564846                   /usr/lib/locale/en_US.utf8/LC_TELEPHONE
7f23fc7b1000-7f23fc7b2000 r--p 00000000 fd:00 33564844                   /usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7f23fc7b2000-7f23fc7b3000 r--p 00000000 fd:00 33564843                   /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7f23fc7b3000-7f23fc7b4000 r--p 0002c000 fd:00 33564876                   /usr/lib64/ld-2.28.so
7f23fc7b4000-7f23fc7b6000 rw-p 0002d000 fd:00 33564876                   /usr/lib64/ld-2.28.so
7ffcb6792000-7ffcb67b3000 rw-p 00000000 00:00 0                          [stack]
7ffcb67df000-7ffcb67e3000 r--p 00000000 00:00 0                          [vvar]
7ffcb67e3000-7ffcb67e5000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

(For the older folks reading this: this is what the pmap utility gives you too, which was used many years back, when the proc meta-filesystem and maps didn't exist.)

As you can see, that is quite a lot of memory allocations. They consist of the executable, for which the text (the code), the read-only data and the read-write data segments are mapped; the heap; shared libraries, which consist of the same segments as an executable; locale data; and other allocations.
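To make the layout of a maps line concrete, here is a minimal Python sketch that splits each line into its fields and computes the size of the segment from the address range. The names Mapping and parse_maps_line are my own and purely illustrative, and the sketch assumes it runs on Linux.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str  # empty string for anonymous mappings

    @property
    def size_kb(self) -> int:
        # The size of a segment is simply end address minus start address.
        return (self.end - self.start) // 1024


def parse_maps_line(line: str) -> Mapping:
    # Fields: address range, permissions, offset, device, inode, optional path.
    fields = line.split(maxsplit=5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


if __name__ == "__main__":
    # Print the segments of this Python process itself.
    with open("/proc/self/maps") as maps:
        for line in maps:
            m = parse_maps_line(line)
            print(f"{m.size_kb:>10} kB {m.perms} {m.path or '[anonymous]'}")
```

Run against the first line of the cat output above, this gives a 32 kB segment with permissions 'r-xp' for /usr/bin/cat.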

What if I told you that the contents of these files aren't actually loaded into the address space, except for the first page, unless additional pages are actually used? That is one reason the operating system can start a process so quickly: it pages in ("loads") the bare minimum, and adds pages only when they are actually requested.

The size of the pages that are actually loaded has a name: 'RSS', which means 'resident set size'. But you shouldn't trust me, or anyone else who tells you things; there should be proof. On Linux, this proof can be found in the proc meta-filesystem in a file called 'smaps'. This is what it looks like:

$ head /proc/self/smaps
5557aad59000-5557aad63000 r-xp 00000000 fd:00 404248                     /usr/bin/head
Size:                 40 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  40 kB
Pss:                  40 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        40 kB
Private_Dirty:         0 kB

For every memory segment, smaps shows the memory details. The first segment of the head executable is the text segment, which is read-only ('r-xp': read, execute, private); the text segment is always read-only. The 'Size' statistic is the total (virtual) size, and is often referred to as 'virtual memory size' or VSZ. The important bit to spot here is Rss, which for the head text segment is also 40 kB: the segment is so small that all 10 of its 4 kB pages are paged in.

Now look at another file: /usr/lib/locale/en_US.utf8/LC_COLLATE. I am not sure how static this is, and thus whether it will be mapped on your machine (I am using Alma 8.5 x86_64), but it illustrates the point:

$ grep -A6 LC_COLLATE /proc/self/smaps
7f492a1ff000-7f492a477000 r--p 00000000 fd:00 882                        /usr/lib/locale/en_US.utf8/LC_COLLATE
Size:               2528 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  96 kB
Pss:                  93 kB
Shared_Clean:          4 kB

This file has a virtual size (VSZ) of 2528 kB, yet only 96 kB of it is actually resident (RSS).
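The same comparison can be made for a whole process by adding up the Size and Rss lines of every segment in smaps. The following is a small Python sketch of that idea; the function name size_and_rss_kb is hypothetical, and the code assumes the smaps layout shown above.

```python
import re
from typing import Tuple

# Matches lines such as "Rss:                  96 kB".
# "KernelPageSize:" is not matched, because the pattern is anchored at the start.
_FIELD = re.compile(r"^(Size|Rss):\s+(\d+)\s+kB")


def size_and_rss_kb(smaps_text: str) -> Tuple[int, int]:
    """Return (total Size, total Rss) in kB over all segments in smaps_text."""
    totals = {"Size": 0, "Rss": 0}
    for line in smaps_text.splitlines():
        match = _FIELD.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return totals["Size"], totals["Rss"]


if __name__ == "__main__":
    # Report on this Python process itself.
    with open("/proc/self/smaps") as smaps:
        size_kb, rss_kb = size_and_rss_kb(smaps.read())
    print(f"virtual size: {size_kb} kB, resident: {rss_kb} kB "
          f"({100 * rss_kb / size_kb:.1f}% paged in)")
```

The difference between the two totals is memory that has been mapped, but not (yet) paged in.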

PostgreSQL allocation

At this point you might wonder what this has to do with PostgreSQL. Well, a lot actually: when postgres is started, it is subject to the same lazy loading I just described. That includes the postgres buffer cache, which can be set to a significant amount of memory.

On my test postgres instance, the buffer cache is set to the default size (128MB). This is what that looks like in smaps:

# grep -A6 deleted /proc/884/smaps
7f82f73c9000-7f8300175000 rw-s 00000000 00:05 22953                      /dev/zero (deleted)
Size:             145072 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:               10596 kB
Pss:                7235 kB
Shared_Clean:          0 kB
--
7f8308509000-7f830850a000 rw-s 00000000 00:05 0                          /SYSV0052e2c1 (deleted)
Size:                  4 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB

These are the shared memory segments (recognisable by the fourth character of the access properties being 's' for shared instead of 'p' for private memory). The first shared memory segment is the buffer cache, which is slightly larger than the size set with the shared_buffers parameter.

As the Rss statistic shows, 10596 kB of the 145072 kB total size (VSZ) is resident, which means that after startup only approximately 7% is paged in.
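As an illustration of the above, this Python sketch picks the shared segments out of smaps by looking at the fourth permission character, and reports how much of each is resident. The name shared_segments is my own invention, and reading the smaps file of another process (such as postgres) normally requires sufficient privileges.

```python
import sys
from typing import Dict, List


def shared_segments(smaps_text: str) -> List[Dict[str, object]]:
    """Return name, size_kb and rss_kb for every shared ('s') segment."""
    segments: List[Dict[str, object]] = []
    current = None
    for line in smaps_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and not fields[0].endswith(":"):
            # Segment header: range, permissions, offset, device, inode, [path].
            current = None
            if fields[1].endswith("s"):
                current = {"name": " ".join(fields[5:]), "size_kb": 0, "rss_kb": 0}
                segments.append(current)
        elif current is not None and len(fields) == 3 and fields[0] in ("Size:", "Rss:"):
            key = "size_kb" if fields[0] == "Size:" else "rss_kb"
            current[key] = int(fields[1])
    return segments


if __name__ == "__main__":
    # Usage: pass a pid as the first argument, or nothing for this process.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/smaps") as smaps:
        for seg in shared_segments(smaps.read()):
            if seg["size_kb"]:
                pct = 100 * seg["rss_kb"] / seg["size_kb"]
                print(f"{seg['name']}: {seg['rss_kb']} of {seg['size_kb']} kB resident ({pct:.1f}%)")
```

For the buffer cache segment shown above, this works out to 10596 / 145072, or roughly 7.3%.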

(System V) shared memory is special, and has its own shared memory specific limits in the Linux kernel (shmmax, shmall, shmmni).
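These limits can be read from /proc/sys/kernel. Below is a minimal Python sketch that does so; the function name read_shm_limits and its base parameter are hypothetical, and only there to keep the example self-contained.

```python
from pathlib import Path
from typing import Dict

# Units differ per limit: shmmax is in bytes (largest single segment),
# shmall is in pages (system-wide total), shmmni is a number of segments.
LIMITS = ("shmmax", "shmall", "shmmni")


def read_shm_limits(base: str = "/proc/sys/kernel") -> Dict[str, int]:
    """Read the System V shared memory limits from the given directory."""
    return {name: int(Path(base, name).read_text()) for name in LIMITS}


if __name__ == "__main__":
    for name, value in read_shm_limits().items():
        print(f"{name:<8}{value}")
```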

Memory allocation dangers

Here is a thing to consider: regular (non-shared) allocations can be larger than the available memory, provided they are not actually paged in.

Let me show you what that looks like. For memory consumption tests, I created a small (open source) tool called eatmemory-rust. It lets you allocate a number of megabytes, and choose whether to page them in or not. The tool executes an mmap() to define the total memory/VSZ (which is a Rust implementation detail that can change with other versions).

$ strace -e mmap target/debug/eatmemory -s 2000 -a 10 -v
...
total memory        :       828 MB
available memory    :       416 MB
free memory         :       163 MB
used memory         :       233 MB
total swap          :      2158 MB
free swap           :      2117 MB
used swap           :        40 MB
creating vec with size 2000 MB
mmap(NULL, 2097156096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35507ed000
pointer to hog: 0x7f35507ed010
allocating vec for 10 MB (-1 means creation size)
total memory        :       828 MB
available memory    :       405 MB
free memory         :       152 MB
used memory         :       243 MB
total swap          :      2158 MB
free swap           :      2117 MB
used swap           :        40 MB
done. press enter to stop and deallocate

First look at total memory. This is a small virtual machine with 1024M of memory configured, of which 828M is available to Linux; the remainder is taken by the kernel itself.

The eatmemory tool is executed under strace to print the mmap() system calls. The tool itself is told to create a vec with a size of 2000M, and the mmap() call reflects the 2000M being reserved (2097156096 bytes, which is 2000 MiB plus one 4096-byte page). Only 10M is actually allocated, and thus paged in.

If you look at the memory statistics, available memory drops by 11MB, which is in the ballpark of the allocated size. Reserving the 2000M did not take any memory or perform other work; it is the use of the memory (the allocation) that actually pages it in.
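The same effect can be reproduced without eatmemory. The following Python sketch is not how eatmemory works internally; it is a small stand-in that maps anonymous private memory, touches only part of it, and reads the resident size of the process from /proc/self/statm before and after. The function names and the sizes used are my own choices, and the sketch assumes Linux.

```python
import mmap
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
MB = 1024 * 1024


def rss_bytes() -> int:
    """Resident set size of this process: second field of /proc/self/statm, in pages."""
    with open("/proc/self/statm") as statm:
        return int(statm.read().split()[1]) * PAGE_SIZE


def reserve_then_touch(reserve_mb: int, touch_mb: int):
    """Map reserve_mb of anonymous memory, write to the first touch_mb of it.

    Returns (RSS growth caused by the mmap, RSS growth caused by the writes).
    """
    before = rss_bytes()
    region = mmap.mmap(-1, reserve_mb * MB,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = rss_bytes()
    # Writing one byte per page is enough to make the kernel page it in.
    for offset in range(0, touch_mb * MB, PAGE_SIZE):
        region[offset] = 1
    after_touch = rss_bytes()
    region.close()
    return after_map - before, after_touch - after_map


if __name__ == "__main__":
    grew_map, grew_touch = reserve_then_touch(reserve_mb=256, touch_mb=10)
    print(f"RSS growth after mmap of 256 MB : {grew_map // MB} MB")
    print(f"RSS growth after touching 10 MB : {grew_touch // MB} MB")
```

The mmap itself should barely change the resident size, while touching the pages should increase it by roughly the amount that was touched.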

Why would that be important?

First take a step back. This lazy allocation mechanism means that a freshly started machine has probably not yet allocated all the memory it will have in use after running actual production for a few days.

I see lots of tests that don't take this lazy allocation/demand paging mechanism into account, and thus run with a different memory allocation than the system will actually have in production, while the test is meant to prove that it runs well.

But there is an even more dangerous thing to think about.

It means multiple processes can allocate, but not yet page in, more memory than the system can provide (maybe even including swap). It then simply takes time for memory pages to be used/touched, and thus paged in, for the system to get low on memory, eat up its swap, and then run out of memory. Lots of people think that if a process is able to start up, it has been able to allocate its resources. The 'eatmemory' demo above shows you this is not the case.
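One way to get a feeling for how much memory has been promised, as opposed to paged in, is the Committed_AS statistic in /proc/meminfo, which can be compared with the total of memory and swap. The Python sketch below does that comparison; the function names are hypothetical, and it is a rough indication only, not an exact prediction of when a system runs out of memory.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo style text into {name: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name] = int(fields[0])
    return values


def commit_headroom_kb(meminfo: Dict[str, int]) -> int:
    """Memory plus swap, minus what has been committed to processes.

    A negative number means more has been promised than the system can provide.
    """
    return meminfo["MemTotal"] + meminfo["SwapTotal"] - meminfo["Committed_AS"]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"MemTotal + SwapTotal : {info['MemTotal'] + info['SwapTotal']} kB")
    print(f"Committed_AS         : {info['Committed_AS']} kB")
    print(f"headroom             : {commit_headroom_kb(info)} kB")
```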
