YugabyteDB: PostgreSQL memory optimization

#postgres #yugabyte #performance #internals

This blog post is about a memory optimization introduced in YugabyteDB version 2.15.3.0. It can make a huge difference in a common situation: multiple backends that have each allocated a large amount of memory.

I laid the groundwork for investigating PostgreSQL and YugabyteDB here.

This post continues that investigation and adds some remarks.

Tricked by swap

First of all, swap makes things less obvious. I did some testing in PostgreSQL using the memory-allocating anonymous PLpgSQL block from the other blog post, which allocates 100,000 chunks of 16384 characters in an array. I watched the size of the backend process via:

ps -o comm,vsz,rss -p PID

Note: this is ptmalloc, the standard Linux/glibc malloc implementation.

I mainly watched the RSS size. It grows when the array in PLpgSQL is allocated, and it declines when the anonymous block terminates, but a certain amount stays allocated (more on "certain" later). After the block has terminated, it looks like this:

$ ps -o comm,vsz,rss -p 7934
COMMAND            VSZ   RSS
postgres        403704 62772

In other words: 63M is still resident after the array allocation was terminated and freed at the PostgreSQL level.

I then dumped the ptmalloc statistics by executing call (void) malloc_stats() in a debugger attached to the backend. malloc_stats() prints to standard error, so the details showed up in the PostgreSQL server log file:

Arena 0:
system bytes     =  230248448
in use bytes     =     922032
Total (incl. mmap):
system bytes     =  230649856
in use bytes     =    1323440
max mmap regions =          3
max mmap bytes   =    1224704

What? This says 230649856 bytes are still allocated, which is roughly 231M?

I guess the heading of this section gave it away: it turned out that on my small PostgreSQL test virtual machine lots of the pages were swapped to disk. The purpose of swapping is to make room in memory by moving allocated pages to the swap device. Pages that are swapped out are not resident anymore, and therefore the RSS was lower:

cat /proc/7934/smaps_rollup
Rss:               63172 kB
Pss:               59816 kB
...
Swap:             168440 kB
SwapPss:          168303 kB
Locked:                0 kB

Rss + Swap (63172 kB + 168440 kB = 231612 kB) pretty much equals the total system bytes reported by malloc_stats().
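The addition of Rss and Swap can also be scripted. The following is a minimal sketch in Python, assuming a Linux kernel that provides /proc/PID/smaps_rollup. The function names are made up for this illustration, and the sample text is the output shown above.

```python
def parse_smaps_rollup(text):
    """Return a dict of field name -> size in kB from smaps_rollup text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Value lines look like "Rss:   63172 kB". Anything else (the
        # address range header, "..." placeholders, blank lines) is skipped.
        if (len(parts) == 3 and parts[0].endswith(":")
                and parts[1].isdigit() and parts[2] == "kB"):
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def rss_plus_swap_kb(text):
    """Resident pages plus swapped-out pages, in kB."""
    fields = parse_smaps_rollup(text)
    return fields.get("Rss", 0) + fields.get("Swap", 0)


def read_smaps_rollup(pid):
    """Read the live file for a process (Linux only, needs permission)."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        return f.read()


# Sample taken from the output shown in this post.
SAMPLE = """\
Rss:               63172 kB
Pss:               59816 kB
Swap:             168440 kB
SwapPss:          168303 kB
Locked:                0 kB
"""

if __name__ == "__main__":
    total_kb = rss_plus_swap_kb(SAMPLE)
    print(f"Rss + Swap = {total_kb} kB")
```

For a live backend, `rss_plus_swap_kb(read_smaps_rollup(pid))` gives the same sum. Looking at Rss alone hides whatever has been swapped out, which is exactly what tricked me here.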

ptmalloc tcache; leaving "certain size" allocated

The total allocated size when the array is in use is:

ps -o comm,vsz,rss -p 7934
COMMAND            VSZ   RSS
postgres        1788872 1427444

Which is 1.4G of memory resident for the process.

Like we saw above: once the memory is not needed anymore, it is freed on the PostgreSQL level, which calls free() to tell the operating system that it doesn't need the memory anymore.

"The operating system" in reality is not the operating system, but the ptmalloc memory allocator which sits between PostgreSQL and the operating system.

The ptmalloc allocator does free memory: once the memory is deallocated by PostgreSQL, the allocated size goes from 1.4G to 231M. But not all of that 231M is in use. The malloc_stats() output shows that most of this memory is not actually in use:

system bytes     =  230649856
in use bytes     =    1323440

System (allocated size): 231M, actually in use: 1.3M.
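To put a number on the gap between "system" and "in use", the malloc_stats() text can be parsed. This is a small sketch, assuming the log output has the same layout as the excerpt earlier in this post. The helper names are hypothetical.

```python
import re

# Sample taken from the server log excerpt shown in this post.
SAMPLE = """\
Arena 0:
system bytes     =  230248448
in use bytes     =     922032
Total (incl. mmap):
system bytes     =  230649856
in use bytes     =    1323440
max mmap regions =          3
max mmap bytes   =    1224704
"""


def parse_malloc_stats(text):
    """Return {section: {key: value}} for malloc_stats() style output."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(":"):
            # A header such as "Arena 0:" or "Total (incl. mmap):"
            current = line[:-1]
            sections[current] = {}
            continue
        match = re.match(r"(.+?)\s*=\s*(\d+)$", line)
        if match and current is not None:
            sections[current][match.group(1)] = int(match.group(2))
    return sections


def idle_bytes(text, section="Total (incl. mmap)"):
    """Bytes obtained from the OS that the application does not use."""
    stats = parse_malloc_stats(text)[section]
    return stats["system bytes"] - stats["in use bytes"]


def in_use_percent(text, section="Total (incl. mmap)"):
    """Share of the allocator's memory that is actually in use."""
    stats = parse_malloc_stats(text)[section]
    return 100.0 * stats["in use bytes"] / stats["system bytes"]


if __name__ == "__main__":
    print(f"idle bytes: {idle_bytes(SAMPLE)}")
    print(f"in use: {in_use_percent(SAMPLE):.2f}%")
```

For the sample above, less than 1% of the memory held by the allocator is in use by PostgreSQL.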

Why is that? This turns out to be a property of the tcache, or thread local cache. This is a small collection of chunks of memory that is kept to allow quick access without the need for concurrency protection to obtain it.

If we take a more detailed view into the malloc allocations using the malloc_info() function, we see:

<malloc version="1">
<heap nr="0">
<sizes>
  <size from="65" to="65" total="65" count="1"/>
  <size from="785" to="785" total="785" count="1"/>
  <size from="1041" to="1041" total="10410" count="10"/>
  <size from="2081" to="2081" total="2081" count="1"/>
  <size from="32817" to="32817" total="32817" count="1"/>
  <size from="1909969" to="64458145" total="229119079" count="7"/>
  <unsorted from="21601" to="21601" total="21601" count="1"/>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="23" size="229320534"/>
<system type="current" size="230256640"/>
...

The important part here is in the overview with this line:

  <size from="1909969" to="64458145" total="229119079" count="7"/>

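This line describes 7 free chunks with sizes between 1.9M and 64.4M, about 229M in total, that have been preserved. Why 7? To the best of my knowledge, this is a setting in ptmalloc: by default the tcache, or thread local cache, keeps up to 7 chunks per size class for potential reuse. One caveat is that the tcache by default only handles small chunks (up to about 1 KB), so chunks this large may in fact be held in the arena's regular free lists instead. Either way, this is the reason a PostgreSQL backend keeps a certain amount of memory allocated, despite PostgreSQL having free()-d it.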
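The malloc_info() output is XML, so it can be summarized with a short script. The sketch below uses Python's standard xml.etree.ElementTree module. It assumes, as far as I understand the format, that the entries under `<sizes>` describe free chunks. The sample is the excerpt from this post, with the closing tags added so that it parses; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Excerpt from this post; </heap> and </malloc> were added to make it parse.
SAMPLE = """\
<malloc version="1">
<heap nr="0">
<sizes>
  <size from="65" to="65" total="65" count="1"/>
  <size from="785" to="785" total="785" count="1"/>
  <size from="1041" to="1041" total="10410" count="10"/>
  <size from="2081" to="2081" total="2081" count="1"/>
  <size from="32817" to="32817" total="32817" count="1"/>
  <size from="1909969" to="64458145" total="229119079" count="7"/>
  <unsorted from="21601" to="21601" total="21601" count="1"/>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="23" size="229320534"/>
<system type="current" size="230256640"/>
</heap>
</malloc>
"""


def free_chunk_buckets(xml_text):
    """Return one dict per entry under <sizes>, for every heap."""
    root = ET.fromstring(xml_text)
    buckets = []
    for heap in root.iter("heap"):
        sizes = heap.find("sizes")
        if sizes is None:
            continue
        for entry in sizes:  # both <size> and <unsorted> elements
            buckets.append({
                "heap": int(heap.get("nr")),
                "kind": entry.tag,
                "from": int(entry.get("from")),
                "to": int(entry.get("to")),
                "total": int(entry.get("total")),
                "count": int(entry.get("count")),
            })
    return buckets


def largest_bucket(xml_text):
    """The bucket that holds the most bytes."""
    return max(free_chunk_buckets(xml_text), key=lambda b: b["total"])


if __name__ == "__main__":
    big = largest_bucket(SAMPLE)
    print(f'{big["count"]} chunks, {big["total"]} bytes, '
          f'sizes {big["from"]}..{big["to"]}')
```

For the excerpt, the largest bucket is the one with the 7 big chunks, and it accounts for nearly all of the bytes listed under `<sizes>`.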

A low level description of ptmalloc allocation can be found here.

YugabyteDB memory allocation prior to version 2.15.3.0

At Yugabyte, we use another memory allocator, tcmalloc, which is more aggressive than ptmalloc in holding on to freed memory. If we run the exact same anonymous PLpgSQL block on YugabyteDB YSQL, it will allocate the same amount of memory for the array, because the exact same thing is happening. That should not be a surprise: we reuse the PostgreSQL codebase, so it's the same code.

However, in YugabyteDB versions prior to 2.15.3.0, the picture after execution was different. PostgreSQL would free the allocated memory, because it does the same as native PostgreSQL, but at the memory allocator level more memory would stay allocated than with ptmalloc, because the tcmalloc allocator holds on to freed memory much more aggressively. This is in fact an optimization: it allows memory to be reused with the least amount of effort. tcmalloc is optimised for threading, in other words for concurrent access to memory.

Freeing less memory can lead to memory wastage: memory that was allocated earlier in the life of a database connection might not be needed for its current processing, but is still kept allocated. This is true for every single YSQL connection.

YugabyteDB memory allocation in version 2.15.3.0 and higher

The native tcmalloc behavior of holding on to freed memory can therefore lead to inefficient memory usage. This inefficiency has been noticed, and the memory management has been improved with YugabyteDB version 2.15.3.0.

When we perform the same allocation in YSQL with version 2.15.3.0, the same array is allocated, and the exact same thing is happening as before. What is different is that after PostgreSQL has allocated and freed the memory, YugabyteDB also releases as much memory as possible at the memory allocator level:

Memory overview before running the array allocation:

$ ps -o cmd,vsz,rss -p 7659
CMD                            VSZ   RSS
postgres: yugabyte yugabyte 2361544 42740

Running the anonymous PLpgSQL block that allocates the array:

$ ps -o cmd,vsz,rss -p 7659
CMD                            VSZ   RSS
postgres: yugabyte yugabyte 2361544 423200

Once the anonymous block has finished (or is terminated):

$ ps -o cmd,vsz,rss -p 7659
CMD                            VSZ   RSS
postgres: yugabyte yugabyte 2361544 39120

The amount that is freed varies, because in YugabyteDB a backend is not only the postgres process, but also contains the YugabyteDB threads, which each require memory. So if you test and monitor this, expect the amount of memory that is left after the work has been performed to differ from run to run.
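The three measurements above can be compared with a few lines of Python. This is only a sketch: the helper names are hypothetical, the RSS values are the ones from the ps output in this post, and reading /proc/PID/status assumes Linux.

```python
def parse_ps_rss(ps_output):
    """Return RSS in kB from the output of `ps -o cmd,vsz,rss -p PID`."""
    lines = [line for line in ps_output.splitlines() if line.strip()]
    # The last line holds the values. RSS is the last column, so a command
    # name containing spaces does not matter.
    return int(lines[-1].split()[-1])


def read_rss_kb(pid):
    """Read VmRSS for a live process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None


def summarize(before_kb, peak_kb, after_kb):
    """Compare RSS before, during and after a memory hungry statement."""
    return {
        "grown_kb": peak_kb - before_kb,
        "released_kb": peak_kb - after_kb,
        "retained_kb": after_kb - before_kb,  # negative: below the start
    }


# ps output from this post, after the anonymous block has finished.
SAMPLE_AFTER = """\
CMD                            VSZ   RSS
postgres: yugabyte yugabyte 2361544 39120
"""

if __name__ == "__main__":
    after_kb = parse_ps_rss(SAMPLE_AFTER)
    print(summarize(before_kb=42740, peak_kb=423200, after_kb=after_kb))
```

In this run the RSS afterwards is even slightly below the starting value, which illustrates the variability mentioned above.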

Conclusion

The purpose of this blog post is twofold. First, to detail the memory allocation of PostgreSQL beyond what the previous post covered.

Second, to introduce the optimisation that was added with YugabyteDB version 2.15.3.0, which frees memory not only at the database level, but also at the level of the memory allocator. This makes YugabyteDB YSQL processes reduce their memory usage even more than PostgreSQL processes after lots of memory has been allocated.