This blogpost is about a memory optimization that is applied to YugabyteDB version 18.104.22.168 that we applied that can make a huge difference in common situations with large amounts of memory allocated to multiple backends.
I laid the groundwork for investigating PostgreSQL and YugabyteDB here.
This is a further investigation, and some additional remarks.
First of all, swap makes things less obvious. I did some testing in PostgreSQL by using the memory allocating anonymous PLpgSQL block in the other blog, and allocated 100,000 chunks of 16384 characters in an array, watch the size via:
ps -o comm,vsz,rss -p PID
Note: this is ptmalloc, the standard Linux/glibc malloc implementation.
I mainly watch the RSS size. What I saw is that it grows when the array in PLpgSQL is allocated, and the size declines when the anonymous block is terminated, but leaves a certain size (more on "certain" later). This looks like this:
$ ps -o comm,vsz,rss -p 7934 COMMAND VSZ RSS postgres 403704 62772
Alias: 63M still resident after the array allocation was terminated and freed at the PostgreSQL level.
I then dumped the ptmalloc statistics using
call (void) malloc_stats(), and obtained the details in the PostgreSQL server log file:
Arena 0: system bytes = 230248448 in use bytes = 922032 Total (incl. mmap): system bytes = 230649856 in use bytes = 1323440 max mmap regions = 3 max mmap bytes = 1224704
What? This says 230649856 is still allocated, which is roughly 231M??
I guess the heading of the paragraph gave it way, but it turned out that on my small PostgreSQL test virtual machine lots of the pages were swapped to disk. The purpose of swapping is to make room by moving allocated pages to the swap device. And if they are swapped, obviously they are not resident anymore, and therefore the RSS was lower:
cat /proc/7934/smaps_rollup Rss: 63172 kB Pss: 59816 kB ... Swap: 168440 kB SwapPss: 168303 kB Locked: 0 kB
Rss+Swap pretty much equals Total system bytes.
The total allocated size when the array is in use is:
ps -o comm,vsz,rss -p 7934 COMMAND VSZ RSS postgres 1788872 1427444
Which is 1.4G of memory resident for the process.
Like we saw above: once the memory is not needed anymore, it is freed on the PostgreSQL level, which calls
free() to tell the operating system that it doesn't need the memory anymore.
"The operating system" in reality is not the operating system, but the ptmalloc memory allocator which sits between PostgreSQL and the operating system.
The ptmalloc allocator does free memory: once the memory is deallocated by PostgreSQL, the allocated size goes from 1.4G to 231M. But not all of that 231M is in use. In fact: the malloc_stats show that lots of this memory is not actually in use:
system bytes = 230649856 in use bytes = 1323440
System (allocated size): 231M, actually in use: 1.3M.
Why is that? This turns out to be a property of the tcache, or thread local cache. This is a small collection of chunks of memory that is kept to allow quick access without the need for concurrency protection to obtain it.
If we take a more detailed view into the malloc allocations using the
malloc_info() function, we see:
<malloc version="1"> <heap nr="0"> <sizes> <size from="65" to="65" total="65" count="1"/> <size from="785" to="785" total="785" count="1"/> <size from="1041" to="1041" total="10410" count="10"/> <size from="2081" to="2081" total="2081" count="1"/> <size from="32817" to="32817" total="32817" count="1"/> <size from="1909969" to="64458145" total="229119079" count="7"/> <unsorted from="21601" to="21601" total="21601" count="1"/> </sizes> <total type="fast" count="0" size="0"/> <total type="rest" count="23" size="229320534"/> <system type="current" size="230256640"/> ...
The important part here is in the overview with this line:
<size from="1909969" to="64458145" total="229119079" count="7"/>
These are 7 chunks with sizes between 1.9M and 64.4M of which 7 chunks have been preserved. Why 7? To the best of my knowledge, this is a setting in ptmalloc. This is the tcache or thread local cache preserving some chunks for potential reuse. This is the reason a PostgreSQL backend has a certain amount of memory it keeps allocated, despite PostgreSQL having free()-d it.
A low level description of ptmalloc allocation can be found here.
At Yugabyte, we use another memory allocator, tcmalloc, that is more aggressive than ptmalloc. If we perform the exact same anonymous PLpgSQL block on YugabyteDB YSQL, it will allocate the same amount of memory for the array, because the exact same thing is happening. That should not be a surprise because we reuse the PostgreSQL codebase, so it's the same code.
However, in YugabyteDB versions prior to 22.214.171.124, the allocated memory for execution would be freed by PostgreSQL, because it's doing the same as native PostgreSQL, but at the memory allocator level it would keep more memory allocated than ptmalloc, because the tcmalloc allocator is much more aggressive. This is in fact an optimization for being able to reuse the memory with the least amount of effort. tcmalloc is optimised for threading, alias concurrent access to the memory.
Freeing lesser memory can lead to memory wastage because memory that has been allocated in the database connection's past might not be necessary for its current processing, but still, it is kept allocated. This is true for every single YSQL connection.
The tcmalloc native memory usage optimization might lead to inefficient memory usage. This inefficiency has been noticed, and the memory management has been improved with YugabyteDB version 126.96.36.199.
When we perform the same allocation in YSQL with version 188.8.131.52, the same array is allocated, and the exact same thing is happening as before. What is different is that after the PostgreSQL memory allocation and freeing of it, YugabyteDB will free as much memory as possible:
Memory overview before running the array allocation:
$ ps -o cmd,vsz,rss -p 7659 CMD VSZ RSS postgres: yugabyte yugabyte 2361544 42740
Running the anonymous PLpgSQL block that allocates the array:
$ ps -o cmd,vsz,rss -p 7659 CMD VSZ RSS postgres: yugabyte yugabyte 2361544 423200
Once the array allocation expires (or is terminated):
$ ps -o cmd,vsz,rss -p 7659 CMD VSZ RSS postgres: yugabyte yugabyte 2361544 39120
The amount that is freed is variable, because in YugabyteDB we do not only have the postgres process, but also the YugabyteDB threads, which each require memory. So if you test and monitor this, you will see the amount of memory that is left after work has been performed to be variable.
The purpose of this blogpost is to detail the memory allocation of PostgreSQL further beyond the previous post.
Second, introduce the optimisation that was added with YugabyteDB version 184.108.40.206, which frees memory allocations not only on the database level, but also on the level of the memory allocator. This makes YugabyteDB YSQL processes reduce memory even more than PostgreSQL processes when lots of memory have been allocated.