Frits Hoogland for YugabyteDB

Posted on Sep 5, 2022 • Updated on Nov 7, 2022

Postgres memory allocation and OS memory allocation

#postgres #linux #performance #yugabyte

This is a technical investigation into memory allocation at the PostgreSQL level and at the Linux level (memory allocation via malloc()). Any comments are welcome!

A postgres backend is a standalone operating system process that is forked from the postmaster. On Linux this is a very memory-efficient operation, because many pages can be shared between the postmaster and the newly forked backend. However, as soon as the backend starts executing code of its own to initialise itself, it allocates and writes its own memory pages. These pages are private to the process and remain paged in, so they add to the RSS and are not shared.

PostgreSQL administers its own working memory through its palloc() allocator. palloc() calls malloc() underneath, but adds some extra services, such as keeping an administration of memory areas (memory contexts). This is very helpful, because it makes it possible to diagnose these memory areas.

This investigation was done with a PostgreSQL server compiled from the version 11 source. I wrote a small anonymous PL/pgSQL code block that does nothing but fill an array, for the sake of allocating memory. This is its source code:

do $$
  declare
    array text[];
    counter int:=1;
    max_counter int:=5000;
  begin
    loop
      array[counter]:=repeat('x',16384);
      counter:=counter+1;
      if counter = max_counter then
        exit;
      end if;
    end loop;
    perform pg_sleep(60);
  end $$;

It simply loops, filling an array with repeat('x',16384). Once the counter reaches max_counter, it calls pg_sleep(60) so that the array stays allocated for a while for analysis (pg_sleep() can easily be interrupted with ctrl-c).

When I create a postgres backend by logging in with psql, it occupies 18M (RSS), but proportionally (PSS, because of the shared pages) 7.9M:

# smem -k | grep -e [P]ID -e 6197
  PID User     Command                         Swap      USS      PSS      RSS
 6197 postgres postgres: postgres postgres        0     5.5M     7.9M    18.0M
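To make the three columns concrete, here is a minimal sketch of how USS, PSS and RSS can be derived from text in the format of /proc/PID/smaps, which is, as far as I know, where smem gets its numbers. The sample text and the function name memory_totals are made up for illustration; on a real system you would read the smaps file of the backend instead.

```python
import re

# Hypothetical sample in the format of /proc/<pid>/smaps (values are made up).
SAMPLE_SMAPS = """\
00400000-00b2c000 r-xp 00000000 fd:00 1234 /usr/local/pgsql/bin/postgres
Rss:                4096 kB
Pss:                1024 kB
Shared_Clean:       4096 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Swap:                  0 kB
01d3f000-02a10000 rw-p 00000000 00:00 0 [heap]
Rss:                2048 kB
Pss:                2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Swap:                  0 kB
"""

# Matches counter lines such as "Rss:   4096 kB"; mapping headers do not match.
FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def memory_totals(smaps_text):
    """Sum the per-mapping counters and derive USS, PSS and RSS in kB."""
    sums = {}
    for line in smaps_text.splitlines():
        match = FIELD.match(line)
        if match:
            name, kb = match.group(1), int(match.group(2))
            sums[name] = sums.get(name, 0) + kb
    return {
        # USS: pages only this process has mapped.
        "USS": sums.get("Private_Clean", 0) + sums.get("Private_Dirty", 0),
        # PSS: private pages plus a proportional share of shared pages.
        "PSS": sums.get("Pss", 0),
        # RSS: everything resident, shared or not.
        "RSS": sums.get("Rss", 0),
        "Swap": sums.get("Swap", 0),
    }


print(memory_totals(SAMPLE_SMAPS))
```

In this made-up sample the shared postgres executable counts fully towards RSS, only partly towards PSS, and not at all towards USS, which is the same pattern as in the smem output above.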

When I run the anonymous PL/pgSQL block, the size increases because of the allocations for the array:

# smem -k | grep -e [P]ID -e 6236
  PID User     Command                         Swap      USS      PSS      RSS
 6236 postgres postgres: postgres postgres        0    85.1M    87.6M    97.9M

The memory allocations for RSS and PSS increased by roughly 80M, which is not shocking: the array data needs to be stored.
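As a rough sanity check, the raw payload of the array can be worked out from the code block. This is a sketch that only counts the 'x' characters; it ignores per-value headers and allocator overhead, so the real allocation will be somewhat higher.

```python
ELEMENT_BYTES = 16384  # length of repeat('x',16384)
MAX_COUNTER = 5000     # max_counter in the anonymous block

# The loop exits when counter equals max_counter after the increment,
# so elements 1 .. max_counter-1 are assigned.
elements = MAX_COUNTER - 1

payload_bytes = elements * ELEMENT_BYTES
payload_mib = payload_bytes / (1024 * 1024)

print(elements, payload_bytes, round(payload_mib, 1))
```

The payload comes out a little under the roughly 80M growth that smem reports, which fits with the remainder being overhead.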

We can see the PostgreSQL memory allocations via gdb, the GNU debugger, by calling print MemoryContextStats(TopMemoryContext). For version 11 this is the only reasonably convenient way I am aware of to get the PostgreSQL-level memory details and sizes:

# gdb -p 6197
...
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fe154a1d9eb in epoll_wait () from /lib64/libc.so.6
(gdb) print MemoryContextStats(TopMemoryContext)
$1 = void
(gdb) c
Continuing.

The command prints the memory details in the PostgreSQL log file:

TopMemoryContext: 67424 total in 5 blocks; 12256 free (7 chunks); 55168 used
...
  TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
    PortalContext: 1024 total in 1 blocks; 448 free (0 chunks); 576 used:
      SPI Proc: 32768 total in 3 blocks; 11528 free (2 chunks); 21240 used
        SPI TupTable: 8192 total in 1 blocks; 6544 free (0 chunks); 1648 used
        PLpgSQL per-statement data: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
        expanded array: 82299680 total in 5006 blocks; 32040 free (29 chunks); 82267640 used
...
Grand total: 83565536 bytes in 5233 blocks; 401536 free (148 chunks); 83164000 used

This shows the start of the MemoryContextStats() dump. Each memory context lists only its own statistics, so the TopMemoryContext line does not include the array allocation.
A little further down, under TopPortalContext, the allocation for the array can be found in 'expanded array', which shows 82267640 bytes (78.5M) in use. This is all logical and reasonable.
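The lines in the dump have a regular layout, so they are easy to pick apart. Below is a minimal sketch that parses one context line and converts the used bytes to MiB. The regular expression and the function name parse_context are my own and are based only on the lines shown above, not on any PostgreSQL tooling.

```python
import re

# Layout of a context line as it appears in the dump above.
LINE = re.compile(
    r"^\s*(?P<name>.+?): (?P<total>\d+) total in (?P<blocks>\d+) blocks; "
    r"(?P<free>\d+) free \((?P<chunks>\d+) chunks\); (?P<used>\d+) used"
)


def parse_context(line):
    """Return the numbers of one memory context line, or None if it is not one."""
    match = LINE.match(line)
    if match is None:
        return None
    stats = {key: int(value) for key, value in match.groupdict().items() if key != "name"}
    stats["name"] = match.group("name")
    return stats


line = "        expanded array: 82299680 total in 5006 blocks; 32040 free (29 chunks); 82267640 used"
ctx = parse_context(line)

# total minus free should give the used figure
print(ctx["name"], ctx["total"] - ctx["free"], round(ctx["used"] / 2**20, 1))
```

Note that the 'Grand total' line uses a slightly different layout ("bytes in" rather than "total in"), so this sketch returns None for it.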

When the call to pg_sleep() expires, the anonymous block ends. This releases all memory allocated by the PL/pgSQL block, and thus the array. This is completely logical, and to be sure we can call MemoryContextStats() again to validate the PostgreSQL-level allocations:

TopMemoryContext: 67424 total in 5 blocks; 13888 free (13 chunks); 53536 used
...
  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
(no further sub allocations in TopPortalContext)
...
Grand total: 1067136 bytes in 161 blocks; 289296 free (113 chunks); 777840 used

The grand total shows 1067136 bytes, so approximately 1M is allocated to the backend.

However, if we look at the Linux-level allocation with smem again:

# smem -k | grep -e [P]ID -e 6236
  PID User     Command                         Swap      USS      PSS      RSS
 6236 postgres postgres: postgres postgres        0    66.4M    68.8M    79.2M

There is still a lot of memory allocated to the process: approximately 60M more than a freshly started backend, memory which we know from the PostgreSQL memory dump is no longer allocated by PostgreSQL. Where did it go?

Luckily, there is some help here too. PostgreSQL uses its own memory allocation logic via palloc(), but that eventually calls malloc(). malloc has some diagnostics that can be called to see what is going on. One of these is the function malloc_stats(). It can be called inside the process by attaching with gdb, which is the same principle as the call to the PostgreSQL function MemoryContextStats():

(gdb) call (void) malloc_stats()

This prints an overview (to the standard error of the process) from malloc, which sits in between the operating system and PostgreSQL as userland code:

Arena 0:
system bytes     =   63832064
in use bytes     =     917696
Total (incl. mmap):
system bytes     =   71811072
in use bytes     =    8896704
max mmap regions =          3
max mmap bytes   =    7979008

This explains the gap between the operating system reporting roughly 70M (PSS) and the PostgreSQL-level memory dump saying it released memory down to approximately 1M!

What we see is 'Arena 0', which is, roughly put, the administration of malloc()'s memory allocations for this process. It has allocated 63832064 bytes (60.9M) from the 'system', whilst only 917696 bytes (0.9M) are actually in use (by PostgreSQL). What malloc() tries to do is keep memory allocated, to prevent having to deallocate and allocate over and over.
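The amount that malloc() is holding on to can be computed directly from this output. Below is a minimal sketch that parses the malloc_stats() text shown above and subtracts 'in use bytes' from 'system bytes' for Arena 0. The function name parse_malloc_stats is my own.

```python
# The malloc_stats() output from above, pasted in as text.
MALLOC_STATS = """\
Arena 0:
system bytes     =   63832064
in use bytes     =     917696
Total (incl. mmap):
system bytes     =   71811072
in use bytes     =    8896704
max mmap regions =          3
max mmap bytes   =    7979008
"""


def parse_malloc_stats(text):
    """Group the 'key = value' lines under the section header that precedes them."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            key, value = line.split("=")
            sections[current][key.strip()] = int(value)
    return sections


stats = parse_malloc_stats(MALLOC_STATS)
arena = stats["Arena 0"]

# Memory obtained from the OS by malloc() but not handed out to PostgreSQL.
retained = arena["system bytes"] - arena["in use bytes"]
print(retained, round(retained / 2**20, 1))
```

So nearly all of the memory in the arena is free from malloc()'s point of view, yet it still counts towards the process size as seen by Linux.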

The question is how this works further down the line. I tried running some simple SQL statements to see if this would trigger malloc() to release more memory, which it didn't. My idea was that the simple SQL statements would show that the execution didn't need all of that memory anymore.

There are descriptions on forums that say that memory not in use is marked with madvise() calls. I don't know if that can be seen at the Linux level, and if it can, whether the /proc/PID/smaps statistic LazyFree would then be set, indicating there is memory still allocated that can be freed by the OS. (I did not find madvise() calls being executed by the PostgreSQL backend that ran the PL/pgSQL code, which I validated by setting a breakpoint on madvise(), so it does not seem likely to me that this happens.)
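One way to look for this is to list the Private_Dirty and LazyFree counters per mapping from the smaps file. The sketch below does that for smaps-formatted text; the sample values are made up and the function name per_mapping is my own. It assumes a kernel that reports a LazyFree line.

```python
import re

# Mapping header: address range, permissions, offset, device, inode, optional name.
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+ \S+ \S+ \S+ \S+\s*(.*)$")
FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")

# Hypothetical sample in the format of /proc/<pid>/smaps (values are made up).
SAMPLE_SMAPS = """\
01d3f000-05a10000 rw-p 00000000 00:00 0                                  [heap]
Rss:               65536 kB
Private_Dirty:     65536 kB
LazyFree:              0 kB
7fe154000000-7fe154800000 rw-p 00000000 00:00 0
Rss:                8192 kB
Private_Dirty:      8192 kB
LazyFree:              0 kB
"""


def per_mapping(smaps_text, fields=("Private_Dirty", "LazyFree")):
    """Return {mapping name: {field: kB}} for the requested fields."""
    result = {}
    name = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            # Anonymous mappings have no name in smaps.
            name = header.group(1) or "[anon]"
            result.setdefault(name, dict.fromkeys(fields, 0))
            continue
        field = FIELD.match(line)
        if field and name is not None and field.group(1) in fields:
            result[name][field.group(1)] += int(field.group(2))
    return result


for mapping, counters in per_mapping(SAMPLE_SMAPS).items():
    print(mapping, counters)
```

If the free memory had been marked with madvise(MADV_FREE), I would expect a non-zero LazyFree for the heap mapping. In the made-up sample it is zero and everything is Private_Dirty.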

Based on what Linux tells me in the smaps file, it seems that these pages, being private and paged in, cannot be reused. But I hope someone can tell me whether this is indeed the case and the backend needs to stop to truly release the memory for reuse, or whether there is something I have not seen yet. I am happy to add it to this investigation!

PS: the smem utility can be found in EPEL.

malloc_info:

There is another function that a program using malloc (ptmalloc) can call to obtain the status of its memory: malloc_info.

If you want to use this for PostgreSQL and have it write its output to a file, call the function in the following way. Note that the (int) cast on fopen() only works as long as the returned FILE pointer fits in 32 bits, as it does here:

(gdb) call (int) fopen("/tmp/malloc_info.txt", "wb")
$1 = 14729056
(gdb) call (int) malloc_info(0,$1)
$2 = 0
(gdb) call (int) fclose($1)
$3 = 0

This produces the malloc-level allocation administration:

$ cat /tmp/malloc_info.txt
<malloc version="1">
<heap nr="0">
<sizes>
  <size from="65" to="65" total="65" count="1"/>
  <size from="209" to="209" total="209" count="1"/>
  <size from="369" to="369" total="369" count="1"/>
  <size from="785" to="785" total="785" count="1"/>
  <unsorted from="1041" to="23595025" total="62780003" count="19"/>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</heap>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<total type="mmap" count="3" size="7979008"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</malloc>

This does not seem to show the memory areas actually in use, but rather the malloc administration of the chunks.
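Since the output is XML, it can be summarised with a few lines of code. Below is a minimal sketch using the output shown above. It assumes, as far as I understand the format, that the 'fast' and 'rest' totals count free chunks held by malloc, and the function name summarize is my own.

```python
import xml.etree.ElementTree as ET

# The malloc_info output from above, pasted in as text.
MALLOC_INFO = """<malloc version="1">
<heap nr="0">
<sizes>
  <size from="65" to="65" total="65" count="1"/>
  <size from="209" to="209" total="209" count="1"/>
  <size from="369" to="369" total="369" count="1"/>
  <size from="785" to="785" total="785" count="1"/>
  <unsorted from="1041" to="23595025" total="62780003" count="19"/>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</heap>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<total type="mmap" count="3" size="7979008"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</malloc>"""


def summarize(xml_text):
    """Pick the process-wide totals (direct children of <malloc>) out of the XML."""
    root = ET.fromstring(xml_text)
    totals = {t.get("type"): int(t.get("size")) for t in root.findall("total")}
    return {
        # Assumption: 'fast' and 'rest' are free chunks kept by malloc.
        "free_bytes": totals.get("fast", 0) + totals.get("rest", 0),
        "mmap_bytes": totals.get("mmap", 0),
        "system_current": int(root.find("system[@type='current']").get("size")),
        "system_max": int(root.find("system[@type='max']").get("size")),
    }


summary = summarize(MALLOC_INFO)
for key, value in summary.items():
    print(key, value, round(value / 2**20, 1))
```

Under that assumption, the free chunks account for almost all of the current 'system' size, and the 'max' value reflects the peak while the array was still allocated.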

tcmalloc and yugabyte

If you are on YugabyteDB, the memory allocator is not ptmalloc but tcmalloc, a different memory allocator.

This allocator also has a facility to show information about its memory administration, but it requires different commands. It too requires attaching with gdb to execute the command:

(gdb) print yb::TcMallocStats()
$4 = {<std::__1::__basic_string_common<true>> = {<No data fields>}, static __short_mask = 1, static __long_mask = 1,
...etc...
(gdb) printf "%s", $4.__r_.__value_.__l.__data_
------------------------------------------------
MALLOC:       20362064 (   19.4 MiB) Bytes in use by application
MALLOC: +      6537216 (    6.2 MiB) Bytes in page heap freelist
MALLOC: +       836400 (    0.8 MiB) Bytes in central cache freelist
MALLOC: +       270848 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +      1419136 (    1.4 MiB) Bytes in thread cache freelists
MALLOC: +      2621440 (    2.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     32047104 (   30.6 MiB) Actual memory used (physical + swap)
MALLOC: +      4718592 (    4.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     36765696 (   35.1 MiB) Virtual address space used
MALLOC:
MALLOC:            757              Spans in use
MALLOC:             10              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
------------------------------------------------
Total size of freelists for per-thread caches,
transfer cache, and central cache, by size class
------------------------------------------------
class   1 [        8 bytes ] :      586 objs;   0.0 MiB;   0.0 cum MiB;    0.000 overhead MiB;    0.000 cum overhead MiB
...etc...
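The summary part of this output is a small sum: the rows without a sign or with a '+' add up to the row marked '='. Below is a minimal sketch that checks this for the figures shown above. The regular expression and the function name check_totals are my own and are based only on this output.

```python
import re

# The summary rows of the tcmalloc output from above, pasted in as text.
TCMALLOC_STATS = """\
MALLOC:       20362064 (   19.4 MiB) Bytes in use by application
MALLOC: +      6537216 (    6.2 MiB) Bytes in page heap freelist
MALLOC: +       836400 (    0.8 MiB) Bytes in central cache freelist
MALLOC: +       270848 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +      1419136 (    1.4 MiB) Bytes in thread cache freelists
MALLOC: +      2621440 (    2.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =     32047104 (   30.6 MiB) Actual memory used (physical + swap)
MALLOC: +      4718592 (    4.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =     36765696 (   35.1 MiB) Virtual address space used
"""

# Optional sign, byte count, the MiB figure in parentheses, then the label.
ROW = re.compile(r"^MALLOC:\s+([+=]?)\s*(\d+) \(\s*[\d.]+ MiB\) (.+)$")


def check_totals(text):
    """Return {label of each '=' row: True if the running sum matches it}."""
    running = 0
    checks = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # separator lines and other rows
        sign, value, label = match.group(1), int(match.group(2)), match.group(3).strip()
        if sign == "=":
            checks[label] = running == value
        else:
            running += value
    return checks


print(check_totals(TCMALLOC_STATS))
```

Unlike the ptmalloc output earlier, this overview separates memory sitting in freelists from memory already released to the OS, which makes the same kind of gap visible at a glance.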
