Measuring memory usage with yb_stats, part 1: OS memory

In YugabyteDB, yb_stats takes all available statistics and either shows them (in ad-hoc mode) or saves them for later investigation (in snapshot mode). One of the topics that can be investigated is memory and memory usage.

It's important to realise that memory usage is visible from 3 different sources, and that a fourth important source doesn't externalise its memory usage: YSQL, alias PostgreSQL.

Node exporter: Node exporter shows total machine memory statistics. That means that the node exporter view is a superset of information, of which the tablet server and master memory usage are a part.
Tablet server: A tablet server shows statistics at two levels: at the level of the memory allocator, tcmalloc, and memory allocations tracked by the memtracker framework. This excludes operating system level usage, such as the mapped executable and libraries.
Master server: A master server shows statistics at two levels: at the level of the memory allocator, tcmalloc, and memory allocations tracked by the memtracker framework. This excludes operating system level usage, such as the mapped executable and libraries.
YSQL (PostgreSQL): The YSQL layer consists of processes that are very similar to those of PostgreSQL. For these, there is no general view that shows YSQL-only memory usage, just like this is not available in PostgreSQL. Newer versions of PostgreSQL have a heap function, and newer versions of YSQL provide similar functions, but these are per process and not externalised.

node exporter basic memory statistics

To see the basic operating system level memory statistics, use the --gauges-enable flag, because node exporter exports the memory statistics as gauges, and filter on the statistic names starting with node_memory_Mem and node_memory_Swap. This is what that looks like:

➜ yb_stats --gauges-enable --stat-name-match 'node_memory_(Mem|Swap)' --hostname-match 80:9300
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.

Time between snapshots:    1.476 seconds
192.168.66.80:9300   gauge    node_memory_MemAvailable_bytes                                              1433604096.000000         -151552
192.168.66.80:9300   gauge    node_memory_MemFree_bytes                                                    896311296.000000         -151552
192.168.66.80:9300   gauge    node_memory_MemTotal_bytes                                                  1900789760.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapFree_bytes                                                  2206199808.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapTotal_bytes                                                 2206199808.000000              +0

These are the statistics for a single node; in a cluster you would typically have more nodes. These statistics come from the Linux /proc meta-filesystem, and the source is /proc/meminfo.

Description:

MemTotal: this is the total amount of memory available to the operating system, and therefore does not fluctuate.
MemFree: this statistic is hardly ever useful, because it is a minimal amount of memory that the kernel keeps free (set by vm.min_free_kbytes). During the machine lifetime, MemFree will decrease until it reaches vm.min_free_kbytes, and then the kernel tries to keep it at that number. This is not an indicator of available memory.
MemAvailable: this is the most useful memory statistic of all: it tells the amount of memory that the kernel considers to be available for immediate use. This number can fluctuate, but consistently low values mean that too much memory is allocated. Low is approximately lower than 5%.
Swap: these statistics are useful to understand if swap is in use, and if so, how much of the swap is used.
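
The checks described above can be scripted. The following is a minimal sketch that parses the ad-hoc output shown earlier and derives the MemAvailable percentage and the amount of swap in use. The line layout is taken from the sample output only, the function names are hypothetical, and taking the 5% threshold relative to MemTotal is an assumption.

```python
import re

# Sample ad-hoc output, copied from the run shown above.
SAMPLE = """\
192.168.66.80:9300   gauge    node_memory_MemAvailable_bytes      1433604096.000000         -151552
192.168.66.80:9300   gauge    node_memory_MemFree_bytes            896311296.000000         -151552
192.168.66.80:9300   gauge    node_memory_MemTotal_bytes          1900789760.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapFree_bytes          2206199808.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapTotal_bytes         2206199808.000000              +0
"""

# Columns: hostname:port, the literal "gauge", statistic name, value, difference.
LINE = re.compile(r"^(\S+)\s+gauge\s+(\S+)\s+([0-9.]+)\s+([+-]\d+)$")

# Assumption: "low" means MemAvailable below 5% of MemTotal.
LOW_PCT = 5.0


def parse_gauges(text):
    """Return {statistic name: (value, difference)} for a single host."""
    gauges = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            _host, name, value, diff = match.groups()
            gauges[name] = (float(value), int(diff))
    return gauges


def available_pct(gauges):
    """MemAvailable as a percentage of MemTotal."""
    total = gauges["node_memory_MemTotal_bytes"][0]
    available = gauges["node_memory_MemAvailable_bytes"][0]
    return 100.0 * available / total


gauges = parse_gauges(SAMPLE)
pct = available_pct(gauges)
swap_used = (gauges["node_memory_SwapTotal_bytes"][0]
             - gauges["node_memory_SwapFree_bytes"][0])

print(f"MemAvailable: {pct:.1f}% of MemTotal")
print(f"Swap in use: {swap_used:.0f} bytes")
if pct < LOW_PCT:
    print("Warning: MemAvailable is low")
```

The dictionary is keyed by statistic name only, so this sketch fits output filtered to one host with --hostname-match; for multiple hosts the key would need to include the hostname.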

node exporter detailed memory statistics

The /proc/meminfo meta-file contains many more statistics. However, the collection of figures in /proc/meminfo is quite diverse: the statistics all relate to memory usage, but they form several independent groups. That makes them hard to understand.

However, some of the statistics in /proc/meminfo can be combined into a meaningful overview.

anonymous and cached

➜ yb_stats --hostname-match 80:9300 --adhoc-metrics-diff --gauges-enable --stat-name-match 'memory_(MemAvailable|Cached|MemTotal|AnonPages|MemFree|Swap)'
Begin ad-hoc in-memory metrics snapshot created, press enter to create end snapshot for difference calculation.

Time between snapshots:    1.864 seconds
192.168.66.80:9300   gauge    node_memory_AnonPages_bytes                                                  351571968.000000        +3092480
192.168.66.80:9300   gauge    node_memory_Cached_bytes                                                     805695488.000000              +0
192.168.66.80:9300   gauge    node_memory_MemAvailable_bytes                                              1219727360.000000        -3207168
192.168.66.80:9300   gauge    node_memory_MemFree_bytes                                                    552243200.000000        -3207168
192.168.66.80:9300   gauge    node_memory_MemTotal_bytes                                                  1900789760.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapFree_bytes                                                  2206199808.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapTotal_bytes                                                 2206199808.000000              +0

This is how to look at these figures:

alloc	size
MemTotal	1900M
MemFree	552M
Cached	806M
Anonymous	352M
%others%	190M
MemAvailable	1220M

MemTotal is what is available to Linux.
MemFree is what is actually free.
Cached is all file backed memory (including what is in use).
Anonymous is memory that is not backed by a file (private allocations such as heap and stack), and is mostly in use.
Others: MemTotal-(MemFree+Cached+Anonymous) leaves around 190M for kernel allocations and others.
MemAvailable: the kernel thinks it can make 1220M available for use without the need to involve paging to the swap device.

These are dynamic classifications. The total memory size is 1900M, and with the above distribution of memory, there is 1220M available (from free, but also other memory that can be 'repurposed').
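
As a worked version of the breakdown above, the following sketch recomputes the "others" figure from the unrounded byte values in the yb_stats output. The variable names and the choice of 1M = 1,000,000 bytes are assumptions made for illustration. With the unrounded values the remainder comes out at about 191M, which matches the 190M in the table once rounding is taken into account.

```python
# Values in bytes, copied from the yb_stats output above.
mem_total = 1900789760
mem_free = 552243200
cached = 805695488
anon_pages = 351571968
mem_available = 1219727360

# What the three named categories account for, and what is left over
# for kernel allocations and everything else.
accounted = mem_free + cached + anon_pages
others = mem_total - accounted


def mb(value):
    """Round a byte count to whole megabytes (1M = 1,000,000 bytes)."""
    return round(value / 1_000_000)


for label, value in [
    ("MemTotal", mem_total),
    ("MemFree", mem_free),
    ("Cached", cached),
    ("Anonymous", anon_pages),
    ("%others%", others),
    ("MemAvailable", mem_available),
]:
    print(f"{label:<14}{mb(value):>6}M")
```

MemAvailable is not part of the sum: it is the kernel's own estimate, which overlaps with MemFree and with the reclaimable part of Cached.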

YSQL allocations

Now let's see how that works by making PostgreSQL allocate an array in PLpgSQL. The code for that:

set my.size to 1020;
set my.count to 1000000;
do 
$$
  declare
    array text[];
    counter int:= current_setting('my.count',true);
    size int:= current_setting('my.size',true);
  begin
    raise info 'Pid: %', pg_backend_pid();
    raise info 'Array element size: %, count: %', size, counter;
    for count in 1..counter loop
      array[count]:=repeat('x',size);
    end loop;
    raise info 'done!';
    perform pg_sleep(60);
  end 
$$;

Execute it in the following way:

Log on to PostgreSQL (psql)/YugabyteDB (ysqlsh).
Execute yb_stats --hostname-match 192.168.66.80:9300 --adhoc-metrics-diff --gauges-enable --stat-name-match 'memory_(MemAvailable|Cached|MemTotal|AnonPages|MemFree)', and wait for the message to indicate it has taken the begin snapshot: Begin ad-hoc in-memory metrics snapshot created, press enter to create end snapshot for difference calculation.
Execute the above anonymous PLpgSQL procedure with the memory counter adjusted for your available memory, and wait until it says 'done!'.
Press enter in the terminal where yb_stats has made its first in-memory snapshot, so it takes another snapshot and shows the difference.

This is what it shows for me:

➜ yb_stats --hostname-match 80:9300 --adhoc-metrics-diff --gauges-enable --stat-name-match 'memory_(MemAvailable|Cached|MemTotal|AnonPages|MemFree|Swap)'
Begin ad-hoc in-memory metrics snapshot created, press enter to create end snapshot for difference calculation.

Time between snapshots:   12.998 seconds
192.168.66.80:9300   gauge    node_memory_AnonPages_bytes                                                 1452355584.000000     +1057705984
192.168.66.80:9300   gauge    node_memory_Cached_bytes                                                     187269120.000000      -620449792
192.168.66.80:9300   gauge    node_memory_MemAvailable_bytes                                               150810624.000000     -1025363968
192.168.66.80:9300   gauge    node_memory_MemFree_bytes                                                     89874432.000000      -416792576
192.168.66.80:9300   gauge    node_memory_MemTotal_bytes                                                  1900789760.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapCached_bytes                                                   7012352.000000        +7012352
192.168.66.80:9300   gauge    node_memory_SwapFree_bytes                                                  2154631168.000000       -51568640
192.168.66.80:9300   gauge    node_memory_SwapTotal_bytes                                                 2206199808.000000              +0

The approximate size of the allocation for the anonymous PLpgSQL procedure is 1000000*1048=1,048,000,000. That is close to +1057705984. It is clear that the array is allocated from anonymous memory.
Most of the memory allocated to anonymous memory is removed from available memory.
620M is taken from Cached, 417M is taken from Free memory.
52M is paged out to the swap device.
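
The observations above can be checked with a little arithmetic on the differences that yb_stats reported. This is a minimal sketch using only the numbers from the output and the 1048 bytes per element used in the estimate; the variable names are hypothetical.

```python
# Differences in bytes between the two snapshots, copied from the output above.
anon_delta = 1057705984
cached_delta = -620449792
free_delta = -416792576
available_delta = -1025363968
swap_free_delta = -51568640

# Estimate from the text: 1,000,000 array elements of about 1048 bytes each.
elements = 1_000_000
bytes_per_element = 1048
estimate = elements * bytes_per_element

# How close the AnonPages growth is to the estimate.
ratio = anon_delta / estimate

# Memory taken from Cached and Free together, and the part of the
# AnonPages growth that was subtracted from MemAvailable.
from_cached_and_free = -(cached_delta + free_delta)
available_share = -available_delta / anon_delta

# A decrease in SwapFree means that amount was paged out to swap.
paged_out = -swap_free_delta

print(f"estimate: {estimate}, AnonPages growth: {anon_delta}, ratio: {ratio:.3f}")
print(f"taken from Cached + Free: {from_cached_and_free}")
print(f"share of growth removed from MemAvailable: {available_share:.1%}")
print(f"paged out to swap: {paged_out}")
```

The ratio stays within about 1% of the estimate, and roughly 97% of the AnonPages growth shows up as a decrease in MemAvailable.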

The conclusion is that memory used by YSQL processes is mainly taken from Anonymous memory.

Tablet server allocations

What about the YugabyteDB tablet server?
For this, I created a small configurable PLpgSQL procedure with variables:

set my.tables to 8;
set my.rows to 4000;
set my.size to 1020;
set my.tablets to 1;
do
$$
  declare
    tables  int:= current_setting('my.tables',true);
    rows    int:= current_setting('my.rows',true);
    size    int:= current_setting('my.size',true);
    tablets int:= current_setting('my.tablets',true);
  begin
    --
    raise info 'Pid: %', pg_backend_pid();
    raise info 'Nr. tables: %, rows: %, textsize: %', tables, rows, size;
    raise info 'rowsize: %, total size: %',
      pg_column_size(1)+pg_column_size(repeat('x',size)),
      rows*(pg_column_size(1)+pg_column_size(repeat('x',size)));
    for table_counter in 1..tables loop
      raise info 'table: %/%', table_counter, tables;
      execute format('drop table if exists table%s cascade', table_counter);
      execute format('create table table%s (id int primary key, f1 text) split into %s tablets', table_counter, tablets);
      for row_counter in 1..rows loop
        execute format('insert into table%s (id, f1) values (%s, ''%s'')', table_counter, row_counter, repeat('x',size));
      end loop;
    end loop;
    raise info 'Done.';
  end
$$;

What this does is quite naively create tables and insert data into them (this is not highly optimised code).
However, the purpose of the above code is to create memtables and fill these up, because these will increase the tablet server memory footprint.

The results below are from a small 3-node YugabyteDB RF3 cluster, which means that despite the leaders of the tablets being distributed over the cluster nodes, each cluster node will get either a leader or a follower of each tablet.

I ran this after creating a new cluster, on a freshly started first node. Before running it, I took an in-memory snapshot in the same way as in the earlier YSQL level testcase:

yb_stats --hostname-match 80:9300 --adhoc-metrics-diff --gauges-enable --stat-name-match 'memory_(MemAvailable|Cached|MemTotal|AnonPages|MemFree|Swap)'

After the procedure has run and created and filled 8 tables, press enter to show the difference:

➜ ./target/release/yb_stats --hostname-match 80:9300 --adhoc-metrics-diff --gauges-enable --stat-name-match 'memory_(MemAvailable|Cached|MemTotal|AnonPages|MemFree|Swap)'
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.

Time between snapshots:  138.111 seconds
192.168.66.80:9300   gauge    node_memory_AnonPages_bytes                                                  306294784.000000      +145600512
192.168.66.80:9300   gauge    node_memory_Cached_bytes                                                     691851264.000000       +76595200
192.168.66.80:9300   gauge    node_memory_MemAvailable_bytes                                              1310580736.000000      -152317952
192.168.66.80:9300   gauge    node_memory_MemFree_bytes                                                    759713792.000000      -229179392
192.168.66.80:9300   gauge    node_memory_MemTotal_bytes                                                  1900789760.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapFree_bytes                                                  2206199808.000000              +0
192.168.66.80:9300   gauge    node_memory_SwapTotal_bytes                                                 2206199808.000000              +0

The server did not perform any paging, which can be seen from the unchanged Swap statistics.
Memory Available decreased, which is logical, because the tablet server allocates memory for the memtables.
The memory was taken from Free Memory, which is also logical, because just after startup, the tablet server and other processes have not paged in a lot of memory yet.
The Cached Memory statistic increased somewhat. This is because the new tables have their memtables written in memory only, but actual writes are still performed for the WAL to guarantee persistency, and these will need cached pages.
The amount of Anonymous Memory increased the most. This is mostly the tablet server increasing its memory to facilitate the memtables, along with all kinds of surrounding allocations.
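
The same kind of accounting can be applied to the tablet server run. The sketch below uses only the differences from the output above; the function and key names are hypothetical, and the interpretation in the comments is a reading of these numbers, not something yb_stats reports.

```python
# Differences in bytes between the two snapshots, copied from the output above.
deltas = {
    "AnonPages": 145600512,
    "Cached": 76595200,
    "MemAvailable": -152317952,
    "MemFree": -229179392,
    "SwapFree": 0,
}


def summarize(deltas):
    """Relate the growth in AnonPages and Cached to the drop in MemFree."""
    growth = deltas["AnonPages"] + deltas["Cached"]
    return {
        # Total growth of the two categories that increased.
        "growth": growth,
        # Fraction of that growth that is anonymous memory.
        "anon_share": deltas["AnonPages"] / growth,
        # Part of the MemFree decrease not explained by AnonPages or Cached.
        "unexplained": -deltas["MemFree"] - growth,
        # MemAvailable dropped less than MemFree by this amount.
        "available_vs_free": deltas["MemAvailable"] - deltas["MemFree"],
    }


summary = summarize(deltas)
for name, value in summary.items():
    print(f"{name}: {value}")
```

About two thirds of the growth is anonymous memory, and AnonPages plus Cached explain all but roughly 7M of the MemFree decrease. MemAvailable dropped about 77M less than MemFree, which is close to the Cached increase; that fits the idea that most cached pages are still considered available.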

Conclusion

In most cases, whenever YSQL processes or the tablet servers get active, they can allocate memory. The memory that is allocated is mostly Anonymous memory. Therefore, the anonymous memory statistic is a good indicator of YSQL and tablet server memory usage.

Please mind that because YugabyteDB relies on operating system caching for IO, just like PostgreSQL does, it needs a reasonable amount of 'Cached' memory too. The hard part is understanding what a 'reasonable amount' is; however, it needs to be considerable enough to function as a buffer for reads and writes.