This is an ongoing series of posts about linux memory specifics. This post is about swapping.
Something I have seen many times is the situation where the memory allocations of a server were carefully planned and configured, and yet the server started allocating swap for no apparent reason. That is what this blogpost is about. Careful memory management is key for any database to perform well, and PostgreSQL is no exception.
So... does linux just "happen" to do that, as an act of randomness? The answer is no: linux doesn't do random things; it tries to do nothing, and if it must do something, it does the bare minimum. So there is a reason for it!
Let's do some tests. I got a small virtual machine in my lab:
$ eatmemory-rust/target/release/eatmemory -q
total memory : 1860 MB
available memory : 1554 MB
free memory : 1347 MB
used memory : 142 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
(I am using my eatmemory tool, which you can download and compile yourself)
As you can see, the total amount of memory that is visible to the kernel is 1860M (the total virtual machine 'physical' memory is 2G), only 142M is used, and 1554M is 'available'.
As explained in what is free memory in linux, you have to look at available memory; the purpose of free memory is to keep a minimal amount of memory free for direct usage. I also got 2158M of swap, and none of it is currently used.
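If you don't have the eatmemory tool at hand, the regular free utility shows the same figures (a minimal example; the exact column layout can differ slightly between procps versions):
$ free -m
The 'available' column is the one to use for sizing, not the 'free' column.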
Swap, as described here, is a way to create a second-stage memory area for memory pages, in order to be able to overcome over-allocation of main memory. It increases the virtual memory size, not the physical memory size. Also, the naming is 'swap', but strictly speaking swapping is the act of moving an entire process address space to the swap device, which linux doesn't do. Linux moves individual pages to the swap device, which is called 'paging', but for historical reasons it is still referred to as 'swapping'.
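To see which swap devices or files are configured on a system, and how much of them is in use, the standard tools can be used (a simple sketch, not specific to this setup):
$ swapon --show
$ cat /proc/swaps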
The first thing to notice is that out of the total 2G, the actual amount of memory for userspace applications is around 1.5G, which is the available memory. Obviously this is a small VM, and the relative amount the kernel takes will be much smaller on real-life-sized, and thus bigger, linux instances.
Naive application memory sizing
It is not uncommon for people to use the total amount of memory dedicated to the virtual machine (2G here) when calculating how much memory the applications on it can use. This is a mistake, and if all of that memory is actually used, the system will swap:
$ eatmemory-rust/target/release/eatmemory -s 2000 -v
total memory : 1860 MB
available memory : 1570 MB
free memory : 1355 MB
used memory : 137 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
creating vec with size 2000 MB
pointer to hog: 0x7f5436fff010
allocating vec for -1 MB (-1 means creation size)
total memory : 1860 MB
available memory : 25 MB
free memory : 80 MB
used memory : 1735 MB
total swap : 2158 MB
free swap : 1586 MB
used swap : 571 MB
done. press enter to stop and deallocate
This system has 2G of memory configured for the virtual machine, but only 1570M of it is available inside the operating system. I created a vec of 2000M (the total amount of memory configured for this virtual machine), and then paged it in by writing zeros into it. Because there was only 1570M of memory available to the operating system, this over-allocated memory, so the linux kernel had to move 571M worth of pages to the swap device to facilitate the 2G allocation.
So: the first reason for seemingly random swapping is naive sizing of application memory. Please mind I created a very extreme case here: in real life memory will be allocated (as virtual memory size, VSZ) but not yet paged in; paging in happens over time, at a rate that depends on usage, and can therefore push memory usage over the limit at a seemingly random moment. The actual physical allocation of a process is its resident set size (RSS), which can be optimised by sharing pages using copy on write (COW); many process-based databases take advantage of this.
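To see the difference between allocated (VSZ) and actually paged-in (RSS) memory for a process, you can compare the two figures, for example for the eatmemory process (a minimal sketch; substitute whatever process you are interested in):
$ ps -o pid,vsz,rss,comm -p $(pgrep eatmemory)
$ grep -E '^(VmSize|VmRSS)' /proc/$(pgrep eatmemory)/status
VmSize corresponds to VSZ and VmRSS to RSS; a large gap between the two means a large part of the virtual allocation is not resident in memory.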
Exact application memory sizing
But that was naive and simplistic. How about looking at the available memory, and taking only what is actually available? Please mind I removed the in-use swap allocations first:
$ eatmemory-rust/target/release/eatmemory -q
total memory : 1860 MB
available memory : 1573 MB
free memory : 1582 MB
used memory : 138 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
If you're wondering how I cleaned up swap: this can be done by restarting the linux instance (and obviously not over-allocating again), but it can also be done by disabling and re-enabling swap (swapoff -a; swapon -a). Use this with care! Swapped pages are in-use pages; otherwise it wouldn't have made sense to swap them out in the first place. When swap is disabled, the kernel has to move the swapped pages back into main memory, so disabling swap can take time: the time it takes linux to read the pages from swap back into memory. Also, this move can over-allocate main memory again and then stall, because without swap there is nowhere else to go (and the kernel will probably have to perform OOM killing if memory is truly exhausted).
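If you go the swapoff/swapon route, you can watch the pages being moved back while it runs (a minimal sketch; run the watch command in a second terminal):
$ sudo swapoff -a && sudo swapon -a
$ watch -n 1 free -m
You should see used swap shrink and used memory grow while swapoff is draining the pages back into main memory.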
The overview with eatmemory shows that available memory is 1573M, which suggests 1573M is available (you can't beat that logic, right?). Let's allocate 1500M:
$ eatmemory-rust/target/release/eatmemory -s 1500 -v
total memory : 1860 MB
available memory : 1573 MB
free memory : 1582 MB
used memory : 138 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
creating vec with size 1500 MB
pointer to hog: 0x7f5f3e3ff010
allocating vec for -1 MB (-1 means creation size)
total memory : 1860 MB
available memory : 40 MB
free memory : 55 MB
used memory : 1677 MB
total swap : 2158 MB
free swap : 2157 MB
used swap : 1 MB
done. press enter to stop and deallocate
Despite the allocation being sized to the available memory, and actually being rounded down to a little less than that, a tiny amount of swap is already used. What is going on?
The problem with the above logic is that it treats the work done by the linux operating system, as well as the application's own processing, as if it does not use memory. It is reasonable and logical that, for the process to start, linux needs to handle the request to start the eatmemory executable, and the executable needs to load its pages and allocate its runtime memory before it can even start allocating the dedicated memory area.
In other words: if you take the operating system's current allocation into account, but do not reserve memory for the administration the operating system still has to perform, it will take that memory anyway, and if that pushes usage past the main memory size, it will swap.
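Part of that operating system administration is visible in the kernel-side counters of /proc/meminfo (a simple sketch; this only shows the kernel's own allocations, not the startup overhead of the application itself):
$ grep -E '^(Slab|SReclaimable|SUnreclaim|KernelStack|PageTables|Buffers)' /proc/meminfo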
Leaving room for linux and still swapping
Now with the above considered, let's free up the system once again, and perform an allocation where the operating system overhead is considered. So how about allocating 1200M:
$ eatmemory-rust/target/release/eatmemory -s 1200 -v
total memory : 1860 MB
available memory : 1588 MB
free memory : 1612 MB
used memory : 138 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
creating vec with size 1200 MB
pointer to hog: 0x7f4210fff010
allocating vec for -1 MB (-1 means creation size)
total memory : 1860 MB
available memory : 356 MB
free memory : 381 MB
used memory : 1370 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
done. press enter to stop and deallocate
After allocating and paging in the memory, swap is still empty. The goal is reached! Whoohoo!
With the above allocation still active, let's simulate some usage, let's say a backup, and create a big file:
$ dd if=/dev/zero of=/tmp/tempfile bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.24035 s, 845 MB/s
$ dd if=/tmp/tempfile of=/tmp/tempfile_
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 3.98084 s, 263 MB/s
And then check memory again:
$ eatmemory-rust/target/release/eatmemory -q
total memory : 1860 MB
available memory : 517 MB
free memory : 77 MB
used memory : 1191 MB
total swap : 2158 MB
free swap : 1959 MB
used swap : 199 MB
What?! The allocated memory was carefully and conservatively sized, which was validated right after allocating and paging in the memory, and yet after performing "some work" my system still shows swap being used!
Yes. This is intended behaviour. File management performed via buffered IO operations uses the page cache, which adds to memory pressure.
Memory in linux is managed via LRU (least recently used) lists. That means the kernel keeps track of how often and how recently a page is used, and orders memory pages on those lists, which include linux page cache pages (Cached in /proc/meminfo). So performing perfectly normal file operations, such as copying a file, counts towards memory usage just like an application explicitly allocating memory, but obviously only for the duration of the operation.
That means that pages of an anonymous memory allocation, such as the above allocation by the eatmemory tool, can get a lower rank on the LRU list than pages used for the linux page cache, and when file operations create memory pressure, the anonymous pages can get swapped out.
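You can watch this happen by keeping an eye on the page cache and swap while the file operations run (a minimal sketch; run it in a second terminal during the dd commands):
$ watch -n 1 "grep -E '^(MemAvailable|Cached|SwapFree)' /proc/meminfo"
Cached grows while the files are written and copied, and once memory pressure sets in, SwapFree starts to drop.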
The amount of swapped-out pages can be seen in the /proc/PID/smaps file per memory allocation:
$ less /proc/$(pgrep eatmemory)/smaps
...
7f4210fff000-7f425c000000 rw-p 00000000 00:00 0
Size: 1228804 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 1005568 kB
Pss: 1005568 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 1005568 kB
Referenced: 1005568 kB
Anonymous: 1005568 kB
LazyFree: 0 kB
AnonHugePages: 888832 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 223236 kB
SwapPss: 223236 kB
Locked: 0 kB
THPeligible: 1
VmFlags: rd wr mr mw me ac sd
...
This is the anonymous allocation of the 1200M (Size). If you look at the 'Swap' figure, you can see the amount of memory that has been moved to swap for this allocation.
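The per-process total of swapped-out memory can also be summed over all mappings in smaps, or read directly from the VmSwap line in the status file (a minimal sketch, using the eatmemory process again):
$ awk '/^Swap:/ {sum += $2} END {print sum " kB"}' /proc/$(pgrep eatmemory)/smaps
$ grep VmSwap /proc/$(pgrep eatmemory)/status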
Does this make sense? Is this logical? Yes: when I ran 'eatmemory' for the allocation, it created a vec ("array") and filled it with the value 0, causing the memory to actually get allocated. However, after that operation, the memory pages are not touched anymore(!)
By creating a file and then copying that same file, I touch the cached pages of the file more than once, pushing them higher up the LRU list. So when the system comes under memory pressure because of the buffered IO, it chooses my anonymous memory allocation pages to be paged out, because these are used less.
When is swap an issue?
Does the usage of swap mean my main memory is too small and thus indicate a performance issue?
These are two questions.
An answer to the first question, 'is my main memory too small': if linux chooses to swap pages, it means it solved memory pressure by performing the swapping. So at the time of the swapping, there was too little memory available. Linux can relieve memory pressure by evicting file-backed pages, such as unused cached pages or memory-mapped file pages (I do believe these count as cached too), or by swapping anonymous pages, because those cannot simply be discarded.
Linux controls the balance between the two with the 'swappiness' parameter. A higher value favours swapping, and thus anonymous pages being moved to swap; a lower value favours reclaiming file-backed pages. Despite this sounding like a magical parameter, in most cases it's best left untouched: solve the issue, don't massage the solution.
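For reference, this is how swappiness can be inspected and, if you really must, changed at runtime (a minimal sketch; the value 10 is just an example, and a sysctl set this way is not persistent across reboots):
$ cat /proc/sys/vm/swappiness
$ sudo sysctl -w vm.swappiness=10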
An answer to the second question, 'does this indicate a performance problem': allocated swap indicates a past moment of memory pressure, during which pages were swapped out, which is something different from a performance problem.
The pages that have been swapped out may never be touched again, for example if they contain the executable code used to start up a process, and are then perfectly fine sitting in swap, because they are never needed again. This is also the reason swap remains allocated even when there is no current memory pressure: linux tries to do the bare minimum, to keep as much CPU as possible available for useful work. As long as a page in swap is not needed, there is no reason to perform the work of reading it back into memory.
This means that swap usage is only a problem if the reason for it, memory pressure, is still ongoing, and that memory pressure is high enough to influence performance. Swap usage is an indicator, not the problem itself. A good indicator of ongoing memory pressure is whether swapping in and out is actively happening. There are lots of ways to check that; a common, simple way is to look at the 'si'/'so' (swap in/swap out) columns of the vmstat utility.
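A sketch of that check: sample vmstat every few seconds and look at the si/so columns; zeros there mean no active swapping, regardless of how much swap happens to be allocated.
$ vmstat 5
Mind that the very first line of vmstat output shows averages since boot, so look at the subsequent lines.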
Obviously it is human nature to assume that swap usage means memory issues, and thus that if the memory issues are resolved, swap usage should be reduced. This is more a knee-jerk reaction than a careful understanding of the mechanics behind swapping. It's not uncommon to see a certain percentage of swap usage being a monitoring trigger that requires attention and action. I hope this explanation makes it clear the attention is good: swap reaching a certain percentage means there has been memory pressure. But as long as the system is not currently performing active swap in and swap out, there is nothing to be improved at that time, and trying to reduce in-use swap is really an epitome of not understanding swap. Of course, changing the earlier situation that caused the swapping is the correct thing to do.
Another solution to page cache pressure
There is another solution to the problem of file management increasing memory pressure. The increased memory pressure caused by file management comes from the mandatory use of the page cache for buffered reads and writes. Linux has no automatic mechanism to reduce the pressure caused by file management if there already is memory pressure.
However, the page cache pressure can be avoided by explicitly performing the reads and writes without page cache usage, which is called 'direct IO'. This is a specific and specialised configuration.
The linux 'dd' command has switches to enable direct IO:
$ dd if=/dev/zero of=/tmp/tempfile bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.24035 s, 845 MB/s
$ dd if=/tmp/tempfile of=/tmp/tempfile_ iflag=direct oflag=direct
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 703.468 s, 1.5 MB/s
Now looking at memory usage:
$ eatmemory-rust/target/release/eatmemory -q
total memory : 1860 MB
available memory : 381 MB
free memory : 428 MB
used memory : 1368 MB
total swap : 2158 MB
free swap : 2158 MB
used swap : 0 MB
No swap used: because of direct IO, the file management didn't increase memory pressure.
Of course there is a downside to this; nothing comes for free: by eliminating the operating system caching, the performance of the file operations drops by the latency difference between writing to memory and writing to the block device, as the dd timings above show.