The last time I was wondering on how to restrict program memory consumption. As a trophy from my previous journey I've got libmemrestrict -- a library that implements wrappers over allocation functions like
malloc to account memory usage - and ptrace-restrict -- ptrace-based tool that intercepts
mmap system calls to do the same thing.
But wait, why bother trying to work in memory restricted environment? Is that really such a common problem? Can you remember the last time when OOM killer shot your application? Do you always think about memory consumption when you're programming? I believe you don't because memory is cheap and if your application is starving -- just add a few gigs of RAM!
However, you can't add more and more RAM infinitely and it's not because you don't have an infinite amount of memory. Today's software has to deal with things like Big Data, where you just can't fit your input into an array. You need to distribute data between RAM, storage, and network. You need algorithms or at least techniques to handle data in that way.
So here I am, trying to get my hands dirty on such problems, though in somewhat trivial fashion, with a popular question -- how to sort 1 million integers (4 MiB of data) in 2 MiB of RAM? This could be generalized to the problem of sorting data that doesn't fit in RAM and this is what I will try to solve here.
The program should output the sorted array on stdout in text format.
Program should measure it's working time and output it to stderr1.
The program should work with the amount of memory at least two times smaller than file size (e.g. 2MiB of RAM for 4MiB file). To check this, we will use
Despite having such nice restriction tools, we will not use them for some methods. For example, for
mmap it will be pointless - we need to restrict physical RAM usage, not virtual (see details in the corresponding section).
Programs will be tested to solve the original problem (sort 4 MiB in 2 MiB). Also, I will run it in a virtual machine with 128 MiB of RAM to sort 500 MB (125 millions of 4 bytes) of integers.
Let's try to sort it in a naive way to get a baseline and see how much RAM we will use. Here is an implementation that simply reads the whole file into an array of integers and invokes
qsort on it (as a bonus you can find there a simple dynamic array implementation via
We'll try to sort 4MB of data with this program. With no restrictions it works:
$ ./naive 4M.bin > /dev/null 4000000 bytes sorted in 0.323146 seconds
but it's not interesting. When we try to restrict it in 2 MiB we'll get a segmentation fault.
$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted Segmentation fault
However, if we increase2 limit to 4 MiB to hold all the data we will still fail.
$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted Segmentation fault
Apparently, something is trying to get even more memory and, obviously, it's
qsort. Let's see how much memory it wants with the help of Valgrind's massif:
$ valgrind --tool=massif ./naive ints $ ms_print massif.out.10676
Here is a nice picture:
MB 8.819^ :::::::::::::::::::::::::::# | : # | : # | : # | : # | : # | : # | : # | : # | :::::::@ #:::::::::::::::::::::::: | : @ # | : @ # | : @ # | : @ # | : @ # | @@@@@@: @ # | @ : @ # | @ : @ # | :::@ : @ # | ::: @ : @ # 0 +----------------------------------------------------------------------->Gi 0 1.721
You may see a few doubling allocations up to 4 MiB -- that's my dynamic array and then four more MiB for
qsort. Here are some stats:
-------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 21 173,222,581 5,247,504 4,000,568 1,246,936 0 22 173,223,802 5,246,920 4,000,000 1,246,920 0 23 173,226,655 5,247,504 4,000,568 1,246,936 0 24 173,229,202 5,246,920 4,000,000 1,246,920 0 25 173,229,311 9,246,928 8,000,000 1,246,928 0 26 869,058,772 9,246,928 8,000,000 1,246,928 0 86.52% (8,000,000B) (heap allocation functions) malloc/new/new, --alloc-fns, etc. ->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive) | ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive) | ->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so) | ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive) | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
4 million (useful) bytes used by me and 4 more million bytes are used by
qsort_r. There is also 1.2 MB extra-heap used by massif.
Looks like in this case
qsort behaves as O(n) for space complexity.3
Ok, is it able to sort 500 MB in 128 MiB of RAM?
$ ./naive 500M.bin > /dev/null Segmentation fault
Of course, not. As for performance:
$ ./naive 4M.bin > /dev/null 4000000 bytes sorted in 0.322712 seconds
So, naive sorting works well when not restricted, fails when restricted and
qsort for some strange reason requires O(n) memory4.
mmap is a nice and hacky way to sort a large amount of data in the small amount of RAM. By using
mmap you shift the burden of balancing data between RAM and swap space to the operating system kernel.
This is how it works:
mmapwhole file with data into memory.
mmapyou get the pointer to your data.
- You invoke the sorting algorithm on data under
And that's it! From now on you won't exceed physical memory limits even though you are sorting file much larger than your RAM. To understand why is it working, you need a little knowledge about memory management in operating systems.
Every program represented by a process has its own private and isolated from other processes virtual address space. Length of address space is bound by CPU address bus width, e.g. for 32 bit CPU, it's 232 which is 4 GiB.
All allocations that happen in the process via
new or whatever else is a virtual memory allocations. That allocated virtual memory is mapped to physical memory by the kernel memory management subsystem. Usually, this is done in lazy mode. It means that whenever process asks for some amount of memory kernel will satisfy this request immediately, but will not do actual allocation -- in this case, we say that virtual memory page is not mapped (to physical page frame). Whenever such unmapped page is accessed (for example it's written) MMU will generate the "page fault" exception that kernel will handle by mapping virtual memory page to a physical page frame. From now on a page is mapped and writing by virtual address within that page will be translated by MMU to the physical address in RAM.
On the other hand, if you remember that virtual address space is bound only by CPU address bus size, you might get into the situation when the program can allocate much more memory than available in RAM. For example, your 32-bit system has only 256 MiB of RAM, but the process can allocate and use 1 GiB of memory. In this case, memory pages can't be held in RAM and going to be swapped, i.e. evicted from RAM to backing storage like a hard drive. Whenever the process requests that swapped page, kernel will fetch it from disk and load into RAM (possibly by replacing some other page).
As you can see kernel can do a pretty good job of distributing data between RAM and disk, so it's very natural to exploit this facility in our task. When we do
mmap of our file, kernel will reserve a range of virtual addresses for our file that won't be mapped. Whenever we will try to access them by changing bytes, moving array pivot or anything else, kernel will fetch data from input file on disk into the RAM. Whenever we will exhaust physical memory, kernel will evict some pages to swap space. That way we will be able to balance our data between the original file on disk, RAM and swap space.
The only restrictions in that method is virtual address space, which is not much a restriction (4 GiB for 32bit system and 256 TiB5 for 64 bit system), and swap space that can be huge because hard drives are cheap today.
Also, note that because
mmap load the whole file into virtual memory we can't use
ptrace-restrict because they account for virtual memory itself. If we try to restrict sorting 100M of data in 10M of virtual memory,
mmap program will fail:
$ qemu-x86_64 -R 10M ./mmaped 100M.bin mmap stack: Cannot allocate memory
That's not a surprise because
mmap loads whole file in virtual memory and then kernel distribute it between actual physical memory and swap. So, we need at least 100M of virtual memory (plus some extra space for
qemu itself) to map file into memory.
That's why for this sorting method I will use a virtual machine with 128 MiB of memory. Here is my mmap sorting program. And, you know what? It works like a charm!
$ free -m total used free shared buffers cached Mem: 119 42 76 0 4 16 -/+ buffers/cache: 21 97 Swap: 382 0 382 $ ll -h 500M.bin -rw-r--r-- 1 root root 477M Feb 3 05:39 500M.bin $ ./mmaped 500M.bin > /dev/null 500000000 bytes sorted in 32.250000 seconds
Here is the
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3167 root 20 0 480m 90m 90m R 84.6 76.4 1:18.65 mmaped
As you can see we use 500 MB of virtual memory6 while actual resident memory is only 90 MiB.
If we look at more detailed
vmstat output while sorting 500 MB we'll see how kernel is balancing between swap, disk cache, buffers and free memory:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 77776 2120 15104 1 27 710 971 24 34 3 1 95 1 1 1 0 2060 488 90068 1 27 785 1057 25 37 3 1 95 1 1 0 928 3400 60 89744 1 27 799 1092 25 38 3 1 94 1 0 2 1908 1928 212 92040 1 27 852 1174 26 40 4 1 94 1 0 2 3436 2360 280 93056 1 27 911 1282 28 42 4 1 94 2 0 0 5272 3688 196 94688 1 27 1066 1471 31 48 4 1 93 2 0 0 5272 3720 204 94700 1 27 1064 1469 31 48 4 1 93 2
In the beginning we had ~70 MiB of free memory, empty swap and some memory allocated for I/O buffers and disk cache. Then, we did a
mmap and all that memory had gone to disk cache. When the free memory were exhausted, kernel had started to use swap space and we can see it's increasing along with increasing I/O load. And we end up in situation where almost all of the memory is dedicated to disk cache, though it's OK because disk cache pages are first victims to steal from when we need memory for application.
So, sorting with help of
mmap is a neat hack that requires a simple understanding of memory management, but gives you a quick and easy solution to handle a large amount of data in the small amount of RAM.
All right, suppose you can't use
mmap, for example, you want to sort 5 GiB file on 32-bit system. What would you do?
There is a well-known and popular way to accomplish this named external merge sort. Motivation is simple -- if you don't have enough memory to hold your data you just use some external storage like a hard disk. Of course, you have to work with your data in a piece by piece fashion because you still have only a small amount of main memory.
External merge sort works like this:
- You split your data into chunks of available memory buffer size.
- Each chunk is sorted in a memory buffer and written to external storage.
- Now you have
filesize/buffersizechunks on your storage.
- Finally, you read
buffersize/#chunkspieces from each chunk to merge them in buffer and output to the result file.
I made some trivial, absolutely-not-optimized-in-any-sense implementation that simply works:
$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null 4194304 bytes sorted in 0.383171 seconds
We've sorted in 2 MiB of memory using 1 MiB buffer.
Now let's sort 500MB. First, disable swap -- we're handling chunks swapping by hand:
$ swapoff /dev/vda5
$ echo 3 > /proc/sys/vm/drop_caches
Check free memory:
$ free -m total used free shared buffers cached Mem: 119 28 90 0 0 6 -/+ buffers/cache: 21 97 Swap: 0 0 0
We'll use 50M as a buffer -- 10 times smaller than file size.
$ ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 120.115180 seconds
Holy crap, two minutes! Why is that? Well, the main killer of performance here is I/O. There are 3 things that cause lots of I/O and slow things down. On the split phase we're sequentially reading file slipping it through a small buffer. On the merge phase we're constantly opening and closing chunk files. And last but not least is output -- on merge phase we output whole buffer (50 MB, 12.5 millions of integer) to stdout that, despite redirecting it to
/dev/null, creating a heavy load. We may turn it off. All in all, in
mmaped we output everything in a single pass in the end of program and doesn't account it in performance counters. So if we turn off output we'll run in ~90 seconds - 25% faster.
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 87.140000 seconds
As for memory consumption -- it's fine. If we try to run it under massif we'll see that peak consumption is our buffer size plus some extra heap.
-------------------------------------------------------------------------------- Command: ./external-merge 500M.bin 50000000 Massif arguments: (none) ms_print arguments: massif.out.17423 -------------------------------------------------------------------------------- MB 47.75^ ::::: |#::::::@:::::::::::@:::::::::@:::@::::@::::@::::::::@::::@::::@:::@ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ 0 +----------------------------------------------------------------------->Gi 0 332.5 Number of snapshots: 98 Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94] -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 119,690 584 568 16 0 2 123,141 50,004,496 50,000,568 3,928 0 3 4,814,014 50,005,080 50,001,136 3,944 0 4 4,817,234 50,005,080 50,001,136 3,944 0 99.99% (50,001,136B) (heap allocation functions) malloc/new/new, --alloc-fns, etc. ->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge) | ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge) | ->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)
If we restrict memory to 50 MB7 we'll see that it works:
$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 87.900000 seconds
Ok, memory consumption is fine, but performance is not that good. Recall that
mmaped has done this in 32 seconds. Let's see how we can improve our 90 seconds.
Let's profile this nice program with gprof. Build an instrumented binary
$ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof
And invoke this program multiple times accumulating statistics. To do so we'll use nice script from my article on gprof. Here is the output:
% cumulative self self total time seconds seconds calls Ts/call Ts/call name 81.98 432.87 432.87 compar 18.17 528.82 95.95 print_arr 0.00 528.82 0.00 1100 0.00 0.00 form_filename 0.00 528.82 0.00 100 0.00 0.00 merge 0.00 528.82 0.00 100 0.00 0.00 save_buf 0.00 528.82 0.00 10 0.00 0.00 external_merge_sort 0.00 528.82 0.00 10 0.00 0.00 split
Most of the time we have spent in sorting and printing. But also don't forget that gprof won't show you time spent in syscalls and I/O.
I can think of 2 things to improve here:
- Tune external sorting with multithreading and I/O tricks
- Think about a different sorting algorithm
So, generic external merge sort is a quite simple idea to sort a bunch of data in small RAM, but usually without tuning and improving it's slow.
One of the things that we can try is to use multiple threads on the split and merge phases of our external sort. However, in this case, it's not a really great idea.
Using multithreading on split phase doesn't make sense because there is a single buffer that can hold the data. But we may try to advise kernel on how we'll read data. There are 2 functions for that:
readahead, though it's Linux specific.
These functions will inform the memory management subsystem on how we'll read the data. In our case, we read it sequentially and thus it would be nice to see file content in the page cache.
On a merge phase, we could avoid constant
close of chunk files by creating a dedicated thread for each of the chunks. Each thread will keep its file open and will fill buffer at thread offset. When buffer is filled it will be sorted8 and output. Also, we will employ
readahead for each thread.
Here is tuned and multithreaded version of external merge sort.
OK, so as I said, multithreading here is not a great idea. All these threads things are nice and dandy, but for single core CPU it's not showing any effect:
$ ./mt-ext-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 117.380000 seconds
It is the version with printing. And here is the version without printing:
$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 91.040000 seconds
You may think that it's because we have hard times scheduling dozen of threads on a single CPU. Alright, let's compare it with other methods on the quad-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):
$ ./naive 500M.bin > /dev/null 500000000 bytes sorted in 23.040499 seconds $ ./mmaped 500M.bin > /dev/null 500000000 bytes sorted in 23.542076 seconds $ ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 39.228695 seconds $ ./mt-external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 41.062793 seconds $ ./external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 28.893745 seconds $ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 28.368976 seconds
Now with a 100 chunks (100 threads):
$ ./external-merge-no-output 500M.bin 5000000 > /dev/null 500000000 bytes sorted in 27.107728 seconds $ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null 500000000 bytes sorted in 28.558468 seconds
You can see no changes in performance between
mt-external-merge. Why is that? Because for most cases multithreading is not a solution for I/O bound applications. Still, there are situations when spawning some threads will speed up things:
- Your execution threads are independent
- Your I/O resource can work in parallel, e.g. it's RAID
Examples here are some graphics application that renders an image in independent areas or scientific application that makes some heavy calculations for multiple results.
In multithreaded external sort, threads are dependent -- you have to wait for the main thread to sort buffer before you can do a further reading from the chunk. Also, reading for a slice is done much faster than sorting the whole buffer, so most of the time threads are waiting for the main thread to finish.
That's why multithreading won't help us, so we need to look for other ways to improve.
Now, let's try to use some sorting algorithm that would perform better than QuickSort assuming we know the distribution of data. Like we know that we're sorting integers, so why not play around this? There are few sorting algorithms designed specifically for the special type of data can fall in either of 2 groups:
- Don't use compares.
- Don't even require to load input array in memory.
Such algorithms are known to work better than O(n log(n)) - a lower bound for comparison based sort like QuickSort. Of course, not every algorithm will work for us because we have memory restriction. For example, radix sort and other kinds of bucket sorting won't help us because it requires additional memory for buckets, though there are implementations of in-place Radix sort. Anyway, such implementations require reading whole data set multiple times for each radix -- 32 times for binary-radix sort and that's way too much. Also, deep recursion that arise from MSD radix sort is a memory consumption itself. That's why I came up to using counting sort.
If we know that our data are big, but its range is small, then we can use counting sort. The idea is dead simple -- instead of holding data in memory we will hold an array of counters. We'll read our data sequentially and then increment the relevant counter. The most important is that counting sort time complexity is linear and space complexity is proportional to a range that is usually small.
A simple implementation works with consecutive range of integers from 0 to some
N. There is an array size of range, where integers correspond to array indices. Here is my implementation on github performing really well without any tuning9:
$ ./counting-array 500M-range.bin 1000000 > /dev/null Range is 1000000 500000000 bytes sorted in 3.240000 seconds
Yep, half gig of data sorted in 3 and a half seconds on 128 MiB machine with single CPU. Compare it with
$ ./mmaped 500M-range.bin > /dev/null 500000000 bytes sorted in 76.150000 seconds
23 times faster!
But anyway, you should be aware of restrictions that counting sort implies: only integers (or equivalent) from small and consecutive range. I've also tried to develop counting sort for non-consecutive range with hash tables and binary search trees -- here is the code. However, its performance is pretty bad and, unfortunately, I still can't explain this.
Anyway, we can go even further and assume that numbers in our range are unique. Then, counter for value might be only in 2 states -- present or not, so we could use a single bit for the counter. This will enormously compact our array, although in that case, we don't even need an array. We could store each number as a single bit, thus transforming our data into a bit vector -- while reading a file, set Nth bit if there is an integer N in a file. After bit vector is formed, read it and write to output file numbers that correspond to bits that are set.
Bit vector solution requires even more attention because, despite its seem compactness, you might violate your restrictions, for example, to sort array of numbers from whole integers range (232) you would need a 1 bit for every integer, which is 4294967296 bits = 536870912 bytes = 512 MiB. In my case, I have only 128 MiB, so it won't work for me. But there are cases where you will benefit like a boss -- read a great story from "Programming Pearls" by Jon Bentley.
A moral here is "knowing your data is extremely useful".
For the last 5 months of writing this article I did a lot of things -- I've implemented a dozen of programs, come up with a few good ideas, failed with many more and still, I have things to try and/or fix. Now it's time for conclusions.
The simple problem of sorting data in small memory revealed a whole class of peculiarities we don't usually think of:
- Common widely used algorithms are not suitable for any problems.
- Dynamic debugging and profiling is extremely useful and demonstrative.
- I/O is a bitch unless you fully rely on the kernel.
- Multithreading is not a silver bullet for performance.
- Know your data, know your environment.
As for sorting here is a result table:
|Test case |Naive | mmap and | External | Multithreaded | Counting | | |QuickSort | QuickSort | merge sort | external merge sort | sort | |----------|:--------:|-----------|------------|---------------------|----------| |4 MiB in | Segfault | N/A | 0.38s | 0.41s | 0.01 | |2 MiB | | | | | | |----------|----------|-----------|------------|---------------------|----------| |500 MB in | Segfault | 32.25s | 87.14s | 91.04 | 3.24 | |128 MiB | | | | | |
The bottom line is "Know your data and develop a simple algorithm for it".
Thank you for your attention.
Sorting a million 32-bit integers in 2MB of RAM using Python - simple external sort in pythonic way with generators,
heapq. Also, has nice explanation of buffering advantages.
External sorting of large datasets -- External sort implementation by Andrew Wang with performance measurements.
Efficiently sorting datasets bigger than memory -- Nice Reddit thread regarding Andrew Wang's article showing some different methods along with pros and cons.
Cracking the Oyster (Column 1 of Programming Pearls) -- Amazingly written story on doing the large sorting. Discussed the bit vector idea in great details.
Multithreaded File I/O - How multithreaded I/O behaves on a single disk and RAID5 array.
Algorithm Improvement through Performance Measurement: Part 2presents in-place radix sort implementation.
Parallel sorting algorithms -- a comprehensive list of resources regarding sorting on multinode computer systems.
- We can't simply launch program under
timeutility because it will include time for reading file into memory and time for outputting to console. [return]
- libmemrestrict gets a configuration from environment. Common practice for libraries, for example
LD_LIBRARY_PATHare well-known environment variables for ld-linux.so, and there is also less known, but extremely useful
LD_DEBUGenvironment variable for linkage debugging. [return]
- It's strange because qsort is in-place and uses optimizations suggested by Robert Sedgewick to, among others, guarantee O(log n) space complexity. You can dive into glibc qsort implementation.[return]
- There is a strange behaviour though. For example, if I restrict memory for 5.3 MB it will work and not require O(n) memory. I'm still investigating on this. [return]
- Note that MiB is 220 and MB is 106 = 1 million. So 500 MB = 500 000 000 bytes which is 500 000 000 >> 20 = 476 MiB. [return]
- Plus extra 500 KB for temporary strings holding chunks paths. [return]
- In this way it's kind of N-way merge. [return]
- Second argument is a buffer size in elements. Buffering improves performance drastically because it doesn't read from file by 4 bytes. [return]