Mapping the Way

Part 6 in a series on me trying to sort through a database design. Part 5 is here, and the beginning is here.

Buffer Limits

Java Buffers have a fundamental 2GB limit on their size. This occurs because buffers were introduced when Java and the hardware and operating systems that it ran on were mostly 32 bit systems. The int type is a 32-bit value and it was used for buffer sizes, and for addresses within the buffer. A 2GB buffer took up the entire addressable memory of a Java process on a 32-bit system, so even if a computer had more RAM than this, it still wasn't available to a single process. (32 bits can address 4GB, but the int type is signed and positive numbers only go to 2GB. This limit was beneficial, since the operating system needs to be mapped into the address space, as do the loaded libraries, the Java runtime, the stack, and any existing heap).

When 64 bit systems started to become the norm in the mid 2000s, Java processes were able to use far larger buffers, but the buffer API had already been fixed at 32 bit addressing. That said, it was now possible to build 2GB buffers without the concern of using up all the addressable space for a process. Indeed, if you had the RAM and/or swap space, you could create an array of 2GB buffers in your process. Despite having the ability to allocate more than 2GB, unless you had a need for processing large blocks of information in memory (such as high-resolution images or video) this you would not want to allocate so much at once.

However, this opened up a new possibility for memory-mapped files.

Memory Mapped Files

Back in 2001, the JDK 1.4 Beta was released with the "New I/O" packages. This introduced a feature to Java that had previously only existed in the host operating systems.

I'm going to do a hand-wavy description of read/write operations vs memory mapped files. I'm writing this after midnight, so please be nice if I get anything a bit wrong. Some things have changed over the decades, but the basics are still there.

In binary systems like C, normal file I/O is done by calling the C library function write() on a block of memory. The C library issues a system call to write that memory to disk*. The kernel code that implements the write operation copies the contents of this buffer into memory that belongs to the operating system (kernel memory) and this gets queued up to go to the disk controller. Then the kernel code returns with the status (which is hopefully a "success" value).

* Years ago, I always wanted to know what making a system call actually was. It involves putting appropriate parameters for the call into CPU registers (sometimes numbers, but often a memory address for data), setting the number of the appropriate call in one of the registers, and saving the current instruction pointer and stack pointer. It then issues an instruction that tells the CPU to switch to a privileged mode and execute kernel code. Once upon a time, this was done by issuing a software interrupt. Then with the i586 the `sysenter/sysexit` operations were used. Shortly after that, `syscall` was introduced to replace `sysenter`. But it's all about making the CPU change mode and execute code at a fixed location. Once this new mode has started, the CPU will only run code that belongs in the OS kernel, and it can access hardware directly. Upon finishing, it will reload the saved address and continue from where the call was made. Important to note is that there is a cost to switching between modes like this. Things have changed since I last worked at this level, so I'm glad I looked it up. I knew these principles, but it is interesting to see the changes that have occurred in recent years!

Performing a read is similar. Memory is set aside, and the address is provided in the call to the read() function in the C library. The library makes the system call, which tells the disk controller what data to get. This comes back in kernel memory, and the kernel then has to copy this into the buffer provided by the calling program.

Fortunately, these details aren't usually something developers need to think about. But there is something relevant going on here. When reading or writing, the kernel has to copy memory between the process's address space and the kernel's address space. This takes time.

When the kernel needs to perform read or write operations, it will usually use a different mechanism to normal read/write. Instead, the kernel will map a file into memory. This is done with code that does the same thing that swap space does. When a file is mapped in, then an address range is used to represent the contents of the file. Read from a given address, and you are actually reading from the file at that location. Write to an address and you will write to the file.

When reading from an address, the CPU will check to see if that address matches memory that is currently active. If so, then it will read directly from the appropriate memory (there is hardware support to get to the associated memory). If not, then the kernel will find the appropriate part of the file, find memory that is vacant, or else vacate some memory that hasn't been used for a bit, and then load a portion of the file into the associated memory address. These portions are called pages and on most desktops these are 4kb. So if the data is already in memory, you get it instantly, and there is no copying like there was in the normal read operation. If it's not resident in memory, you have to read from the disk, as usual, but then you have the address it was read into, and again, there is no copying.

Writing is similar. When you write to memory that is mapped, the page gets marked as dirty and is scheduled for writing to disk. If you write to it again, then it is still dirty and still scheduled, and you've just modified some memory. It will eventually get written to disk: if enough time has elapsed; if the application explicitly requests an operation to force it to disk; or if the OS needs the space for another page to be read into memory.

In summary, read/write operations:

are done mostly synchronously (though writes can be delayed by the OS, and reads can come from caches).
always requires a system call.
usually involve disk access.
always require a memory copy.

Accessing data in a mapped file:

does not require a system call.
does not need a memory copy.
often reads from cached data.
always writes to a cache.

... where "cache" is the page to be read from or written to.

The principle problems with memory mapping files are:

They take up address space (this was an issue for 32 bit systems).
They fix the size of the file.

That last one is a doozy if you want to do lots of writing. However, files are mapped by region, so there is nothing wrong with mapping the parts of the file that you need and continuing to append to the file. You can then map some of that new region when you need it.

Mapped Byte Buffers

Java's NIO uses buffers to represent memory-mapped files. This is handy for us, since we have layered our data structures over buffers already!

So how might this look? Well instead of explicit read and write operations, we set up the buffer to be a file mapping in the constructor. To give a bit of flexibility, we can set up some new rules:

If the file is larger than the requested number of Elements, map the entire file anyway (so we can get more elements than we asked for).
If the file is smaller than the request number of Elements, then expand the file to be large enough. The next available element will be one beyond the end of where the file used to stop.
When closing the file, truncate it to whatever was used. This is wherever the nextAvailable position is.

public BufferList(String filename, int length) throws IOException {
  long fileLength = ELEMENT_SIZE * length;
  File ioFile = new File(filename);
  boolean fileExists = ioFile.exists();
  file = new RandomAccessFile(filename, "rw");
  if (fileExists) {
    nextAvailable = (int)(file.length() / ELEMENT_SIZE);
    // check if the file is longer than what was requested, or shorter
    if (file.length() > fileLength) {
      fileLength = file.length();
    } else if (fileLength > file.length()) {
      file.setLength(fileLength);
    }
  } else {
    // brand new file
    nextAvailable = 0;
  }
  fileChannel = file.getChannel();
  buffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, fileLength);
}

public void close() throws IOException {
  fileChannel.truncate(nextAvailable * ELEMENT_SIZE);
  fileChannel.force(true);
  file.close();
}

Otherwise, the read and write operations are removed, and nothing else changes!

Using the New BufferList

How do we create the BufferList with the same elements as our last example? Almost the same way:

public static void main(String[] args) throws IOException {
  BufferList bufferList = new BufferList("buffer.bin", 20);
  BufferList.Element list1, list2, list3;
  list1 = bufferList.addToHead(34);
  // ... list construction goes here ...
  list3 = bufferList.addToHead(list3, 3);
  System.out.println(list1);
  System.out.println(list2);
  System.out.println(list3);
  writeLists(list1, list2, list3);
  bufferList.close();
}

The only change is the call to the constructor, and now we're closing the BufferList.

How does it run? The same as the previous version:

$ rm -f buffer.bin
$ java mapped.MultiExample
1, 1, 2, 3, 5, 8, 13, 21, 34
5, 6, 7, 8, 13, 21, 34
3, 1, 4, 1, 5, 9, 2, 6

Don't run it without deleting the buffer.bin file first. We've told it not to expand, and we've limited it to 20 elements, so it will complain if it is given an already full file and it tries to allocate new Elements.

Reading the data again is also the same:

public static void main(String[] args) throws IOException {
  BufferList bufferList = new BufferList("buffer.bin", 20);
  for (BufferList.Element list: readLists(bufferList)) {
    System.out.println(list);
  }
}

Running it shows the expected data:

$ java mapped.MultiExample2
1, 1, 2, 3, 5, 8, 13, 21, 34
5, 6, 7, 8, 13, 21, 34
3, 1, 4, 1, 5, 9, 2, 6

Because the format is identical, we can memory map the file creates by the previous buffered version, and print it successfully:

$ java buffer.MultiExample    # Creates file using write()
1, 1, 2, 3, 5, 8, 13, 21, 34
5, 6, 7, 8, 13, 21, 34
3, 1, 4, 1, 5, 9, 2, 6
$ java mapped.MultiExample2   # Reads file using memory mapping
1, 1, 2, 3, 5, 8, 13, 21, 34
5, 6, 7, 8, 13, 21, 34
3, 1, 4, 1, 5, 9, 2, 6

The full sources can be found on Github.

Caveat

While this technique works for files up to 2GB in size, that's as much as a buffer can address. Unlike allocated buffers, memory-mapped files can use very large buffers since the operating system only allocates memory for the part of the file you are currently using. Because of this, files larger than 2GB can be mapped using an array of MappedByteBuffers. Getting an element is managed by the constructor finding the correct map for the element index, and taking a slice from that. The rest of the implementation stays unchanged.

Recap

We've taken our existing buffer-backed data structures and memory-mapped the buffers to files. This is faster and keeps the files tracking the live data as it's changing in the program. The final program is nearly the same but does not use read/write operations for the buffer at all.

(Although, we are still using read/write for the metadata. The metadata can also be mapped, if we want).

Next, we will look at a different kind of data structure: a binary tree.