Paula Gearon

Posted on Sep 10, 2019 • Edited on Apr 26, 2020

Save the Trees

Part 8, with part 7 back here, or go back to the beginning.

Revisiting Mapped Files

Now that we've seen the structural differences in tree nodes vs. linked list elements, we have what we need to update the BufferList class to handle trees instead of lists.

The first thing to consider is the layout of each tree node on disk:

public static int NODE_SIZE = 3 * Integer.BYTES;
private static int VALUE_OFFSET = 0;
private static int LEFT_OFFSET = 1;
private static int RIGHT_OFFSET = 2;

Note how the size of the node is in bytes, but the offsets are all given as integer offsets. If not for the IntBuffer view of the data, all the offsets would be in bytes, and our code would have to use these offsets, and then read the appropriate number of bytes before gluing them together into an Integer value. Fortunately, the IntBuffer class already does all of that for us.

With the linked list file we needed to store some metadata for the file, and put that in a separate file. This time we can do something a little different. The metadata that we want to store is the offset of the next node that can be allocated, and the offset of the node that forms the root of the tree. This is just 2 integers, which makes it possible to fit inside the space typically allocated for a Node. If we reserve the location of the first Node at offset 0 then we can use it to store this metadata instead. In this case, it means that we will not need a second file, which is convenient, though for more complex structures we will probably still want that metafile.

Using 0 as an invalid address mirrors the way the CPUs and Operating Systems usually work. The 0 address represents an invalid part of memory, meaning that every reference to it indicates invalid or uninitialized memory. Typically all of the first few physical addresses indicate an error, though the CPU will use them for its own purposes, such as holding addresses for code to run in the event of a Hardware Interrupt on older systems. So storing our metadata at the 0 offset has an analogy that I like.

A second benefit to this approach is that the NULL reference will now be 0 instead of -1. This is useful because Java ensures that all fresh buffers are initialized to the 0, which means that when a new Node is allocated, the children will already be set to NULL. (C and C++ do not do this unless you use calloc to allocate memory).

Now that we know that the first Node space will hold metadata, we have an expanded set of constants to work with:

public static int NODE_SIZE = 3 * Integer.BYTES;
public static int META_SIZE = 2 * Integer.BYTES;
public static int NULL = 0;

// META offsets
private static int NEXT_AVAILABLE_OFFSET = 0;
private static int ROOT_OFFSET = 1;

// Node offsets
private static int VALUE_OFFSET = 0;
private static int LEFT_OFFSET = 1;
private static int RIGHT_OFFSET = 2;

Constructing

With the constants in place, the data necessary for the tree can be declared. That includes the file, along with the channel so that it can be mapped into a buffer. We want the ByteBuffer representing all the data in the file, as this is the basis for overlaying Nodes onto the file. Finally, we want a buffer reference to access the metadata and a Node that represents the root of the tree.

private final RandomAccessFile file;                                                                                    
private final FileChannel fileChannel;
private ByteBuffer buffer;
private IntBuffer metaBuffer;
private Node root;

The constructor will do a similar job to what it did for the BufferList class. The file is opened, and if it is not long enough, it gets expanded, or if it's too long it gets truncated.

long fileLength = NODE_SIZE * length;
File ioFile = new File(filename);
boolean exists = ioFile.exists();
file = new RandomAccessFile(filename, "rw");
if (exists && file.length() > fileLength) {
  fileLength = file.length();
} else if (fileLength > file.length()) {
  file.setLength(fileLength);
}

Once the file is the expected size, it can be mapped. Those first few bytes of the mapped buffer are then captured as a buffer for the metadata.

fileChannel = file.getChannel();
buffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, fileLength);
metaBuffer = buffer.limit(META_SIZE).position(0).slice().asIntBuffer();

Finally, some data initialization can happen. If the file did not exist before, then the first position available to put a node in the offset for Node 1. Otherwise, that position will be read from the metadata. Also, the root of the tree will be null if the file is new, or else once the offset has been read from the metadata the root node can be constructed:

if (!exists) {
  setNextAvailable(1);
}
int rootIndex = metaBuffer.get(ROOT_OFFSET);
root = rootIndex == NULL ? null : new Node(rootIndex);

The setNextAvailable method is just updating the metadata

private void setNextAvailable(int next) {
  metaBuffer.put(NEXT_AVAILABLE_OFFSET, next);
}

Calling close() on the tree will just close of these files, as with the BufferList. We needn't include it here, but it will be in the final file that I'll link at the end.

Growing a Tree

Adding to the tree follows a very similar structure to the Javascript in the previous post.

public Node add(int value) {
  Node node = new Node(getAndIncNextAvailable(), value);
  if (root == null) {
    setRoot(node);
  } else {
    insertNode(root, node);
  }
  return root;
}

The main different here is the Node constructor which is taking the node ID for the next free node. Note that this is retrieved with a function that gets the ID and increments it:

private int getAndIncNextAvailable() {
  int next = metaBuffer.get(NEXT_AVAILABLE_OFFSET);
  if ((next + 1) * NODE_SIZE > buffer.capacity()) {
    throw new RuntimeException("Out of capacity");
  }
  metaBuffer.put(NEXT_AVAILABLE_OFFSET, next + 1);
  return next;
}

This just reads the next available value from metadata, increments the stored value, and returns the read value.

As with the BufferList class, if the entire buffer is full, then an exception will be thrown. In production, we would want to deal with this situation by expanding the file, but that's beyond what we're looking at for now. While considering production code, it is also worth considering that more than one thread could be calling this code at once. If this is the case, then this function should be protected with a synchronized keyword, to avoid a possible race condition for the next available value.

The insertNode method will be very similar to the insertNode method in for the Javascript tree.

private void insertNode(Node node, Node newNode) {
  int value = node.getValue();
  if (newNode.getValue() < value) {
    Node left = node.getLeft();
    if (left == null) {
      node.setLeft(newNode);
    } else {
      insertNode(left, newNode);
    }
  } else {
    Node right = node.getRight();
    if (right == null) {
      node.setRight(newNode);
    } else {
      insertNode(right, newNode);
    }
  }
}

The only differences from the original Javascript are that the this identifier is optional for methods on the current object, and because the left and right values are accessed with getter methods then we are only calling the method once and holding onto the returned value.

Getter methods are required in general with these objects, as they are wrapping operations that retrieve objects from the buffer, while the Javascript code is able to just follow memory references. We see this also in the methods for accessing the root of the tree, although because this is metadata that gets read on initialization, the getter does not need to touch the buffer:

private void setRoot(Node node) {
  root = node;
  metaBuffer.put(ROOT_OFFSET, node.getIndex());
}

public Node getRoot() {
  return root;
}

Printing

As before, printing is a recursive process that first prints the left branch, then prints the node value, and finally prints the right branch, with various if or ternary statements in place to make sure an appropriate number of commas are added:

public String toString() {
  return root.toString();
}

public static String treeString(Node element) {
  Node left = element.getLeft();
  Node right = element.getRight();
  return (left == null ? "" : treeString(left)) +
         Integer.toString(element.getValue()) +
         (right == null ? "" : treeString(right));
}

Nodes

As with the BufferList elements, nodes overlay a buffer. The first step to creating a node is to slice out that part of the buffer that the node is wrapping, and represents it as Integer values. If the node is initialized with just a buffer reference, it can be used to read what was already there, and if it is initialized with a value, it will write that value to the buffer.

private final IntBuffer intBuffer;
private final int index;

Node(int index) {
  this.index = index;
  int offset = index * NODE_SIZE;
  if (offset > buffer.limit()) {
    intBuffer = buffer.limit(offset + NODE_SIZE).position(offset).slice().asIntBuffer();
  } else {
    intBuffer = buffer.position(offset).limit(offset + NODE_SIZE).slice().asIntBuffer();
  }
}

Node(int index, int value) {
  this(index);
  setValue(value);
  setLeft(null);
  setRight(null);
}

The index is read-only, but the other values are all accessible via getters and setters, which all wrap buffer operations:

public int getValue() {
  return intBuffer.get(VALUE_OFFSET);
}

public Node getLeft() {
  int leftId = intBuffer.get(LEFT_OFFSET);
  return leftId == NULL ? null : new Node(leftId);
}

public Node getRight() {
  int rightId = intBuffer.get(RIGHT_OFFSET);
  return rightId == NULL ? null : new Node(rightId);
}

public Node setValue(int value) {
  intBuffer.put(VALUE_OFFSET, value);
  return this;
}

public Node setLeft(Node left) {
  intBuffer.put(LEFT_OFFSET, left == null ? NULL : left.getIndex());
  return this;
}

public Node setRight(Node right) {
  intBuffer.put(RIGHT_OFFSET, right == null ? NULL : right.getIndex());
  return this;
}

The complete code for BufferTree is here.

Using the Tree

Creating this tree and adding the digits of π is straightforward:

BufferTree bufferTree = new BufferTree("tree.bin", 25);
bufferTree.add(3);
bufferTree.add(1);
bufferTree.add(4);
bufferTree.add(1);
bufferTree.add(5);
bufferTree.add(9);
bufferTree.add(2);
bufferTree.add(6);
bufferTree.add(5);
bufferTree.add(3);

Alternatively, the add operations could be chained:
bufferTree.add(3).add(1).add(4).add(1).add(5).add(9).add(2).add(6).add(5).add(3);
But I didn't do it that way, because most use cases for populating a tree are when the data comes from external source, and not hard coded like this.

Printing the tree shows that all the data has been sorted:

System.out.println(bufferTree);

$ java tree.TreeExample
1, 1, 2, 3, 3, 4, 5, 5, 6, 9

What happens if the same program is run again, loading the data a second time?

$ java tree.TreeExample
1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 9, 9

That's all of the original data repeated and in order.

How about reading the data back? That's exactly the same, without the loading step:

public class TreeExample2 {
  public static void main(String[] args) throws IOException {
    BufferTree bufferTree = new BufferTree("tree.bin", 25);
    System.out.println(bufferTree);
    bufferTree.close();
  }
}

Resulting in:

$ java tree.TreeExample2
1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 9, 9

These example programs are also found at the bottom of the gist for the BufferTree.

Regrowth

This post was just a reiteration of the tree structure described in the previous post, combined with the buffer wrapping approach described in Part 6. This creates a binary tree on disk that can be re-read with ease.

We are now referring to the 0 offset in the buffer as an invalid address, which allows us to use the initial values in the buffer (which are always 0) to indicate null entries. This also allowed us to shift some of the metadata for the tree into the head of the file, though this will have limited utility as metadata requirements expand in later posts.

Given that most databases use some variant of B-Tree, this discussion of binary trees may seem like an unusual focus, but it will all become clear eventually. (I promise!)

But before then, we should look at how to balance these trees.

DEV Community