During my last post, I wanted to start writing about what Asami does, rather than dwelling on the structure of Mulgara, but in so doing I missed an important part of the structure: The Statement Store. Asami uses a similar statement store, so I think it deserves a little time.
This post will be full of heavy detail. It's a useful exercise for me to write this since it's helping me review how Mulgara handles storage, and also because Asami is very similar.
As described in the previous post, Mulgara uses 6 separate indexes to store quads. Each one of these indices starts with an AVL tree, where the payload in each node stores the following:
- A reference to a data block in another file.
- The count of statements in that block.
- The first statement in the block.
- The last statement in the block.
As you can see, each node in the tree is associated with a block in another file. That block contains statements as a series of 4x64-bit
long values, up to the number of triples in that block. There is a 1:1 correspondence of AVL Nodes to blocks.
The best way to describe this structure is to describe the operations on it.
Search is based on constraint patterns. For instance, a constraint pattern that describes which predicate connects a given subject
A and object of
B would be represented with a pattern like:
A ?predicate B
Given that this provides the subject and object, but not the predicate (nor the graph), then these statements could be found using the index ordered by:
The object comes first on this index, so the search becomes:
B A * *
Where the first
* represents the
?predicate value, and the second is the unused graph.
Starting at the root of the AVL Tree, this pattern is compared to the "first", or “low” statement for the node. If it's lower, then the search continues on the left child node. Otherwise, the pattern is compared to the "last", or “high” statement for the node, and if it's higher then the search continues on the right child node.
It is possible for a pattern to compare as being between blocks, in which case there are no matching statements. But usually, a node is reached where the pattern is higher than the "first" statement and lower than the "last" statement, so anything the pattern could match must be within the range of the associated block.
At this point, the block is loaded from its file and the search continues there. The statements appear as a linear array of numbers, with each 4th number being the start of the next statement. A binary search finds the appropriate statement quickly.
If an exact match is found for the pattern, then the location of the first statement that matches is returned in a 1 element array. If no match is found, then a pair of consecutive locations are returned in a 2 element array. This indicates the position that matching statements would appear between.
Mulgara also does a second search, this time looking for the last possible match of pattern. This will typically go down the same branches of the AVL tree, meaning that they are cached and relatively cheap to search.
These two search results provide start and end positions for an iterator that returns all statements that match the provided pattern. This iterator also adds to the reference count for the index in the current transaction. This is decremented when the iterator is closed, which is also done in the finalizer of the iterator object. This informs the free-list for the index so that it can detect when freed nodes are safe to use again.
All nodes and statement blocks that have been copied into a transaction will be available for recycling when any iterators that can reach them are closed. This is done by adding their IDs to their respective free lists.
The first step for insertion is a search for the AVL Node representing where the statement should go. If the statement would go between 2 nodes, then the node with the smaller node count is selected. If both are the same size, then it picks the lower node.
The first thing to check is if the AVL Node was allocated in the current transaction. If so, then it can be modified. But if not, then a copy of the AVL Node is made, updating the transaction ID. This operation iterates up the branch of the tree to the root, copying nodes into the current transaction, either until the root is reached, or until encountering a node that is already in the current transaction.
For the node is being updated, its associated block will get updated too, meaning that it gets copied into the current transaction, if it's not already there.
The next thing to check for is if the block is already full. If so, then it needs to be split in half (by allocating a new block and copying the top half of statements into it), and a new AVL Node created for the block with the upper statements. Both the original AVL Node and the new one have their counts, and their high and low statements updated. The node/block that is to be given the statement is then the one that gets processed.
The block being updated is searched (using binary search) to find the location for the statement. An "array move" operation is used to slide all the statements up one position, and the new statement is written into the space created. The AVL Node is then updated to increment the statement count, and possibly change the high or low statement, if appropriate.
Yes, I know... it's a lot.
The delete operation is very similar to insertion, with a couple of differences.
If the statement is found then both the AVL Node and block will be copied into the current transaction. This time, blocks can't overflow, so no splitting occurs. Instead, if a block is emptied, it will be deallocated, putting it on the free list. Similarly, the AVL Node can be removed, putting it on the free list for its file.
An empty node in an AVL tree is expected to be removed in a single write step, but there is a small chance that rebalancing will need to iterate towards the root. The further it iterates, the more write operations take place, but this is offset by the probability of needing to take the next step decreasing exponentially.
Frustratingly, at the time, all the references we could find on rebalancing AVL trees during a deletion were wrong, which we discovered when the trees got out of balance. Finally, two of us sat down and we figured it out from first principles. That was an interesting morning. 😊
Just like insertions, the block update is done using an array move operation. The exception is if it's the highest statement in the block. In that case, the AVL Node just decrements the statement count and updates the high statement to the next one down.
There are various tradeoffs made in each part of this structure.
Statement blocks are full of uncompressed statements (as a tuple of 4 numbers), as this makes them fast to search and read. Space is left in each block, allowing for fast insertions. However, once a block's transaction has finished, it will never be written to again. The space at the end of that block is forever inaccessible. This is done for the sake of speed.
AVL Trees are binary. This gives them log(N) complexity in search like any other tree, but from a practical standpoint this will require more read operations than a tree with a fanout of K>2. Typically, a fanout of K would result in logKN steps, which is the same as log2N/log2K. If the fanout were, say, 32, then this could be about 5 times fewer steps.
The use of blocks with 1024 statements provides a reduction in the number of steps for tree traversal than a larger fanout would give. In this case, log2(N)-log21024 or log2(N)-10. This subtraction is not as effective as a linear scaling factor, but for "small" numbers (N up to the billions), the number of I/O operations is similar. The benefit of using a binary fanout (K=2) is that we get to use the AVL structure, which permits insertions with O(1) complexity. So we get similar read characteristics to a B-Tree, but better insertion characteristics.
Mulgara would typically use memory-mapped access for the files full of AVL Nodes, and regular read/write access for the block files. This provided the full speed benefits of memory mapping for tree traversal while sparing the address space from the space requirements of the statement blocks. As we moved to 64-bit architectures though, we started using memory mapping more liberally with these larger files as well. This allowed the OS to cache them as appropriate, and schedule them for writing more efficiently than we, as developers, were able to.
Finally, it is important to note just how many files were in use for all of this. Each index required:
- A file for the AVL tree.
- A file for statement blocks.
- A free list for freed AVL nodes.
- A free list for freed statement blocks. And this was multiplied by the 6 different indexes. So even if everything was being done very efficiently, the system was forced to maintain 24 files while updating storage. This involved a lot of unnecessary seeking, regardless of how efficiently the OS reordered it.
Similarly, the String Pool that I described in the previous post required:
- A file for the AVL tree.
- A free list for the AVL tree.
- 20 block files for the various sizes of data.
- 20 free lists, one for each block file.
While many of these files were not accessed for many operations, there was still a lot happening.
Of course, this does not cover the majority of what Mulgara did, but it does describe the principal structures, especially those that are relevant to Asami. Hopefully, I've provided sufficient background to explain the decisions taken in Asami, which I will get to in my next post.