In 2001, due to a need to store a lot of metadata and deciding that RDF was the way to do it, I was part of a team that designed and built the Mulgara graph database. It was originally a commercial product called the Tucana Knowledge Store (TKS) and then mostly open-sourced as Kowari, but we renamed it to Mulgara after the original codebase was acquired by Northrup Grumman.
After maintaining Mulgara for years, I eventually put it aside. This was not because I did not think it was a useful tool, but because it was too big to work on by myself in my spare time. This was exacerbated by being written in Java, which was both slower to develop in, and less fun.
In the years since, I've been thinking of the lessons that I learned while working on Mulgara, and how things could be done better. I have also been exposed to a number of concepts and technologies that were not well known or did not exist 20 years ago, and these have informed my ideas on how such a database could be built.
In recent years, I have had the chance to express a few of these ideas in the Asami database, though I still have a long way to go.
In the meantime, people have been asking me about the design of Asami, so I thought it might be useful to document some of the history and my plans for the future here. I apologize in advance for what may appear to be random musings.
This post will focus on the structure of Mulgara, and some of the reasons for those decisions.
When we were first trying to store large numbers of graph edges as triples, we discovered that it was extremely difficult to store them in a way that was searchable and could scale to the hundreds of GB we needed. Several of the developers tried different approaches, with varying degrees of success (my own did not scale well at all). But then the youngest and least experienced person in the office did a from-scratch implementation of an AVL Tree that was orders of magnitude faster than everyone else's attempts. This decided our course of action.
It also taught me that the youngest and least experienced person on a team can make contributions far beyond what the more senior people may expect. They have ideas, energy, and a lack of preconception which is extremely valuable and should never be overlooked. This has been a recurring theme in my career. I can't stress enough how important it is to have junior members on a team.
The company was primarily a Java shop, and at the time of all this, the JDK 1.4 Beta had just been released. This introduced the New IO package, which included memory-mapped files, which we were keen to try out. We convinced our engineering manager that using a Beta did not pose a significant risk, since 1.4 would be out of Beta by the time we released.
Unfortunately, the Pentium 4 systems we were developing on were based on a 32-bit architecture, which limited the address space of an application to 4GB, or around 2GB for file mappings (this depended on the OS, where shared libraries were mapped, etc). With this in mind, we worked to abstract our file I/O to use either memory-mapping or standard read-write operations, allowing the user to select which one to use at startup.
Just a few months before, a few of us had been impressed at an idea posted by Daniel Phillips on the Linux Kernel Mailing List for the TUX2 filesystem. This described "Phase Trees", which was my first exposure to functional data structures, though I did not know this at the time. The resilience and simplicity of this idea appealed to us, and one of my colleagues suggested that we should incorporate this into our AVL tree design.
The most important element of the AVL trees that we built was the tree Node. These are a constant sized buffer, containing representations for:
- A left child (possibly empty)
- A right child (possibly empty)
- A balance (-1, 0, or 1)
- A transaction ID (assigned when the node is created and immutable).
- A payload.
The payload was entirely dependent on the data being stored in the tree. It could contain everything known about the data, or it might contain partial information, along with a pointer to find the remaining data. The most important thing to know is that all AVL nodes were exactly the same size.
Note that the AVL nodes were all associated with a transaction. This was used for tracking which nodes were under active use. Modifications to the tree were grouped into a transaction which had an incrementing ID. Whenever a node needed to be modified, then this ID could be checked against the current transaction, and if it differed, then a new node was created, copying the existing node's data into it. That old node was then added to a list of nodes that were no longer in use, called the Free List.
The Free List was used to re-allocate nodes that were no longer in use. This was important for space usage, particularly when data was being removed as well as added. Back in the early 2000s, our hard drives were smaller (and slower) than today, so we were concerned with space, and reclaiming unused nodes was important. This stood in contrast to our mantra at the time of "Disk is cheap", but minimizing wastage was still a deeply ingrained habit from the 90s.
When any operations that read from the tree were in progress, they would lock a window of the free list, ensuring that any "freed" nodes in there could not be reused until those operations were closed off. This worked well and was reasonably efficient, but it did lead to extra work. It was also (IMO) the most complex part of the codebase.
Mulgara initially stored all of its data statements as triples, representing a tuple of
[subject predicate object]. It did not take long to realize that we wanted to group triples into identifiable graphs (which could then have information such as access and authentication properties), so we extended statements to quads of
[subject predicate object graph].
For statements to be easily indexed and retrieved, we wanted them to be the same size. This meant mapping each element of a statement into a number and then providing a way to map those numbers back to the original data. Given that RDF was based on URIs (actually, URI References in RDF 1.0, and IRIs in RDF 1.1) and Literals with a string representation, this meant mapping strings of variable length to numbers, and back. The module that did this was called the String Pool.
We built the String Pool with 2 indices and a whole lot of other files. The first index used a tree to provide searching for strings by regular comparison operations. The nodes of the tree held the beginning of the string, a pointer to another file to find the full string record, and finally, a number that represented the string (called a gNode). The second index was a series of regular-sized records where the offset of the record was the gNode multiplied by the record size. This record then contained the location of the full string record.
Finding the gNode for a string meant searching through the tree (possibly looking for the full string in the other file when the beginning of the string matched what was being looked for), and the tree node that was found contained the gNode in its payload. Searching for the string associated with a gNode meant going to an offset in the record file that was determined by:
gNode * size-of-record. This record contained the location of the string.
Strings themselves were also stored in files containing blocks of regular size. Everything up to 16 bytes long went into a file of blocks that were 16 bytes in length. Everything larger than 16, and up to 32 bytes went into a file that contained blocks up to 32 bytes. This continued with blocks doubling in size up to our maximum data size of 8MB. This meant that we had 20 different files, and each file had approximately 25% of unused space. This came from the fact that the data in a block was guaranteed to fill 50% to 100% of the block.
The Statement store initially used indices to sort and store triples in the following orders:
- Subject, Predicate, Object
- Predicate, Object, Subject
- Object, Subject Predicate
There are 6 possible orderings (calculated as
3! or 3 factorial). It turns out that the 3 orderings shown here allow someone to find all matching statements for any given group of subject, predicate, and object. Interestingly, the remaining 3 possible orderings also permit this. The two groups of orderings are distinct, in that any group that tried to contain ordering from more than one group would need more than 3 indices to be able to fulfill every possible query.
Once we moved to quads, things became more complex. Now there were
4! possible indices. We spent an afternoon mapping out every possibility and discovered that there are 4 groups of 6 indices that could be used to fulfill every possible query. One of these groups matched our existing indices quite closely:
- Subject, Predicate, Object, Graph
- Predicate, Object, Subject, Graph
- Object, Subject Predicate, Graph
- Graph, Subject, Predicate, Object
- Graph, Predicate, Object, Subject
- Graph, Object, Subject Predicate
A few years later I looked at this more closely and realized that it is related to N-dimensional hypercubes, where the number of groups is represented by opposing faces in the N-dimensional space. e.g. in 3D space, opposing faces of a cube appear in pairs, and there are 3 such pairs. 3 faces of the cube can be used to represent all of the space, with each opposing face providing equivalent information.
While Mulgara was very successful, there were several areas where I felt it could be improved. I did start on some of these in 2008, but then I moved back to working on rule engines (which Mulgara also has) in 2009. This also led to me working full time in Clojure, and eventually Datomic, and each of these eventually led me towards the design that Asami took on.
I is still more to describe how the statement indexes worked, and I will discuss that next.