DEV Community

Asami

Paula Gearon
Just a girl, standing before a compiler, asking it to love her
7 min read

Motivation

For a long time, I had been thinking that I would like to redesign Mulgara storage to overcome some of the deficiencies that had become clear over time. I was also thinking that I would like to use a more Functional approach, to make the code more tractable and reliable. But this would still take a lot of work that I wasn't sure I wanted to apply.

In 2008 I came up with some new index designs and had started implementing these for Mulgara. They were showing some promise, but still had a long way to go. However, a change of job soon had me ignoring storage and looking at rules systems again.

Rules

My original implementation of a Rules engine was in Mulgara, and hence was built in Java. My second was on a SPARQL store (I tried a few, but mostly was using Jena Fuseki). This was at the company Revelytix, which gave me an opportunity to use Clojure for the first time. I'd learned Scheme through SICP some years before, I'd adopted a lot of Functional Programming principles, and I'd learned Haskell and Scala, but Clojure was the first time I felt I was really productive with these concepts.

In 2013 I started re-implementing some of Mulgara's memory-mapping approach using Clojure and got the basic idea of blocks working. This was important for abstracting disk access, particularly as it allows memory-mapping files beyond the 2GB limit that Java imposes due to backward compatibility with 32-bit architectures. Once this was done I didn't take it further at the time.

It was while using Clojure with OrientDB, Datomic and Neo4J, that I started getting a feel for graph databases that weren't based on RDF and SPARQL. They are all very similar in many ways, and it occurred to me that I could build a graph rules engine that was abstracted from the database, only requiring an adapter for each type. With this in mind, I started building a new rules engine, with an initial adapter for Datomic.

I was pleased with how well this worked and showed my then-boss, Craig, while we were waiting in an airport together. He was very interested in it and asked me if I would be prepared to develop it for Cisco. I agreed, so long as I could keep it open source, which Craig was pleased to do. Thus, Naga was born.

Craig did have one request though. He did not want to tie this to a commercial database, so would I be able to build something of my own? At this point, neither of us was aware that DataScript had been started, or else I would have opted to use that.

My first thought was just how big a project Mulgara had become, but then I realized that the only features I needed were:

  • Index lookups by triple-pattern.
  • Counting the size of the results of index lookups.
  • Inner join operations.

I didn't even need to do this in storage. I could handle all of this in memory, and if I wanted to scale up, then that's when I could move to an existing database. So I agreed.

The first version worked well and was the basis for the talk I gave on Naga at Clojure/conj 2016. The code for this was also small:

  • The index was a simple nested-map structure. Rather than storing triples as a flat sequence, this allows duplicated values in the leading columns to be stored only once. Also, the values appeared directly in the statements, and were not mapped to numbers.
  • The join operation was a relatively easy nested loop, using techniques borrowed from Mulgara.
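As an illustration of the nested-map idea (a sketch of my own, not the actual asami.index code), an index from subject to predicate to a set of objects can be built like this:

```clojure
;; A triple index as a nested map: subject -> predicate -> set of objects.
;; Sketch only -- the real structure lives in asami.index.
(defn add-triple [index [s p o]]
  (update-in index [s p] (fnil conj #{}) o))

(def index
  (reduce add-triple {}
          [[:a :first-name "Fitzwilliam"]
           [:a :last-name  "Darcy"]
           [:a :friend     :c]]))

;; The duplicated subject :a is stored only once:
;; {:a {:first-name #{"Fitzwilliam"}, :last-name #{"Darcy"}, :friend #{:c}}}
```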

I also realized that since queries were being machine-generated, then there should be a query planner to ensure that they did not get executed in an inefficient order. This appeared in the same namespace as the rest of the query processing.
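The core idea behind such a planner can be sketched as greedily ordering the patterns by estimated result size, so the most selective ones execute first. Here count-estimate is a hypothetical stand-in for real statistics, and the real planner is more sophisticated than a single sort:

```clojure
;; Greedy planning: run the most selective constraint patterns first.
;; count-estimate is hypothetical -- any function from a pattern to an
;; estimated number of matching statements will do.
(defn plan [count-estimate patterns]
  (sort-by count-estimate patterns))

;; Using a map of precomputed estimates as the function:
(plan {'[?person :first-name ?name] 3
       '[?person :friend :c]        1}
      ['[?person :first-name ?name]
       '[?person :friend :c]])
;; => ([?person :friend :c] [?person :first-name ?name])
```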

Asami

Shortly after I presented Naga in Austin, Craig started asking for new features. The first one was to make it compatible with ClojureScript so that we could use it directly in a web page instead of through a service. I had only experimented with ClojureScript, so this was an opportunity to use it in production.

Then I was asked for more features, both in rules and in querying the resulting data, which meant implementing a much more capable query engine, and query planner. As the code started to expand, I realized that the database had taken on a life of its own, and so I split it into its own project. In keeping with our in-joke of a naming scheme based on the Avatar and Korra cartoons, I named the new database "Asami".

I liked this name, since the character of Asami doesn't have powers like the other characters do, so she builds her own devices to make up for it. 😊

Querying

The first step was to provide an API for querying Asami with something that wasn't as low-level as the original approach. I copied the Datomic syntax for this, as it was compatible with how I was representing queries already, and it did not require parsing.

This made it easier to introduce syntax for new queries and then build the implementation. Some of the new features here are:

  • Negations, via the not operator.
  • Disjunctions, via the or operator.
  • Nested Conjunctions, via the and operator.
  • Filter expressions.
  • Binding expressions.
  • Passing data into a query to bind to a value.
  • Aggregates, via the functions: count, sum, avg, first, last, max, min.
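Putting a few of these together, a query in this Datomic-style syntax might look like the following. This is a sketch of the syntax only; the names and this exact combination of features are my illustration, not taken from Asami's test suite:

```clojure
;; A Datomic-style query combining a negation and an aggregate:
;; count the friends of each person, skipping anyone named Bingley.
(def query
  '[:find ?name (count ?friend)
    :where [?person :first-name ?name]
           [?person :friend ?friend]
           (not [?friend :last-name "Bingley"])])
```

Because the query is plain Clojure data, no parsing step is needed, which is exactly why this syntax was easy to adopt.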

Most of these operations are performed in a similar way to Mulgara, though I've reimplemented everything from first principles.

An important point to note here is that most (not all!) of the files that implement the above in Mulgara total 11,251 lines of Java. Asami does it in 783 lines of Clojure.

This illustrates a few things:

  • Clojure allows for symbolic manipulation far more easily than Java.
  • Java manages data structures through individual classes, which takes a lot of boilerplate code. In Clojure, it is more natural to use the built-in data structures, which simplifies things greatly.
  • I could never have attempted this as an individual without a functional language with the facilities that Clojure offers.

Storage

Mulgara executes queries by requesting statements that match statement patterns and then performing various operations on each of these results.

The statement patterns, or constraints, are typically in the form of a triple (with a fourth value implied from context, or simply not needed). They contain ground values (such as URIs or strings) and variables. We refer to finding the statements that match a pattern as resolving the pattern. A pattern resolution binds the variables in the pattern to the parts of the statements that match it.

For instance, in the following statements:

:a :first-name "Fitzwilliam"
:a :last-name "Darcy"
:a :friend :c
:b :first-name "Elizabeth"
:b :last-name "Bennet"
:c :first-name "Charles"
:c :last-name "Bingley"

Then the constraint pattern: [?person :first-name ?name]
would resolve to the statements:

:a :first-name "Fitzwilliam"
:b :first-name "Elizabeth"
:c :first-name "Charles"

The bindings pair the variables ?person and ?name with the matching values:

{?person = :a, ?name = "Fitzwilliam"}
{?person = :b, ?name = "Elizabeth"}
{?person = :c, ?name = "Charles"}

Queries are answered by resolving constraint patterns and combining the results according to the operations in the query. I presented a talk describing the operation of this at ClojureD in 2019.
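The resolution step for the example above can be sketched against a nested subject→predicate→objects index. This is illustrative code of my own, not Asami's implementation: iterate the subjects, look up the ground :first-name predicate, and bind a variable for each matching object.

```clojure
;; Resolve [?person :first-name ?name] against a nested-map index.
;; Illustrative only.
(def index
  {:a {:first-name #{"Fitzwilliam"} :last-name #{"Darcy"} :friend #{:c}}
   :b {:first-name #{"Elizabeth"}   :last-name #{"Bennet"}}
   :c {:first-name #{"Charles"}     :last-name #{"Bingley"}}})

(defn resolve-first-name [index]
  (for [[subject properties] index
        object (get properties :first-name)]
    {'?person subject, '?name object}))

(resolve-first-name index)
;; => ({?person :a, ?name "Fitzwilliam"}
;;     {?person :b, ?name "Elizabeth"}
;;     {?person :c, ?name "Charles"})
```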

The operations are part of the "Query Engine", and the resolution step is part of an API for talking to storage. This abstraction makes it possible to store the data in various ways, or even to reinterpret other data as a graph, so long as an API can be implemented to resolve constraint patterns into bindings.

Mulgara has a Resolver interface to describe this API. Asami adopts the same approach, using the Graph protocol.
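A minimal shape for such an abstraction can be sketched as follows. The protocol and record names here are my own illustration; Asami's actual Graph protocol defines more operations than this:

```clojure
;; A storage abstraction: anything that can resolve a triple pattern
;; into matching statements can back the query engine. Sketch only.
(defprotocol PatternSource
  (resolve-pattern [graph pattern]
    "Return the statements matching a triple pattern of ground values
     and ?variables."))

(defn- matches? [pattern-element value]
  (or (and (symbol? pattern-element)
           (= \? (first (name pattern-element))))   ; a variable matches anything
      (= pattern-element value)))                   ; a ground value must be equal

(defrecord MemoryGraph [triples]
  PatternSource
  (resolve-pattern [_ pattern]
    (let [[s p o] pattern]
      (filter (fn [[ts tp to]]
                (and (matches? s ts) (matches? p tp) (matches? o to)))
              triples))))

;; (resolve-pattern (->MemoryGraph [[:a :friend :c] [:b :friend :a]])
;;                  '[?x :friend :c])
;; => ([:a :friend :c])
```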

Mulgara has a number of Resolver implementations, including the transactional store, an RDF document parser, relational databases, GIS information, Lucene data, in-memory data, network-accessible APIs, and many more.

The first Graph implementation for Asami was a simple in-memory data structure, described in my ClojureD talk. The code for this appears in asami.index. This file started much smaller (as referenced above), but has since expanded with the needs of extended functionality, such as transactions and transitive closure operations.

Some of the API needs of our project at Cisco required a weighting for various edges, so I built a second in-memory Graph structure that could record multiples of the same edge (a so-called "Multi Graph"). This is found in asami.multi-graph.
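The multi-graph idea can be sketched by nesting the index one level deeper, mapping each edge to its multiplicity. This is my own illustration; the real structure is in asami.multi-graph:

```clojure
;; subject -> predicate -> object -> count of that edge.
;; Sketch only; see asami.multi-graph for the real implementation.
(defn add-edge [graph [s p o]]
  (update-in graph [s p o] (fnil inc 0)))

(-> {}
    (add-edge [:a :friend :c])
    (add-edge [:a :friend :c]))
;; => {:a {:friend {:c 2}}}
```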

Asami is in the process of gaining a new set of Graph implementations, based on structures inspired by Mulgara, though with some major differences. The first is that the blocks are not necessarily backed by files, and the second is reduced complexity, with fewer operations, improved efficiency, and greater utility. My next post will be about this, but the general structure is already documented on the Asami Wiki.

The astute reader may notice a few differences from Mulgara already, but I intend to explain why those differences exist, and the benefits they confer.
