Johannes Lichtenberger for SirixDB

Posted on Oct 6, 2020

Query the full history of your JSON database

#contributorswanted #hacktoberfest #java #database

Contribute to the Core and the Svelte based Frontend

The following diagram depicts the evolution of states through modification events in SirixDB:

Introduction

First and foremost, you probably don't need a database system to handle small JSON files, ranging to a few megabytes.

However, if you have to manage and query GBs of data, you should use a database system.

Usually, database systems, however, are not designed to keep the full history of your data. Often, the system overwrites data during a change or keeps the data for a short time. The latter usually happens due to transactions, which currently read slightly outdated data. Thus, a garbage collector has to wait until all reading transactions finish. Then, it's able to delete the old data.

Instead, SirixDB makes a huge persistent tree, durable during commits. It only ever appends data. Every revision is indexed, whereas the revisions share unchanged page-fragments. Think of it like Git, but on a sub-file level. Persistent trees are also common in functional languages like Haskell and Closure. A transaction commit, which serializes pages in a postorder traversal, is depicted in the following image:

Usually, mapping these persistent in-memory structures to a durable device adds write amplification, as the full leaf-to-root path has to be adapted. We, however, use a keyed trie for our main document store. Thus no structural changes occur as in B-trees. Furthermore, we developed a new sliding snapshot algorithm to version the data pages and store variable-sized page fragments. Therefore only a few changed records are written to a new page-fragment. Reconstructing an in-memory page involves reading a sliding window of page-fragments from random locations in parallel.

Due to the requirement for fast, random, fine granular reads, modern hardware as Byte Addressable NVM will be essential to good performance.

SirixDB stores the JSON data in a binary format, whereas each transaction is bound to a specific revision. Currently, N read-only transactions on a resource can coexist with one read-write transaction on a resource (represents the JSON data).

The encoding is as follows, except that we lately also introduced a pointer from the parent to the last child node.

Query Your Data

The system uses and extends an XQuery 3.0 processor to process both XML and JSON data.

We're able to import JSON files from a specific directory via:

jn:load('mycol.jn', (), io:ls('/home/johannes/json-data', '\\.json$'))

It'll create a database named mycol.jn and several resources.

We can also store a bunch of JSON-strings within several resources in the database (mycol.jn):

jn:store('mycol.jn',(),('["bla", "blubb"]','{"foo": true}'))

Note, that by when JSON-strings we have to replace true and false with true() and false() as well as null with jn:null().

We can then query the database collection via:

for $doc in jn:collection('mydocs.col') return $doc

Or, if we want to open a specific resource in the database, we can use:

jn:doc('mydocs.col', 'myresource1')

Furthermore, we can update the resource. Let's say we have the following small JSON file:

{
  "foo": [
    "bar",
    null,
    2.33
  ],
  "bar": {
    "hello": "world",
    "helloo": true
  },
  "baz": "hello",
  "tada": [
    {
      "foo": "bar"
    },
    {
      "baz": false
    },
    "boo",
    [],
    {}
  ]
}

We can insert a JSON object with:

insert json { "name": "keyword" } into jn:doc('mydocs.col', 'myresource1')=>tada=>foo

We dereference object field names with the => operator. It even can drill down into arrays and find the value for the field name.

The update-operation will create a new revision with the field name and the added value. For instance

jn:doc('mydocs.col', 'myresource1')=>tada[[0]]

will select the first object in the array value of the field name tada.

{
  "foo": "bar",
  "name": "keyword"
}

We implicitly query revision 2. If we instead query revision 1, we'll still retrieve the old JSON object.

jn:doc('mydocs.col', 'myresource1', 1)=>tada[[0]]

Note that we specified one as the third argument of the function. The output is:

{
  "foo": "bar"
}

Instead of providing a numeric value to the third parameter we can also open a specific revision by a timestamp (here: to check how a specifc resource looked like in April 2018):

jn:open('mydocs.col', 'myresource1', xs:dateTime(\"2018-04-01T05:00:00-00:00\"))

With the function open-revisions we're able to load all revisions of a resource between two points in time:

jn:open('mydocs.col', 'myresource1', xs:dateTime(\"2018-04-01T05:00:00-00:00\", xs:dateTime(\"2019-04-01T05:00:00-00:00\"))

The following query will insert an object into the array value dereferenced by =>foo as the second item:

insert json { "name": "keyword" } into jn:doc('mydocs.col', 'myresource1')=>foo at position 1

Using

jn:doc('mydocs.col', 'myresource1')=>foo

we get the output:

[
  "bar",
  {
    "foo": "bar"
  },
  null,
  2.33
]

Likewise, we can replace a JSON value with

replace json value of jn:doc('mydocs.col', 'myresource1')=>tada[[0]] with "yes"

or delete a value:

delete json jn:doc('mydocs.col', 'myresource1')=>tada[[0]]

If you submit the queries through the REST-API, you first have to be authorized. Thus, a revision, as in Git, also stores the author's name and a UUID.

To get the author name, who committed a specific revision:

jn:author-name(jn:doc(...))

To get the UUID:

jn:author-uuid(jn:doc(...))

You can also project fields:

jn:doc('mydocs.col', 'myresource1')=>bar{hello}

This will output:

{
  "hello": "world"
}

Time travel functions

Besides, we can use a bunch of time travel queries. The following functions retrieve different versions of a JSON item:

jn:future($item as json-item(), $includeSelf as xs:boolean) as json-item()*

Function for selecting the json-item in the future or the future-or-self. The first parameter is the context item. The second parameter denotes if the current item should be included in the result or not.

jn:past($item as json-item(), $includeSelf as xs:boolean) as json-item()*

Function for selecting the json-item in the past or the past-or-self. The first parameter is the context item. Second parameter denotes if the current item should be included in the result or not.

jn:all-times($item as json-item()) as json-item()+

Function for selecting the json-item in all revisions.

jn:first($item as json-item()) as json-item()?

Function for selecting the json-item in the first revision.

jn:last($item as json-item()) as json-item()?

Function for selecting the json-item in the last / most-recent revision.

jn:previous($item as json-item()) as json-item()?

Function for selecting the json-item in the previous revision.

jn:next($item as json-item()) as json-item()?

Function for selecting the json-item in the next revision.

To get an item in all revisions, in which it has been changed (or inserted):

jn:history($item as json-item()) as json-item()?

Diffing

We're of course also able to retrieve diffs:

jn:diff('mydocs.col', 'myresource2', 1, 3)

This function compares revision one and three of a resource called myresource2.

The output format is a JSON-string:

{
  "database": "json-path1",
  "resource": "shredded",
  "old-revision": 1,
  "new-revision": 3,
  "diffs": [
    {
      "insert": {
        "nodeKey": 26,
        "insertPositionNodeKey": 1,
        "insertPosition": "asFirstChild",
        "deweyID": "1.17.9",
        "depth": 2,
        "type": "jsonFragment",
        "data": "{\"tadaaa\":\"todooo\"}"
      }
    },
    {
      "delete": {
        "nodeKey": 13,
        "deweyID": "1.17.49",
        "depth": 2
      }
    },
    {
      "update": {
        "nodeKey": 15,
        "deweyID": "1.17.65",
        "depth": 2,
        "name": "tadaa"
      }
    }
  ]
}

For instance, this format is also used by our GUI, which we currently develop based on Svelte.

It denotes that SirixDB compared revisions 1 and 3. Furthermore, the different diff types are insert, delete, update and replace (the latter not shown). The insertPositionNodeKey is the context node, where an insert occurs, the insertPosition denotes if it's inserted as a first child, as the right sibling, or the left sibling.

We can also diff a subtree and specify two additional parameters. The root node, which is going to be compared, and the depth:

jn:diff('mydocs.col', 'myresource2', 1, 3, 7453, 2)

This compares also revision 1 and 3 of the myresource2 resource in the mydocs.col database. However, this time, the diff starts at the node, denoted by its unique nodeKey 7453. Furthermore the diffing should skip descendants, which are deeper than 2 levels.

In order to get the nodeKey of a specific item, we can use:

sdb:nodekey(...)

Conclusion

SirixDB offers powerful ways to query the full history of your data.

We omitted the creation of index structures, but you can also create secondary index structures with XQuery, which are automatically versioned as well.

Consider the following document (serialized SirixDB resource with two revisions):

{
  "sirix": [{
    "revisionNumber": 1,
    "revision": {
      "foo": ["bar", null, 2.33],
      "bar": {
        "hello": "world",
        "helloo": true
      },
      "baz": "hello",
      "tada": [{
        "foo": "bar"
      }, {
        "baz": false
      }, "boo", {},
        []
      ]
    }
  }, {
    "revisionNumber": 2,
    "revision": {
      "tadaaa": "todooo",
      "foo": ["bar", null, 2.33],
      "bar": {
        "hello": "world",
        "helloo": true
      },
      "baz": "hello",
      "tada": [{
        "foo": "bar"
      }, {
        "baz": false
      }, "boo", {},
        []
      ]
    }
  }]
}

We can create a Content-And-Structure (CAS) index as follows:

let $doc := jn:doc('mycol.jn','mydoc.jn')
let $stats := jn:create-cas-index($doc, 'xs:string', '/sirix/[]/revision/tada//[]/foo/[]/baz') return {"revision": sdb:commit($doc)}

In order to answer the query

let $result := jn:doc('mycol.jn','mydoc.jn')=>sirix[[2]]=>revision=>tada[.=>foo=>baz >= 'baa' and .=>foo=>baz <= 'brr'] return $result

Furthermore, we didn't mention FLOWR expressions, which are the cornerstone of XQuery. A simple example with an implicit join:

for $i in jn:doc('mycol.jn','mydoc.jn')=>paths=>"/consolidated_screening_list/search"=>get
let $j := $i=>parameters=>name
return for $k in $j
       where $k eq 'keyword'
       return { "result": $i, "nodekey": sdb:nodekey($i) }

DEV Community