DEV Community

Johannes Lichtenberger

Sirix.io - NoSQL document store which efficiently retains the history of your XML and JSON data and allows time-travel queries

I posted about this a few weeks ago, but I'd really love to get comments, so I think I should provide a bit of background information.

Questions, suggestions, and help of any kind would be greatly appreciated, as it's an open source project of mine (and was one for others, too, during my studies at the University of Konstanz 6 years ago).
Since then I have spent countless hours advancing the idea of a versioned storage system that is especially well suited for analytical tasks on time-varying data.

You can for instance query the history of your natively stored JSON data through an XQuery processor:

(: open the resource as it was at the given point in time :)
let $statuses := jn:open('mycol.jn','mydoc.jn', xs:dateTime('2019-04-13T16:24:27Z'))=>statuses
let $foundStatus := for $status in bit:array-values($statuses)
  let $dateTimeCreated := xs:dateTime($status=>created_at)
  (: keep statuses created after the cutoff that are absent from the previous revision :)
  where $dateTimeCreated > xs:dateTime("2018-02-01T00:00:00") and not(exists(jn:previous($status)))
  order by $dateTimeCreated
  return $status
(: output the revision number and project the status text :)
return {"revision": sdb:revision($foundStatus), $foundStatus{text}}

This query opens a database/resource in a specific revision (as the database knew it back then) based on a timestamp. It then searches for Twitter statuses whose created_at time is later than February 1st, 2018, keeping only those that did not exist in the previous revision. Finally, it outputs the revision number and projects the text of the found status.
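
To make the semantics a bit more concrete, here is a rough Python sketch of what the query computes. It is not how Sirix evaluates the query: the in-memory `history` list, the functions `open_revision` and `new_statuses_since`, and the sample data are all made up for illustration. It also assumes that a timestamp resolves to the last revision committed at or before it, and that a status keeps a stable ID across revisions.

```python
from bisect import bisect_right
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical history: one (commit time, {status id: status}) pair per revision.
old_post = {"created_at": "2018-01-15T10:00:00+00:00", "text": "old post"}
new_post = {"created_at": "2018-03-02T09:30:00+00:00", "text": "new post"}
history = [
    (datetime(2019, 4, 1, tzinfo=UTC), {1: old_post}),
    (datetime(2019, 4, 10, tzinfo=UTC), {1: old_post, 2: new_post}),
]


def open_revision(history, timestamp):
    """Return the 1-based number of the last revision committed at or before
    `timestamp` (an assumption about how a timestamp maps to a revision)."""
    commit_times = [committed for committed, _ in history]
    number = bisect_right(commit_times, timestamp)
    if number == 0:
        raise LookupError("no revision existed at that time")
    return number


def new_statuses_since(history, timestamp, created_after):
    """Statuses created after `created_after` that are present in the opened
    revision but absent from the revision before it, ordered by creation time."""
    number = open_revision(history, timestamp)
    current = history[number - 1][1]
    previous = history[number - 2][1] if number > 1 else {}
    found = [
        status
        for status_id, status in current.items()
        if status_id not in previous
        and datetime.fromisoformat(status["created_at"]) > created_after
    ]
    found.sort(key=lambda status: datetime.fromisoformat(status["created_at"]))
    return [{"revision": number, "text": status["text"]} for status in found]


result = new_statuses_since(
    history,
    datetime(2019, 4, 13, 16, 24, 27, tzinfo=UTC),
    datetime(2018, 2, 1, tzinfo=UTC),
)
```

In Sirix itself none of this bookkeeping is visible to the user; the point of the sketch is only to show which pieces of information (a revision picked by time, the state of the previous revision) the query combines.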

I'd especially love to discuss what documentation you need, which next steps are necessary (cost-based query optimizer, replication/partitioning...), API additions or changes...

I want to release version 0.9.0 in a couple of days, but maybe something is clearly missing from your point of view. I'd love any kind of comment :-)

https://sirix.io

Some of the features are:

  • Currently native XML and JSON storage (other data types might follow).

  • Transactional, versioned, typed user-defined index structures, which are automatically updated once a transaction commits.

  • Through XPath axis extensions we support navigation not only in space but also in time (future::, past::, first::, last::…). Furthermore, we provide several temporal XQuery functions thanks to our integral versioning approach. Temporal navigation for JSON resources is done via built-in XQuery functions.

  • An in-memory path summary, which is persisted during a transaction commit and always kept up-to-date.

  • Configurable versioning at the database level (full, incremental, differential, and a new sliding snapshot algorithm that balances reads and writes without introducing the write peaks usually caused by intermediate full dumps).

  • Log-structured sequential writes and random reads thanks to transactional copy-on-write (COW) semantics. This has nice benefits, for instance no locking for concurrent read transactions, and it takes full advantage of flash disks while avoiding their weaknesses.

  • Complete isolation of transactions: currently N concurrent read transactions and a single write transaction per resource.

  • The page structure is heavily inspired by ZFS and therefore also forms a tree. We’ll implement a similar Merkle tree and store the hash of each page in its parent's pointer for integrity checks.

  • Support for XQuery and XQuery Update thanks to a slightly modified version of brackit(.org).

  • Moves for the XML layer are additionally supported.

  • Automatic path-rewriting of descendant-axis to child-axis if appropriate.

  • Import of differences between two XML documents: after the first version of an XML document has been imported, an algorithm tries to update the Sirix resource with a minimum number of operations that change the first version into the new version.

  • A fast ID-based diff algorithm that determines the differences between any two versions of a resource stored in Sirix, optionally taking the hashes of nodes into account.

  • Optionally, each node stores its number of children, its number of descendants, a hash, and an ORDPATH / DeweyID label. The label is compressed on disk and makes it efficient to determine document order, besides supporting other nice properties of hierarchical node labels. Currently the number of children is always stored, and the number of descendants is stored if hashing is enabled.

  • Flexible backend.

  • Optional encryption and/or compression of each page on disk.
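
The copy-on-write point in the list above is probably easier to picture with a toy example. The following Python sketch is not Sirix code: `Page`, `put` and `get` are hypothetical names, and a tiny immutable binary search tree stands in for the real page tree. It only illustrates the general idea that a commit copies the pages on the path to a change, shares everything else, and never touches old roots, which is why readers of an older revision need no locks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Immutable tree page (a hypothetical stand-in, not a Sirix class)."""
    key: int
    value: str
    left: Optional["Page"] = None
    right: Optional["Page"] = None


def put(page: Optional[Page], key: int, value: str) -> Page:
    """Return a new root; only the pages on the path to `key` are copied."""
    if page is None:
        return Page(key, value)
    if key < page.key:
        return Page(page.key, page.value, put(page.left, key, value), page.right)
    if key > page.key:
        return Page(page.key, page.value, page.left, put(page.right, key, value))
    return Page(page.key, value, page.left, page.right)


def get(page: Optional[Page], key: int) -> Optional[str]:
    """Look up `key` starting from the root of one particular revision."""
    while page is not None:
        if key == page.key:
            return page.value
        page = page.left if key < page.key else page.right
    return None


# Build revision 1 with three entries.
root = None
for key, value in [(5, "a"), (3, "b"), (8, "c")]:
    root = put(root, key, value)
rev1 = root

# Revision 2 changes a single entry; revision 1 is left untouched.
rev2 = put(rev1, 3, "b2")
```

Here `get(rev1, 3)` still sees the old value while `get(rev2, 3)` sees the new one, and the unchanged subtree under key 8 is the very same object in both revisions. That structural sharing is what keeps storing many revisions cheap in this kind of design.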
