I'm engineering an evolutionary, temporal open-source data store, which currently stores XML and JSON data natively.
SirixDB is a log-structured, temporal NoSQL document store that stores evolutionary data. It never overwrites any data on disk, so we can restore and query the full revision history of a resource in the database efficiently. SirixDB ensures that each new revision creates only a minimum of storage overhead. We don't rely on a third-party storage engine but wrote our own from scratch.
Furthermore, each revision is indexed, so it makes no difference whether we query a past revision or the most recent one.
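To make the idea of a fully retained revision history concrete, here is a minimal in-memory sketch. It is a toy model and not SirixDB's API or on-disk layout: `ToyResource`, `PAGE_SIZE`, `commit`, and `read` are hypothetical names chosen for illustration. Every commit appends a new root that shares all unmodified pages with its predecessor, so any revision number can be read directly. Unlike SirixDB's page-level versioning, this toy copies each modified page in full.

```python
PAGE_SIZE = 4  # records per page; an arbitrary value for this sketch


class ToyResource:
    """Hypothetical append-only resource: each commit adds a root, none is overwritten."""

    def __init__(self):
        # One root per revision; a root is a tuple of immutable pages.
        self._roots = [()]  # revision 0 is empty

    def commit(self, changes):
        """Apply {record_id: value} on top of the latest revision.

        Returns the number of the newly created revision.
        """
        pages = list(self._roots[-1])  # copies page references, not page contents
        for record_id, value in changes.items():
            page_no, slot = divmod(record_id, PAGE_SIZE)
            while len(pages) <= page_no:
                pages.append((None,) * PAGE_SIZE)  # allocate missing empty pages
            page = list(pages[page_no])  # copy only the page being modified
            page[slot] = value
            pages[page_no] = tuple(page)
        self._roots.append(tuple(pages))
        return len(self._roots) - 1

    def read(self, revision, record_id):
        """Read a record as of the given revision."""
        page_no, slot = divmod(record_id, PAGE_SIZE)
        pages = self._roots[revision]
        return pages[page_no][slot] if page_no < len(pages) else None
```

For instance, after `r1 = db.commit({0: "a", 5: "b"})` and `r2 = db.commit({5: "c"})`, `db.read(r1, 5)` still returns `"b"` while `db.read(r2, 5)` returns `"c"`, and both revisions share the untouched first page.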
Currently, SirixDB offers two built-in native data models, namely a binary XML store as well as a JSON store.
Some of the most important core principles and design goals are:
- Concurrency – SirixDB contains very few locks and aims to be as suitable for multithreaded systems as possible
- Asynchronous REST API – operations can happen independently; each transaction is bound to a specific revision, and a resource permits only one read-write transaction at a time, running concurrently with N read-only transactions
- Versioning/Revision history – SirixDB stores a revision history of every resource in the database while keeping storage overhead to a minimum. Read and write performance is tunable: it depends on the versioning type, which we specify when creating a resource
- Data integrity – SirixDB, like ZFS, stores full checksums of the pages in the parent pages. That means almost all data corruption can be detected upon reading. In the future, the SirixDB developers aim to partition and replicate databases
- Copy-on-write semantics – similarly to the file systems Btrfs and ZFS, SirixDB uses CoW semantics, meaning that SirixDB never overwrites data. Instead, database page fragments are copied and written to a new location
- Per-revision and per-page versioning – SirixDB versions not only on a per-revision basis but also on a per-page basis. Thus, whenever we change a potentially small fraction of the records in a data page, SirixDB does not have to copy the whole page and write it to a new location on a disk or flash drive. Instead, when creating a database resource, we can specify one of several versioning strategies known from backup systems, or a sliding snapshot algorithm. SirixDB uses the versioning type we specify to version data pages
- Guaranteed atomicity (without a WAL) – the system will never enter an inconsistent state (unless there is a hardware failure), meaning that an unexpected power-off won't ever damage the system. This is accomplished without the overhead of a write-ahead log (WAL)
- Log-structured and SSD friendly – SirixDB batches writes and syncs everything sequentially to a flash drive during commits. It never overwrites committed data
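The data-integrity principle above, checksums of pages kept in their parent pages, can be sketched as follows. This is a simplified illustration under stated assumptions, not SirixDB's actual page format: `ToyStore`, `make_ref`, `read_verified`, and `CorruptPageError` are hypothetical names, and SHA-256 is chosen here only because it is available in the standard library.

```python
import hashlib
import json


class CorruptPageError(Exception):
    """Raised when a page does not match the checksum held by its parent."""


class ToyStore:
    """Hypothetical append-only log of byte 'pages'."""

    def __init__(self):
        self.log = []

    def append(self, data: bytes) -> int:
        self.log.append(data)
        return len(self.log) - 1  # offset of the page just written


def make_ref(store: ToyStore, data: bytes) -> dict:
    """Write a page and return the reference its parent page would hold."""
    return {
        "offset": store.append(data),
        "checksum": hashlib.sha256(data).hexdigest(),
    }


def read_verified(store: ToyStore, ref: dict) -> bytes:
    """Read a page and verify it against the checksum stored in the parent."""
    data = store.log[ref["offset"]]
    if hashlib.sha256(data).hexdigest() != ref["checksum"]:
        raise CorruptPageError(f"checksum mismatch at offset {ref['offset']}")
    return data


def build_tree(store: ToyStore, records: list) -> dict:
    """Write leaf pages first, then a parent page that embeds their references."""
    leaf_refs = [make_ref(store, record) for record in records]
    parent_bytes = json.dumps(leaf_refs).encode()
    return make_ref(store, parent_bytes)  # reference to the parent (root) page
```

Because the parent page is serialized with its children's checksums inside it, and the parent's own checksum is held one level higher, a flipped bit anywhere below the root surfaces as a `CorruptPageError` the next time that page is read.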
I want to finalize the APIs, add some performance enhancements, fix bugs, and listen to the community before hopefully releasing 1.0.0 sooner rather than later this year.
Afterwards, I first and foremost envision scaling SirixDB horizontally. Furthermore, a single-page application (SPA) would be great as a web frontend and as a place to embed interactive visual analytics approaches for comparing revisions. I'd also love to work on query rewrite rules for the query engine to really speed up queries.