DEV Community

Discussion on: Designing a Novel Temporal Database Storage Engine for Byte-Addressable Non-Volatile Memory

Johannes Lichtenberger

I think most storage engines are based on some kind of key-value storage (B+-tree, LSM-tree, and derivatives), so the document, graph, column, or whatever other model can be built on top.

Mostly, however, I think they are still designed around the assumption that the storage device is slow and block-based. Index structures that fetch several records together still have value, though, as NVM is not as fast as DRAM and its byte addressability also seems somewhat limited. I've read that 128 bytes or so is the finest granularity when NVM is used in persistence mode (App Direct mode on Intel Optane -- but take it with a grain of salt) and not just as a memory replacement. Furthermore, sequential reads and writes are still a bit faster than random reads and writes. So I think it's best to read and write a few records in one go, effectively splitting database pages into smaller, variable-sized fragments. That's also the idea of the sliding snapshot: to trail a full version of a page over a predefined number of revisions and otherwise store only the changed records within a variable-sized page fragment.
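To make the fragment idea concrete, here is a deliberately simplified sketch (not SirixDB's actual API -- all names are made up). It stores a full page periodically and only changed records in between; the real sliding snapshot additionally spreads the full version across the window to avoid write peaks, which this sketch omits:

```python
# Simplified sketch of page versioning with variable-sized fragments.
# Assumption: a "page" is a dict of record-key -> record-value.

WINDOW = 4  # hypothetical window size: every 4th revision stores a full page


def write_revision(history, revision, records, changed_keys):
    """Store either a full page or a fragment holding only changed records."""
    if revision % WINDOW == 0:
        history[revision] = ("full", dict(records))
    else:
        history[revision] = ("delta", {k: records[k] for k in changed_keys})


def restore(history, revision):
    """Reconstruct a page by walking back to the most recent full version."""
    fragments = []
    r = revision
    while True:
        kind, data = history[r]
        fragments.append(data)
        if kind == "full":
            break
        r -= 1
    # Apply oldest fragment first so newer records overwrite older ones.
    page = {}
    for frag in reversed(fragments):
        page.update(frag)
    return page
```

Reconstruction touches at most one full page plus the deltas since it was written, which is why keeping fragments small and variable-sized pays off on byte-addressable storage.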

I'm storing versioned AVL trees for indexing purposes in the leaf nodes of tries. Ideally, though, they fit entirely in main memory, as should be the case nowadays :-) The only downside might be balancing the binary tree, which could be costly with respect to versioning.

In conclusion, though, I think the assumption of slow random reads has been invalid since the advent of SSDs. Now it seems that random writes are also much faster with NVM in comparison with SSDs, and random reads are almost as fast as sequential reads. This, combined with fine-granular reading and writing, opens up a wealth of new opportunities; at the very least, pages can be stored in a much more fine-granular way (without fixed block sizes).

SirixDB is based on a system we developed at the University of Konstanz during my studies: a persistent, versioned XML database system (named Treetank back then). Marc Kramis came up with the basic idea of a huge persistent tree-based data structure for his PhD back in 2006, and I'm more than impressed by his vision of storage trends (mostly inspired by ZFS, though, bringing some of their impressive work down to the sub-file level). His report, by the way, is from 2008:

Growing Persistent Trees into the 21st Century