Memgraph is an in-memory graph database that recently added support for working with data that cannot fit into memory. This allows users with smaller budgets to still load large graphs to Memgraph without paying for (more) expensive RAM. However, expanding the main-memory graph database to support disk storage is, by all means, a complex engineering endeavor. Let’s break this process down into pieces.
Disk-based databases have been, for a long time, a de facto standard in the database development world. Their huge advantage lies in their ability to store a vast amount of data relatively cheaply on disk. However, the development can be very complex due to the interaction with low-level OS primitives. Fetching data from disk is something that everyone strives to avoid since it takes approximately 10x more time than using it from main memory. Neo4j is an example of a graph, an on-disk database that uses disk as its main storage media while trying to cache as much data as possible to main memory so it could be reused afterward.
In-memory databases avoid the fundamental cost of accessing data from disk by simply storing all its data in the main memory. Such architecture also significantly simplifies the development of the storage part of the database since there is no need for a buffer pool. However, the biggest issue with in-memory databases is when the data cannot fit into the random access memory since the only possible way out is to transfer the data to a larger and, consequently, more expensive machine.
In-memory database users rely on the fact that durability is still secured through durability mechanisms like transaction logging and snapshots so that data loss does not occur.
Larger-than-memory architecture describes a database architecture when the majority of computations are still within the main memory, but the database offers the ability to store a vast amount of data on disk, too, without having the computational complexity of interacting with buffer pools.
The larger-than-memory architecture utilizes the fact that there are always hot and cold parts of the database in terms of accessing it. The goal is then to find cold data stored and move it to the disk so that transactions still have fast access to hot data. Cold data identification can be done either by directly tracking transactions’ access patterns (online) or by offline computation in which a background thread analyzes data.
The second very important feature of the larger-than-memory architecture is the process of evicting cold data. This can be done in two ways:
- DB tracks the memory usage and starts evicting data as soon as it reaches a predefined threshold.
- Eviction can be done only when new data is needed.
Different systems also behave differently regarding transaction management. If the transaction needs data that is currently stored on the disk, it can:
- Abort the transaction, fetch data stored on the disk, and restart the transaction.
- Stall the transaction by synchronously fetching data from the disk.
The question is, what happens when the transaction data cannot fit into random access memory? In Memgraph, we decided to start with an approach that all transaction data must fit into memory. This means that some analytical queries cannot be executed on a large dataset, but this is the tradeoff we were willing to accept in the first iteration.
Memgraph uses RocksDB as a key-value store for extending the capabilities of the in-memory database. Not to go into too many details about RocksDB, but let’s just briefly mention that it is based on a data structure called Log-Structured Merge-Tree (LSMT) (instead of B-Trees, typically the default option in databases), which are saved on disk and because of the design come with a much smaller write amplification than B-Trees.
The in-memory version of Memgraph uses Delta storage to support multi-version concurrency control (MVCC). However, for larger-than-memory storage, we decided to use the Optimistic Concurrency Control Protocol (OCC) since we assumed conflicts would rarely happen, and we could make use of RocksDB’s transactions without dealing with the custom layer of complexity like in the case of Delta storage.
We’ve implemented OCC in a way that every transaction has its own private workspace, so potential conflicts are detected at the commit time. One of our primary requirements before starting to add disk-based data storage was not to ruin the performance of the main memory-based storage. Although we all knew there was no such thing as zero-cost abstraction, we managed to stay within 10% of the original version. We decided to use snapshot isolation as an appropriate concurrency isolation level since we believed it could be the default option for the large majority of Memgraph users.
As always, not everything is sunshine and flowers, especially when introducing such a significant feature to an existing database, so there are still improvements to be made. First, the requirement that a single transaction must fit into memory makes it impossible to use large analytical queries.
It also makes our LOAD CSV command for importing CSV files practically unusable since the command is executed as a single transaction. Although RocksDB is really good, fits really well into our codebase, and has proved to be very efficient in its caching mechanisms, maintaining an external library is always hard.
Albeit the significant engineering endeavor, the larger-than-memory architecture is a super valuable asset to Memgraph users since it allows them to store large amounts of data cheaply on disk without sacrificing the performance of in-memory computation. We are actively working on resolving issues introduced with the new storage mode, so feel free to ask, open an issue, or pull a request. We will be more than happy to help. Until next time