Jake Grogan

GhostDB - An Ultra Fast Distributed Cache

GhostDB

Overview of my Final Year Project

GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

The primary purpose of this system is to speed up dynamic database- or API-driven websites by storing data in RAM, reducing the number of times an external data source such as a database or API must be read.

GhostDB provides a very large hash table, distributed across multiple machines, that stores large numbers of key-value pairs. However, it's more than just a big hash table. GhostDB has multiple levels of on-disk persistence (point-in-time snapshots and an append-only file) used to restore a cache node if it crashes, concurrent crawlers that automatically evict stale data, and fine-grained metrics. If configured correctly, it can serve 200k requests per second from more than 1.5M concurrent keep-alive connections per physical server.
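To give a rough feel for the append-only-file idea in general terms (this is a generic sketch, not GhostDB's actual implementation), every write is appended to a log on disk, and a restarting node replays that log to rebuild its in-memory table:

```python
import json

class AppendOnlyStore:
    """Generic append-only-file sketch, not GhostDB internals."""

    def __init__(self, path):
        self.path = path
        self.table = {}
        self._replay()

    def _replay(self):
        # Rebuild the in-memory table by replaying the log from disk.
        try:
            with open(self.path) as f:
                for line in f:
                    op = json.loads(line)
                    if op["cmd"] == "put":
                        self.table[op["key"]] = op["value"]
                    elif op["cmd"] == "delete":
                        self.table.pop(op["key"], None)
        except FileNotFoundError:
            pass  # no log yet; start with an empty table

    def put(self, key, value):
        # Log the write first, then apply it in memory.
        with open(self.path, "a") as f:
            f.write(json.dumps({"cmd": "put", "key": key, "value": value}) + "\n")
        self.table[key] = value
```

Point-in-time snapshots complement a log like this by bounding how much of it has to be replayed after a crash.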

Users of GhostDB can configure GhostDB nodes on machines they have control over. A client's application servers can then interact with GhostDB through the GhostDB SDKs, much the same way they might interface with a MySQL or MongoDB instance.

Our SDKs are available for Python and Node.js (on PyPI and NPM respectively) and are simple to use. With just 4 lines of extra code in your application servers, you can achieve up to a 25x increase in data retrieval speed compared to databases such as MongoDB and MySQL.
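As a rough idea of what that looks like in an application server (the package, class, and method names below are illustrative assumptions, not the real SDK interface; check the published package on PyPI for that), the usual pattern is to try the cache first and fall back to the database on a miss:

```python
from ghostdb import GhostDB  # assumed import; actual package/class names may differ

def fetch_user_from_db(user_id):
    # Stand-in for a slow MySQL/MongoDB query.
    return {"id": user_id, "name": "example"}

cache = GhostDB("ghostdb.conf")    # assumed: points the client at your cluster config

user = cache.get("user:42")        # try the cache first
if user is None:
    user = fetch_user_from_db(42)  # cache miss: hit the primary data source
    cache.put("user:42", user)     # populate the cache for subsequent requests
```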

Making GhostDB distributed

This was one of the most significant challenges my project partner (Connor Mulready) and I faced when developing GhostDB. As the nature of this system is to cache data, we didn't feel that building a truly distributed system was necessary. Instead, GhostDB is client-side distributed, meaning no nodes in your cluster know about each other. The client decides which node data should be sent to based on the hashes of keys. This means that in the event a node fails, only k / n key-value pairs must be reassigned to other nodes (where k is the number of keys and n is the number of nodes). When the failed node comes back online, we don't need to reassign all the key-value pairs back to it, as the node will rebuild itself if persistence is enabled.

We achieved this distribution through multiple layers of abstraction over an AVL-tree.
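For a rough picture of how a client can pick a node this way, here is a minimal consistent-hashing sketch, with a sorted list standing in for the ordered structure that GhostDB builds over an AVL tree (the node addresses are placeholders):

```python
import bisect
import hashlib

class HashRing:
    """Minimal client-side hash ring; a sorted list plays the AVL tree's role."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node whose hash is >= the key's hash,
        # wrapping around to the start of the ring if necessary.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["cache-node-1", "cache-node-2", "cache-node-3"])
print(ring.node_for("user:42"))  # the client sends this key to that node
```

Because only the keys that hashed to the failed node move, the rest of the cluster keeps serving its keys untouched.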

Where is GhostDB now?

We haven't just stuck GhostDB in a repository to gather dust. We believe it offers more features than some existing caching systems and has its place among them.

Just today, we released the first version of GhostDB for Linux, which is available to download for free from our site, www.ghostdbcache.com. We will continue developing GhostDB to make it bigger and better, with plans for a CLI and a central log-storage and cluster-monitoring platform.

Top comments (2)

leob

Can you explain in a few sentences how it compares with something like e.g. Redis?

Faisal Abid

Would like to understand this as well