When talking about a key-value store in a backend system, the very first thing that comes to my mind is a cache storage like Redis. Caching is a technique used to avoid hitting the database multiple times for the same data retrieval, thereby serving read requests more efficiently. However, fitting everything in a hash map or in memory becomes limiting as the size of data grows over time. Although there are ways to optimize how data is stored (via data compression, or by separately storing frequently accessed data in memory and the rest on disk), a system will eventually require a distributed key-value store to handle more traffic.
When designing a distributed key-value store system, the CAP theorem has to be considered: Consistency, Availability, and Partition tolerance. In fact, partition tolerance is almost inevitable in modern systems as network failures cannot be avoided all the time. So, in reality, we have to make tradeoffs between consistency and availability. Here are 3 possible combinations:
- CA: choose consistency and availability over partition tolerance (unrealistic)
- CP: choose consistency and partition tolerance over availability
- AP: choose availability and partition tolerance over consistency
Given these tradeoffs, one should further consider the following:
- Data partition and replication
- Consistency (strong, weak, eventual)
- Inconsistency resolution
- Handling failures (temporary, permanent)
- Write and read paths
When we partition or replicate data, "consistent hashing" can be used to ensure minimum data movement and an even distribution of data amongst key-value stores. Checked!
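To make this concrete, here is a minimal consistent-hashing sketch in Python; the server names, the MD5 hash function, and the number of virtual nodes are just assumptions for the example.

```python
import hashlib
from bisect import bisect_right, insort

# A minimal consistent-hashing sketch. Server names, the MD5 hash, and the
# number of virtual nodes are illustrative assumptions.
class ConsistentHashRing:
    def __init__(self, servers, virtual_nodes=3):
        self.virtual_nodes = virtual_nodes
        self.ring = []                      # sorted list of (hash, server) pairs
        for server in servers:
            self.add_server(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        # Virtual nodes help distribute keys evenly and minimize data movement
        # when servers are added or removed.
        for i in range(self.virtual_nodes):
            insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def get_server(self, key):
        # Walk clockwise to the first server position at or after the key's hash.
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server0", "server1", "server2"])
print(ring.get_server("user:42"))           # the server responsible for this key
```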
To ensure consistency, "Quorum consensus" can be used for both read and write operations. Quorum consensus basically states that,
Given N replicas, W as the write quorum size, and R as the read quorum size,
- For a write operation to be considered successful, the write must be acknowledged by at least W replicas
- For a read operation to be considered successful, the read must wait for responses from at least R replicas
For example, if data is replicated to server0, server1, and server2, and W = 1, a coordinator must wait for an acknowledgment from at least one of the servers.
If W + R > N, strong consistency is guaranteed, because the read and write quorums always overlap in at least one replica, so every read sees the latest acknowledged write.
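Here is a minimal Python sketch of a coordinator doing quorum writes and reads against in-memory "replicas"; the dict-based storage and integer versions are assumptions for illustration only.

```python
# A minimal sketch of quorum reads and writes against in-memory "replicas".
# The dict-based replicas and integer versions are illustrative assumptions.
N, W, R = 3, 2, 2            # W + R > N, so read and write sets always overlap

replicas = [dict() for _ in range(N)]

def quorum_write(key, value, version):
    acks = 0
    for replica in replicas:            # pretend each assignment is a network call
        replica[key] = (version, value)
        acks += 1
        if acks >= W:                   # the coordinator returns once W replicas ack
            return True
    return False

def quorum_read(key):
    # Wait for R responses; some replicas may hold an older (or no) version.
    responses = [replica.get(key) for replica in replicas[N - R:]]
    latest = max((r for r in responses if r is not None), default=None)
    return latest[1] if latest else None    # the newest version wins

quorum_write("user:42", "alice", version=1)
print(quorum_read("user:42"))               # "alice"
```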
There are 3 consistency models that can be considered:
- Strong consistency: all replicas or nodes will see up-to-date data all the time
- Weak consistency: somewhere in the middle of strong and eventual consistency
- Eventual consistency: given enough time, all updates will be propagated and all replicas are consistent
To resolve inconsistency amongst replicas, versioning and vector clocks can be used to detect and reconcile conflicts.
- The versioning technique basically treats each data modification as a new immutable version of the data.
- A vector clock is a [server, version] pair associated with a data item. For example, D([s1,v1], [s2,v2], ..., [sn,vn]), where D is a data item, sn is the nth server, and vn is the nth version counter. When D is written to server si, the system has to perform one of the following operations:
- Increment vi if [si,vi] exists
- Otherwise, create a new entry [si, 1]
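Here is a minimal vector-clock sketch in Python that follows the two rules above; the server names and the dict representation are assumptions for illustration.

```python
# A minimal vector-clock sketch. Server names are hypothetical; the clock is
# the D([s1,v1], [s2,v2], ...) structure described above, stored as a dict.
def increment(clock, server):
    # Increment vi if [si, vi] exists, otherwise create a new entry [si, 1].
    clock = dict(clock)
    clock[server] = clock.get(server, 0) + 1
    return clock

def descends(a, b):
    # True if clock a is equal to or newer than clock b on every server.
    return all(a.get(s, 0) >= v for s, v in b.items())

def conflict(a, b):
    # Neither clock descends from the other: concurrent writes, needs reconciliation.
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "s1")            # {"s1": 1}
v2 = increment(v1, "s2")            # {"s1": 1, "s2": 1}
v3 = increment(v1, "s3")            # {"s1": 1, "s3": 1}
print(conflict(v2, v3))             # True: sibling versions must be reconciled
```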
To handle failures (of replicas/servers/nodes), a system first needs to detect a failure. "Gossip protocol" can be used for that purpose. The following is how it works:
- each node maintains a node membership list (memberID and heartbeats)
- each node periodically increments its heartbeat counter and sends heartbeats to a set of random nodes (which in turn propagate to another set of nodes)
- once nodes receive heartbeats, the membership list is updated. If a member's heartbeat hasn't increased for more than a predefined period, the member is considered offline
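Here is a rough Python sketch of gossip-based failure detection along those lines; the gossip fanout of 2 peers and the failure threshold are assumptions for illustration.

```python
import random, time

# A minimal gossip-style failure-detection sketch. The fanout of 2 peers per
# round and the 3-second failure threshold are illustrative assumptions.
FAIL_AFTER = 3.0     # seconds without a heartbeat increase before marking offline

class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []
        # membership list: member id -> (heartbeat counter, last time it increased)
        self.members = {name: (0, time.time())}

    def tick(self):
        # Each node periodically increments its own heartbeat counter ...
        heartbeat, _ = self.members[self.name]
        self.members[self.name] = (heartbeat + 1, time.time())
        # ... and gossips its membership list to a few random peers,
        # which in turn propagate it to other nodes on their own ticks.
        for peer in random.sample(self.peers, k=min(2, len(self.peers))):
            peer.receive(self.members)

    def receive(self, remote_members):
        # Merge: keep the highest heartbeat seen for each member.
        for member, (heartbeat, _) in remote_members.items():
            local_heartbeat, _ = self.members.get(member, (-1, 0.0))
            if heartbeat > local_heartbeat:
                self.members[member] = (heartbeat, time.time())

    def offline_members(self):
        # Members whose heartbeat hasn't increased for FAIL_AFTER seconds.
        now = time.time()
        return [m for m, (_, last) in self.members.items()
                if now - last > FAIL_AFTER and m != self.name]
```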
After detecting a failure, "Sloppy Quorum" can be used to recover from temporary failures.
"Quorum" means the minimum number of members of an assembly to make proceedings of that meeting valid.
It works as follows:
- A system chooses the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. If a server is unavailable due to network or server failures, another server will process its requests temporarily. "Hinted handoff" is then used to achieve data consistency: when the down server comes back up, the changes are pushed back to it.
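Here is a toy Python sketch of sloppy quorum with hinted handoff; the servers dict, the up/down flags, and the hint format are assumptions for illustration.

```python
# A minimal hinted-handoff sketch. The servers dict, up/down flags, and the
# hint format are illustrative assumptions.
servers = {
    "server0": {"up": True,  "data": {}, "hints": []},
    "server1": {"up": False, "data": {}, "hints": []},   # temporarily down
    "server2": {"up": True,  "data": {}, "hints": []},
}

def write(key, value, target, fallback):
    if servers[target]["up"]:
        servers[target]["data"][key] = value
    else:
        # Sloppy quorum: a healthy server takes the write and records a hint.
        servers[fallback]["data"][key] = value
        servers[fallback]["hints"].append((target, key, value))

def handoff(fallback):
    # When a down server recovers, its hinted writes are pushed back to it.
    remaining = []
    for target, key, value in servers[fallback]["hints"]:
        if servers[target]["up"]:
            servers[target]["data"][key] = value
        else:
            remaining.append((target, key, value))    # keep the hint for later
    servers[fallback]["hints"] = remaining

write("user:42", "alice", target="server1", fallback="server2")
servers["server1"]["up"] = True      # server1 comes back online
handoff("server2")
print(servers["server1"]["data"])    # {'user:42': 'alice'}
```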
On the other hand, "Anti-Entropy protocol" can be used to recover from permanent failures.
It keeps replicas in sync by comparing each piece of data on the replicas and updating each replica to the newest version.
- A Merkle Tree is used for inconsistency detection and efficiently minimizing the amount of data transferred.
- A Merkle Tree has every non-leaf node labeled with the hash of the labels or values of its child nodes. To learn more about it, check out Gaurav's Video
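Here is a minimal Python sketch of using a Merkle tree to find out-of-sync data between two replicas; hashing keys into 4 buckets and the "key=value" leaf encoding are assumptions for illustration.

```python
import hashlib

# A minimal Merkle-tree sketch for comparing two replicas. Hashing keys into
# 4 buckets and the "key=value" leaf encoding are illustrative assumptions.
def sha(text):
    return hashlib.sha256(text.encode()).hexdigest()

def bucket_hashes(data, buckets=4):
    # Leaf level: each bucket's label is the hash of the key/value pairs in it.
    groups = [[] for _ in range(buckets)]
    for key in sorted(data):
        groups[int(sha(key), 16) % buckets].append(f"{key}={data[key]}")
    return [sha("|".join(group)) for group in groups]

def merkle_root(hashes):
    # Every non-leaf node is labeled with the hash of its children's labels.
    while len(hashes) > 1:
        hashes = [sha(hashes[i] + hashes[i + 1]) for i in range(0, len(hashes), 2)]
    return hashes[0]

replica_a = {"k1": "v1", "k2": "v2"}
replica_b = {"k1": "v1", "k2": "stale"}
if merkle_root(bucket_hashes(replica_a)) != merkle_root(bucket_hashes(replica_b)):
    # Roots differ: walk down and sync only the buckets whose hashes mismatch,
    # instead of transferring the whole data set.
    mismatched = [i for i, (a, b) in enumerate(zip(bucket_hashes(replica_a),
                                                   bucket_hashes(replica_b))) if a != b]
    print("buckets to sync:", mismatched)
```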
Write Path
- The write request is persisted on a commit log file
- Data is saved in the memory caches
- When the memory cache is full or reaches a predefined threshold, data is flushed to an SSTable (Sorted-String Table) on disk
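Here is a minimal Python sketch of that write path; the file names and the flush threshold are assumptions for illustration.

```python
# A minimal write-path sketch: commit log -> in-memory cache -> SSTable flush.
# The file names and flush threshold are illustrative assumptions.
FLUSH_THRESHOLD = 3
memtable = {}                 # in-memory cache of recent writes
sstable_count = 0

def write(key, value):
    # 1. The write request is first persisted to a commit log for crash recovery.
    with open("commit.log", "a") as log:
        log.write(f"{key}\t{value}\n")
    # 2. The data is then saved in the in-memory cache.
    memtable[key] = value
    # 3. Once the cache reaches a predefined threshold, flush it to an SSTable.
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush():
    global sstable_count
    # An SSTable stores key/value pairs sorted by key on disk.
    with open(f"sstable-{sstable_count:04d}.txt", "w") as sstable:
        for key in sorted(memtable):
            sstable.write(f"{key}\t{memtable[key]}\n")
    sstable_count += 1
    memtable.clear()
```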
Read Path
After a read is directed to a specific node, the system first checks whether the data is in the memory cache; otherwise, it returns the data from disk. To find which SSTable on disk contains the key, a "Bloom filter" is commonly used.
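Here is a minimal Python sketch of the read path with a toy Bloom filter per SSTable; the filter size and the double-hashing scheme are assumptions for illustration.

```python
import hashlib

# A minimal read-path sketch with a toy Bloom filter per SSTable. The filter
# size and double-hashing scheme are illustrative assumptions.
class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes, self.bits = size, num_hashes, 0

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means the key is definitely absent; True means "possibly present".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

def read(key, memtable, sstables):
    # 1. Check the in-memory cache first.
    if key in memtable:
        return memtable[key]
    # 2. Otherwise, consult each SSTable's Bloom filter to skip tables that
    #    definitely do not contain the key, and search the remaining ones.
    for bloom, table in sstables:
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None
```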