Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It's a fully managed, multiregion, multimaster database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second.
Sounds interesting! So let's dive into it. Keep in mind, though, that the paper was published in 2007 and describes Dynamo, the internal system that later inspired DynamoDB. Newer research has since shown that it is possible to combine strong consistency with sharding and relatively high availability (e.g. Spanner, Calvin).
At Amazon's scale, service performance and availability are critical. Services that manage state (i.e. stateful services) are among the most difficult to run reliably.
Traditionally, stateful services are backed by relational databases. Amazon identified several limitations with this approach:
- Overkill for simple access patterns (for example, retrieving data by primary key only)
- They favor consistency over availability
Dynamo has the following requirements:
- Latency SLAs measured at the 99.9th percentile
- Favor write availability (at the cost of more complex reads, which must handle conflict resolution)
- Easy management/deployment focused on availability
- Every node has the same role (no masters or slaves, for example)
- Automatic scalability by adding nodes
- Decentralized management to avoid single point of failure
As mentioned previously, Dynamo is a distributed key-value store focused on high scalability and availability. To achieve these characteristics, key-value pairs are replicated and sharded (i.e. partitioned) across multiple nodes.
The core concept of Dynamo is the consistent hashing scheme used to distribute keys across nodes. The consistent hashing algorithm maps the key space onto a ring, as in the figure below.
To explain how the sharding and replication works, I will first explain the best case scenario without any failures or concurrent edits. Afterwards, we will see how the system can cope with failures.
As explained above, the partitioning is performed with the consistent hash ring. The ring is split into multiple virtual nodes that are mapped to physical nodes. Using virtual nodes offers multiple advantages:
- Simplifies the addition and removal of nodes
- Allows different virtual node allocations depending on the node's hardware
- Reduces workload skew, as each physical node handles multiple virtual nodes
Each key is replicated according to its preference list. This list is computed by walking the ring clockwise and taking the next nodes. For example, in Figure 1, key 1 is replicated on nodes A, B, and C.
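To make the ring, virtual nodes, and preference list concrete, here is a minimal Python sketch. The names (`HashRing`, `preference_list`, `VNODES_PER_NODE`) and the use of MD5 are my own illustrative choices, not details from the paper:

```python
import hashlib
from bisect import bisect_right

VNODES_PER_NODE = 8  # each physical node owns several positions on the ring

def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (here: 128-bit MD5 space)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each virtual node is a (position, physical_node) pair, sorted by position.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(VNODES_PER_NODE)
        )

    def preference_list(self, key: str, n_replicas: int = 3):
        """Walk the ring clockwise from the key's position and collect
        the first n_replicas distinct physical nodes."""
        start = bisect_right(self.ring, (ring_hash(key),))
        result = []
        for i in range(len(self.ring)):
            _, node = self.ring[(start + i) % len(self.ring)]
            if node not in result:
                result.append(node)
            if len(result) == n_replicas:
                break
        return result

ring = HashRing(["A", "B", "C", "D"])
print(ring.preference_list("key1"))  # e.g. ['C', 'A', 'B']
```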
When executing a query (GET or PUT), the client connects to a node that acts as the coordinator for this query. Given the key's preference list, the coordinator sends the query to every node in the list and responds to the client only once a quorum has answered. Each application can choose its read and write quorums depending on its needs. If an application requires consistency, it could configure the write quorum W = N (where N is the number of replicas) and R = 1. Another application could use W = N - 1 to remain available when a single node fails.
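Here is a rough sketch of how a coordinator could apply the (N, R, W) settings; `send_to_replica` and the in-memory "cluster" are hypothetical placeholders for the real network calls:

```python
N = 3  # replicas per key
W = 2  # write quorum
R = 2  # read quorum (the paper notes that R + W > N yields a quorum-like system)

def coordinator_put(key, value, replicas, send_to_replica):
    """Send the write to every replica in the preference list and
    acknowledge the client once W replicas have answered."""
    acks = 0
    for replica in replicas:
        if send_to_replica(replica, ("PUT", key, value)):
            acks += 1
        if acks >= W:
            return True  # success reported to the client
    return False  # not enough replicas answered

def coordinator_get(key, replicas, send_to_replica):
    """Read from replicas until R versions are collected; the caller then
    reconciles divergent versions (see the vector clock discussion below)."""
    versions = []
    for replica in replicas:
        response = send_to_replica(replica, ("GET", key))
        if response is not None:
            versions.append(response)
        if len(versions) >= R:
            break
    return versions

# Toy usage with an in-memory "cluster" standing in for the network:
store = {r: {} for r in ["A", "B", "C"]}
def fake_send(replica, msg):
    if msg[0] == "PUT":
        store[replica][msg[1]] = msg[2]
        return True
    return store[replica].get(msg[1])

coordinator_put("key1", "v1", ["A", "B", "C"], fake_send)
print(coordinator_get("key1", ["A", "B", "C"], fake_send))  # ['v1', 'v1']
```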
To handle the consistency issues that can arise, Dynamo keeps every copy of the value along with the relationship between updates, recorded as a vector clock. When two versions diverge (as in Figure 2), two conflict-resolution strategies are possible (see the sketch after this list):
- Business-logic reconciliation, where the client provides a custom reconciliation algorithm
- Timestamp reconciliation (a.k.a. last write wins)
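The sketch below shows the vector clock comparison that decides whether one version dominates another or whether two versions are concurrent siblings that need reconciliation; the dict-based clock representation is my own simplification for illustration:

```python
def vc_descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything in clock `b` (a >= b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def reconcile(versions):
    """Drop versions that are dominated by another one; whatever remains is
    either a single winner or a set of siblings that the application
    (or a last-write-wins policy) must merge."""
    survivors = []
    for value, clock in versions:
        dominated = any(
            vc_descends(other_clock, clock) and other_clock != clock
            for _, other_clock in versions
        )
        if not dominated:
            survivors.append((value, clock))
    return survivors

v1 = ("cart-a", {"Sx": 2})
v2 = ("cart-b", {"Sx": 2, "Sy": 1})  # descends from v1
v3 = ("cart-c", {"Sx": 2, "Sz": 1})  # concurrent with v2
print(reconcile([v1, v2, v3]))       # v1 is dominated; v2 and v3 are conflicting siblings
```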
When a node becomes unresponsive, Dynamo uses the key's preference list to pick another node. This stand-in node stores the value along with a hint identifying the intended target node. As soon as the unresponsive node recovers, the stand-in node sends it the key-value pairs that need to be updated (this is known as hinted handoff).
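A tiny sketch of this hinted-handoff idea, with made-up helpers (`hints`, `node_is_up`, `deliver`) standing in for the real machinery:

```python
hints = []  # (intended_node, key, value) kept by the stand-in node

def write_with_hint(intended_node, key, value, local_store):
    """Accept a write on behalf of an unresponsive node and remember the hint."""
    local_store[key] = value
    hints.append((intended_node, key, value))

def handoff(node_is_up, deliver):
    """Periodically retry: forward hinted writes once their owner is back up."""
    remaining = []
    for intended_node, key, value in hints:
        if node_is_up(intended_node):
            deliver(intended_node, key, value)
        else:
            remaining.append((intended_node, key, value))
    hints[:] = remaining
```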
To avoid unnecessary synchronization, Dynamo maintains a Merkle tree per virtual node, which reduces the amount of data exchanged when replicas synchronize.
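As an illustration, here is a toy Merkle tree: if the root hashes of two replicas match, their key ranges are identical and nothing needs to be transferred; otherwise they descend into the differing subtrees. The structure below is my own simplification, not Dynamo's actual implementation:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(items):
    """items: sorted list of (key, value). Returns nested (hash, children) nodes."""
    leaves = [(h(f"{k}={v}".encode()), (k, v)) for k, v in items]
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            nxt.append((h(b"".join(n[0] for n in pair)), pair))
        level = nxt
    return level[0] if level else (h(b""), [])

replica_a = build_tree(sorted({"k1": "v1", "k2": "v2"}.items()))
replica_b = build_tree(sorted({"k1": "v1", "k2": "stale"}.items()))
print(replica_a[0] == replica_b[0])  # False: roots differ, so descend to find the diff
```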
When a node is down, it is removed from the system only via manual intervention, to avoid triggering expensive synchronization when the failure is merely transient.
Nodes are notified of additions and removals via a gossip protocol. When a node joins, it assigns itself multiple virtual nodes and propagates this membership change to the rest of the cluster through gossip.
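A minimal sketch of version-stamped gossip, assuming a simple `name -> (status, version)` membership view (my own simplification, not the paper's exact structure):

```python
class Member:
    def __init__(self, name):
        self.name = name
        # view: node name -> (status, version); newer versions win during gossip
        self.view = {name: ("UP", 1)}

    def gossip_with(self, peer):
        """Exchange views; for each node keep the entry with the higher version."""
        for node in set(self.view) | set(peer.view):
            mine = self.view.get(node, (None, 0))
            theirs = peer.view.get(node, (None, 0))
            newest = mine if mine[1] >= theirs[1] else theirs
            self.view[node] = newest
            peer.view[node] = newest

a, b, c = Member("A"), Member("B"), Member("C")
a.view["A"] = ("UP", 2)   # A re-announces itself, e.g. after claiming new virtual nodes
a.gossip_with(b)          # one round: B learns about A (and vice versa)
b.gossip_with(c)          # next round: C learns about both
print(sorted(c.view))     # ['A', 'B', 'C']
```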
If you liked this summary, I highly encourage you to read the paper as it contains a lot of very interesting technical details that I couldn't go into.
To finish, here is a quote from the paper's conclusion that describes Dynamo perfectly:
This paper described Dynamo, a highly available and scalable data store, used for storing state of a number of core services of Amazon.com’s e-commerce platform. Dynamo has provided the desired levels of availability and performance and has been successful in handling server failures, data center failures and network partitions. Dynamo is incrementally scalable and allows service owners to scale up and down based on their current request load. Dynamo allows service owners to customize their storage system to meet their desired performance, durability and consistency SLAs by allowing them to tune the parameters N, R, and W
I hope you liked this summary! If you have any constructive remarks, please share them with me! I am always eager to hear other opinions.