The CAP theorem explains the trade-offs between three important factors in distributed systems: Consistency (C), Availability (A), and Partition Tolerance (P).
It was first introduced by Eric Brewer in 2000, which is why it's also known as Brewer's Theorem.
What is a distributed system? (prerequisite)
A distributed system is a group of servers (computers) that work together to solve a problem or provide service. To the end user, it looks like a single system. These servers are also referred as nodes in a distributed system.
Consistency (C)
- We want our systems to be consistent; this means every read operation should return the updated result. In other words, all the nodes in the system should reflect the same data at any given time. This ensures that all users see the same data.
Availability (A)
- Our system should remain operational, even if some nodes are down or unreachable. There can be some delay or inconsistency in data, but the system should return a response for every request (read or write).
Partition Tolerance (P)
- The system should continue functioning despite some nodes becoming isolated or disconnected due to network issues. However, the system must sacrifice consistency or availability in such situations.
This theorem states that when a network partition occurs in a distributed system, you can only achieve two out of the three properties at the same time: Consistency, Availability, and Partition Tolerance.
However, in 2012, Dr. Brewer clarified that the "two out of three" concept can be somewhat misleading because system designers only need to sacrifice consistency or availability as partition management and recovery techniques exist.
Consistency + Partition Tolerance (CP)
In a CP system, consistency is prioritized over availability.
How it works:
- During a network partition, some nodes might not be able to communicate with each other. So, if a node cannot confirm the latest state with other nodes, it will reject or delay the request instead of serving potentially outdated data until it can synchronize again with other nodes. Here, the availability of the system is sacrificed by rejecting or blocking the requests. Real-world examples:
- Financial Transactions: Banks or stock markets need to prioritize data consistency, especially when handling transactions. If the system detects a partition, it will block transaction requests so that balances remain accurate.
- Inventory Systems: E-commerce platforms need to ensure no overselling occurs during inventory checks. When a partition happens, they prefer to reject the user requests rather than overselling the same product.
Availability + Partition Tolerance (AP)
In the AP system, availability is prioritized over consistency.
How it works:
- During a network partition, some nodes might not be able to communicate with each other. So even if a node cannot confirm the latest state with other nodes, it will continue to serve requests. Here, the consistency of the system is sacrificed, but it is eventually resolved when the partition is fixed and nodes are synchronized again. Real-world examples:
- Social Media Platforms: Systems like Facebook or Twitter may prioritize availability to ensure users can always post, comment, and message even if some data may be slightly out of sync.
- Content Deliver Networks (CDNs): A CDN must always be available to serve content quickly, even if some nodes are not yet synchronized with the latest version of data.
Decision Making with CAP Theorem
Scenario: Global Media Streaming Service
Product requirement: The system needs to serve video content to millions of users worldwide. It should handle network partitions without downtime, but minor inconsistencies like short delays in updating watch history are acceptable.
Decision: An AP system could be useful for this scenario since availability is critical to avoid interruptions in video streaming.
Solution: Cassandra or DynamoDB, as it was built to prioritize high availability and partition tolerance by replicating data across multiple nodes/regions.Scenario: Online multiplayer game
Product requirement: The game world must remain available at all times. Some temporary inconsistency is acceptable (e.g., slight lag or delayed updates).
Solution: Redis + Cassandra.
Redis ensures game state caching and fast real-time operations (e.g., chat, session management), while Cassandra provides highly available game data (e.g., player profiles and game events).Scenario: Parking management application
Product requirement: Manage core data like parking lots, reservations, payments, and user accounts. Also, it needs to ensure that no two customers reserve the same parking slot.
Solution: Redis + PostgreSQL.
PostgreSQL maintains Consistency and Partition Tolerance, ensuring the integrity of transactions, especially for critical operations like reservations and payments. Redis enhances Availability and Partition Tolerance, allowing real-time data retrieval (e.g., current parking availability) even during network issues.
Eventual Consistency in Real Systems
Some AP systems use eventual consistency to reconcile differences after a partition.
Example: In Amazon’s shopping cart, items added during a network partition may not appear instantly but are synchronized later.
Key takeaway: Eventual consistency allows systems to be available during partitions while still achieving correctness over time.
Beyond CAP: PACELC Theorem
While CAP focuses on handling partitions, the PACELC theorem adds nuance by addressing trade-offs during normal operations. It states that:
- If a partition occurs, a system must trade-off between Availability or Consistency (like CAP).
- Else, under normal conditions, the system must trade-off between Latency or Consistency. Example: In streaming services, low latency is prioritized, while in financial services, strong consistency is crucial. PACELC highlights the importance of balancing both partition management and latency considerations.
Trade-offs in System Design: Practical Implications
Designing distributed systems involves balancing availability, consistency, and partition tolerance based on product requirements.
- Cost: Strongly consistent systems are difficult to scale and maintain.
- User Experience: Prioritizing availability can improve uptime but might frustrate users with outdated data.
Top comments (0)