I've been learning the basics of tackling system design problems to better prepare myself for these types of questions in interviews. As a junior developer, system design interviews can be intimidating and difficult to feel adequately prepared for because of the open-ended nature and lack of experience/opportunities to put the skills into practice. Developing large-scale systems is no simple feat!
In previous posts, I've written about SQL vs. NoSQL and caching, which are small components to designing large-scale systems. Perhaps I'll make all of these posts into a series someday, titled something like "The (Very) Small Building Blocks of a System Design Interview Written For Beginners By a Beginner."
Thanks for coming to my TED talk! But I do have a topic I want to cover today: CAP Theorem. CAP stands for consistency, availability, and partition tolerance. CAP theorem is the concept that it is impossible for a distributed software system to guarantee all three properties; you can only have two of the three.
When designing your distributed data store, you will have to consider tradeoffs and the needs of your system so you can choose a data management system that will deliver on the properties your application needs most.
What's a distributed system? It's a network that stores data on more than one node (either a physical or virtual machine) at the same time. All cloud applications are distributed systems.
Clients will see the same data at the same time, regardless of which node they connect to.
To achieve this, when data is written to one node, it must be immediately replicated or forwarded to all of the other nodes in the system before the write is considered a "success".
If a client makes a request for data, they will get a response even if one or more nodes are down. All of the remaining working nodes in the system will return a valid response for any request.
This is achieved by replicating data across different servers.
A partition is when there is a message loss or partial failure between two nodes.
The system will continue to work despite the partition and can sustain any amount of network failure that doesn't result in a failure of the entire network.
To achieve this, data must be sufficiently replicated across a combination of nodes and networks to keep the system up and running through sporadic outages.
Non-relational/NoSQL databases are the best option for a distributed network application because of its horizontal scalability and distributed nature by design (can quickly scale across a vast network with a lot of interconnected nodes).
NoSQL databases can be classified by the two CAP properties they support:
CP databases will deliver on consistency and partition tolerance, sacrificing availability. A guarantee of availability cannot occur because if a partition happens between any nodes, the system will have to shut down the inconsistent node until the partition resolves.
MongoDB is a common NoSQL database that stores data as BSON (binary JSON) documents. It follows a single-master set up where only one primary node receives all the write operations, and the other nodes in the same replica set are secondary nodes that replicate the primary node's data. Typically used for big data and real-time apps running at several different locations, MongoDB follows the CP classification as it resolves network partitions by maintaining consistency at the expense of availability.
AP databases deliver on availability and partition tolerance, sacrificing consistency. A guarantee of consistency can't occur because if a partition happens, all nodes remain available but there is a chance that the affected node may deliver stale data.
Cassandra is another popular NoSQL database that stores data in a wide-column format. Without a master node (unlike Mongo), all nodes have to be available continuously, and as such it follows the AP classification. It is highly performant at the cost of consistency, though Cassandra does provide consistency eventually.
CA databases can deliver on consistency and availability, but cannot accommodate partition tolerance. CA databases are relational/SQL databases.
As with any of your system design needs, you will have to consider all of the benefits of the different database stores and tradeoffs that will work best for your problem. You should also clarify with the interviewer what type of guarantee they are looking for in the system. Talking through the tradeoffs at every stage is key in a system design interview. Understanding the basics is a great first step in getting more comfortable with system design questions.