This article is intended for software engineers with prior experience in development.
How to Approach System Design Interviews?
Think like a tech lead guiding junior engineers through how to implement your design.
What interviewers want to see:
- base-level understanding of system design fundamentals
- back-and-forth about problem constraints and parameters of your service
- well-reasoned, qualified decisions based on engineering trade-offs
- the unique direction your experience and decisions take the design
- holistic view of a system and its users
1) API
REST
- APIs are modelled around the resources in the system: each resource gets its own URL, and the HTTP verbs (GET, POST, PATCH, PUT, DELETE) express the actions on it
- Good: versioning, structured
- Bad: over-fetching, unneeded data gets returned along with what you asked for
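A minimal sketch of resource-oriented routing, using a hypothetical in-memory `/v1/orders` resource (no framework; all names and paths are illustrative):

```python
# One URL per resource; the HTTP verb selects the action on it.
# Illustrative sketch only -- a real service would use a web framework.

orders = {}   # in-memory "orders" resource, keyed by id
next_id = 1

def handle(method, path, body=None):
    """Dispatch a (verb, URL) pair to an action on the /v1/orders resource."""
    global next_id
    if path == "/v1/orders" and method == "POST":            # create
        order_id = next_id
        next_id += 1
        orders[order_id] = body
        return 201, {"id": order_id, **body}
    if path.startswith("/v1/orders/"):
        order_id = int(path.rsplit("/", 1)[1])
        if method == "GET":                                  # read
            return (200, orders[order_id]) if order_id in orders else (404, None)
        if method == "DELETE":                               # delete
            orders.pop(order_id, None)
            return 204, None
    return 405, None

status, created = handle("POST", "/v1/orders", {"item": "book"})
```

The `/v1/` prefix illustrates the versioning point above: clients pin to a version so the resource model can evolve without breaking them.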
RPC
- Call code that executes on a remote machine as if it were a local function
- APIs are thought of as actions/commands (e.g. postAnOrder(OrderDetails order))
- Good: no special syntax to be learned, space-efficient
- Bad: best kept to internal communication because of timing issues (it becomes challenging to distinguish multiple concurrent calls between machines)
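The RPC style can be sketched with a client-side proxy that makes a remote call look local. Here the "remote" service lives in-process for the sake of the sketch; a real framework would serialize the call over the network. All names are illustrative:

```python
class OrderService:
    """The 'remote' implementation that would live on another machine."""
    def post_an_order(self, item, quantity):
        return {"status": "accepted", "item": item, "quantity": quantity}

class RpcProxy:
    """Client-side stub: turns attribute access into a forwarded call."""
    def __init__(self, remote):
        self._remote = remote  # stands in for a network connection

    def __getattr__(self, method_name):
        def call(*args, **kwargs):
            # A real RPC framework would serialize method_name + args here,
            # send them over the wire, and deserialize the response.
            return getattr(self._remote, method_name)(*args, **kwargs)
        return call

client = RpcProxy(OrderService())
result = client.post_an_order("book", 2)  # reads like a local function call
```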
GraphQL
- Data is structured as a graph: vertices (entities) and edges (relationships)
- Good: ideal for customer-facing apps; you get exactly what you ask for; no more backend routing to fetch and modify information
- Bad: less friendly for generating documentation than REST; not suitable for aggregate queries
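The "you get what you ask for" idea can be sketched with a toy resolver that walks a graph of entities and returns only the requested fields. The data and query shapes below are illustrative, not real GraphQL syntax:

```python
data = {
    "user": {
        "name": "Ada",
        "email": "ada@example.com",
        "orders": {"count": 3, "last_item": "book"},  # edge to another entity
    }
}

def resolve(node, selection):
    """Walk the graph, returning only the fields the client asked for."""
    result = {}
    for field, sub in selection.items():
        value = node[field]
        result[field] = resolve(value, sub) if sub else value
    return result

# "Query": the user's name plus the order count -- email is never fetched.
query = {"user": {"name": None, "orders": {"count": None}}}
response = resolve(data, query)
```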
2) Databases (SQL vs NoSQL)
SQL
- composed of tables made up of rows and columns
- strong ACID (emphasis: strong consistency)
- support powerful queries
- bad: writes are slow due to B-Trees splitting/merging pages/blocks.
NoSQL
- nested key-value store
- multiple writes can be easily handled
- emphasis: eventual consistency
- bad: reads might be stale for a couple of seconds (due to log-structured merge-tree)
Other types
- document-type (JSON)
- columnar-type (good for queries that aggregate the same column across many rows)
- graph-type
3) Scaling (horizontal vs vertical)
Database scaling
- utilize read replicas first, then shard into separate databases. Sharding uses a hash function to distribute and retrieve entries evenly.
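Hash-based sharding can be sketched as follows (the shard count and keys are illustrative):

```python
import hashlib

# The same key always hashes to the same shard, so writes spread evenly
# and reads know exactly where to look.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate databases

def shard_for(key: str) -> int:
    # A stable hash (not Python's randomized hash()) so routing survives restarts.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
```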
Compute Scaling
divide the processing into pieces and designate each piece as a job in a queue so that multiple computers can work together in parallel.
Both approaches may introduce some latency between calls/requests.
Replicas ensure the reliability of a system by avoiding a single point of failure.
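The job-queue approach to compute scaling can be sketched with worker threads standing in for separate machines (the numbers and the "expensive" step are illustrative):

```python
import queue
import threading

jobs = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        n = jobs.get()
        if n is None:          # sentinel: no more work for this worker
            break
        results.put(n * n)     # the "expensive" piece of processing

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):            # divide the processing into 10 jobs
    jobs.put(n)
for _ in workers:              # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(10))  # sum of squares 0..9
```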
4) CAP Theorem
- In the real world, it's impossible to achieve all three at once
- one of the key fundamentals of distributed system design
Consistency
- every node in the network has access to the same data
Availability
- even if one or more nodes are down, any client making a data request receives a response
Partition Tolerance (necessary for modern systems)
- In case of a fault in a network or communication, the system will continue to work
5) Web Authentication and Basic Security
- It's all about the trade-offs between total safety and total convenience
- Authentication (JWT, session tokens/cookies) is about verifying identity, whereas authorization is about allowing actions.
- For instance, user passwords can be secured with hashing and salting.
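Hashing and salting can be sketched with a standard key-derivation function (the iteration count is an illustrative choice; real systems tune it):

```python
import hashlib
import hmac
import os

def hash_password(password: str):
    salt = os.urandom(16)                       # unique random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest                         # store both, never the password

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
```

The salt defeats precomputed rainbow tables, and the slow key-derivation function makes brute-forcing a stolen digest expensive.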
6) Load Balancers
- Used to distribute traffic across machines (and to add or remove servers in case of a failure).
- 3 common techniques: round-robin, least connections/response time, consistent hashing.
Round-Robin
- sends request to servers one by one
- can overload a server
- ideal when servers are stable and loads are random
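Round-robin can be sketched in a few lines (server names illustrative):

```python
import itertools

# Requests go to servers one by one in a fixed cycle.
servers = ["server-a", "server-b", "server-c"]
rotation = itertools.cycle(servers)

assigned = [next(rotation) for _ in range(5)]
# the first five requests land on a, b, c, then wrap back to a, b
```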
Least Connections/Response Time
- ideal when servers have similar compute power and requests have varying connection times
Consistent Hashing
- place N virtual nodes on the hash ring for each server, so that load is distributed as evenly as possible and only part of the ring is affected when a server is added or removed.
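Consistent hashing with virtual nodes can be sketched as follows (the virtual-node count and server names are illustrative):

```python
import bisect
import hashlib

VNODES = 100  # virtual nodes per server: more nodes -> smoother distribution

def _hash(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server appears VNODES times on the ring under distinct labels.
        self.ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(VNODES)
        )

    def server_for(self, key: str) -> str:
        # Route the key to the first virtual node clockwise from its hash.
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)  # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
owner = ring.server_for("user:42")
```

When a server is added or removed, only the keys that fall between its virtual nodes and their ring neighbors move; everything else stays put.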
7) Caching
- Reduces the latency of expensive computations, network calls, database queries, and asset fetching.
- Popular caching patterns: cache-aside, and write-through/write-back.
Cache-aside
- fetch from cache first, if not found, fetch from database, then cache it.
- data can become stale in the cache if there are frequent writes to the database. A time-to-live (TTL) can mitigate this.
- Checking cache first might introduce extra latency.
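Cache-aside with a TTL can be sketched as follows (the "database" and TTL value are illustrative):

```python
import time

database = {"user:1": "Ada"}   # stand-in for the real database
cache = {}                     # key -> (value, expires_at)
TTL_SECONDS = 60

def get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():   # fresh cache hit
        return entry[0]
    value = database.get(key)                   # miss or stale: go to the DB
    cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

first = get("user:1")    # misses the cache, reads the database
second = get("user:1")   # served from the cache
```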
Write-through and write-back
- Application writes data directly to the cache: asynchronously (write-back) or synchronously (write-through)
Write-back
- data goes into a queue and is written back to the database asynchronously.
Write-through
opposite of write-back. Because the workflow is synchronous, it can slow down the whole write path.
common cache-eviction strategy: Least Recently Used (LRU)
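LRU eviction can be sketched with an ordered map (the capacity is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry when the cache is full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now the most recently used
cache.put("c", 3)     # over capacity: evicts "b"
```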
8) Message Queues (Pub/Sub)
- beneficial when a spike of traffic could bring a server or a database down.
- queues can send requests to multiple servers/systems instead of clients sending the same request to multiple servers/systems.
- queues decouple the client from the server by eliminating the need to know the server address.
Common properties (based on implementations)
- guaranteed delivery
- no duplicate messages are delivered
- ensure that the order of messages is maintained
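The decoupling idea can be sketched with a toy in-memory broker that fans a message out to every subscriber of a topic; publishers never need a server address, just the topic name (all names illustrative):

```python
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(dict)   # topic -> {subscriber: deque}

    def subscribe(self, topic, subscriber):
        self.queues[topic][subscriber] = deque()

    def publish(self, topic, message):
        for q in self.queues[topic].values():
            q.append(message)             # fan out: each subscriber gets a copy

    def poll(self, topic, subscriber):
        q = self.queues[topic][subscriber]
        return q.popleft() if q else None # FIFO preserves message order

broker = Broker()
broker.subscribe("orders", "billing")
broker.subscribe("orders", "shipping")
broker.publish("orders", {"id": 1, "item": "book"})
```

Because messages wait in the queue, a traffic spike is absorbed by the broker instead of overwhelming the consumers.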
9) Indexing
- great for locating the block of data on the hard disk that holds a record, so only that block is fetched into primary memory
- can be multi-levelled
- B-tree (self-balancing; keeps pages in sorted order)
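The core idea of an index can be sketched as a lookup structure mapping a key to a record's position, so a query avoids a full scan (the record layout is illustrative):

```python
records = [                       # stand-in for a table on disk
    {"id": 10, "name": "Ada"},
    {"id": 25, "name": "Grace"},
    {"id": 42, "name": "Edsger"},
]

# Build the index once: id -> position of the record in the "table".
index = {record["id"]: pos for pos, record in enumerate(records)}

def find_by_id(record_id):
    pos = index.get(record_id)    # O(1) lookup instead of scanning every row
    return records[pos] if pos is not None else None

row = find_by_id(42)
```

A real database index (e.g. a B-tree) additionally keeps keys sorted so that range queries and multi-level lookups stay efficient as the table grows.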
10) Failover (active-passive or leader-follower)
- replicas are used to avoid a single point of failure. They also help a system serve global users across geographical regions and increase throughput.
leaders
- machine that handles write requests to the data-store
followers
- replicas of the leader that handle read requests
synchronous replication
- the leader waits for the followers to acknowledge a write before confirming it. This slows writes down, but ensures guaranteed delivery.
asynchronous replication
- opposite of synchronous replication: the leader does not wait for follower acknowledgments.
Less time-consuming, but with no guarantee of delivery.
most common types of replication systems: single-leader and multi-leader (multiple machines can handle writes, but each needs to catch up with writes on the other machines for consistency)
to resolve concurrent write conflicts:
- keep the update with the largest client timestamp (last-write-wins)
- sticky routing: writes from the same client go to the same leader
- keep all concurrent updates and return them all to the client to resolve
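The last-write-wins strategy can be sketched in a few lines (timestamps and values are illustrative):

```python
# When two replicas hold concurrent updates for the same key, keep the
# one with the largest client timestamp and discard the rest.

def resolve(updates):
    """updates: list of (timestamp, value) for one key, from all replicas."""
    return max(updates, key=lambda u: u[0])[1]

concurrent = [
    (1700000001.5, "blue"),    # written via leader A
    (1700000003.2, "green"),   # written via leader B, later client clock
]
winner = resolve(concurrent)
```

Note the trade-off: last-write-wins silently drops the losing update, and client clocks can skew, which is why some systems instead keep all versions and let the reader resolve them.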