This article is intended for software engineers with prior experience in development.
How to Approach System Design Interviews?
Think like a tech lead guiding junior engineers through how to implement your design.
What interviewers want to see:
- base-level understanding of system design fundamentals
- back-and-forth about problem constraints and parameters of your service
- well-reasoned, qualified decisions based on engineering trade-offs
- the unique direction your experience and decisions take the design
- holistic view of a system and its users
1) API
REST
- APIs are modelled around the resources in the system: each resource gets its own URL, and the HTTP verbs (GET, POST, PATCH, PUT, DELETE) express the actions on it
- Good: versioning, structured
- Bad: over-fetching, unneeded data gets returned along with what you asked for
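A minimal sketch of resource-oriented routing, using a hypothetical in-memory `/v1/orders` resource (no framework; all names and paths are illustrative):

```python
# One URL per resource; the HTTP verb selects the action on it.
# Illustrative sketch only -- a real service would use a web framework.

orders = {}   # in-memory "orders" resource, keyed by id
next_id = 1

def handle(method, path, body=None):
    """Dispatch a (verb, URL) pair to an action on the /v1/orders resource."""
    global next_id
    if path == "/v1/orders" and method == "POST":            # create
        order_id = next_id
        next_id += 1
        orders[order_id] = body
        return 201, {"id": order_id, **body}
    if path.startswith("/v1/orders/"):
        order_id = int(path.rsplit("/", 1)[1])
        if method == "GET":                                  # read
            return (200, orders[order_id]) if order_id in orders else (404, None)
        if method == "DELETE":                               # delete
            orders.pop(order_id, None)
            return 204, None
    return 405, None

status, created = handle("POST", "/v1/orders", {"item": "book"})
```

The `/v1/` prefix illustrates the versioning point above: clients pin to a version so the resource model can evolve without breaking them.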
RPC
- Call code that executes on a remote machine as if it were a local function
- APIs are thought of as actions/commands (e.g. postAnOrder(OrderDetails order))
- Good: no special syntax to be learned, space-efficient
- Bad: best kept to internal communication because of timing issues (it becomes challenging to distinguish multiple concurrent calls between machines)
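The RPC style can be sketched with a client-side proxy that makes a remote call look local. Here the "remote" service lives in-process for the sake of the sketch; a real framework would serialize the call over the network. All names are illustrative:

```python
class OrderService:
    """The 'remote' implementation that would live on another machine."""
    def post_an_order(self, item, quantity):
        return {"status": "accepted", "item": item, "quantity": quantity}

class RpcProxy:
    """Client-side stub: turns attribute access into a forwarded call."""
    def __init__(self, remote):
        self._remote = remote  # stands in for a network connection

    def __getattr__(self, method_name):
        def call(*args, **kwargs):
            # A real RPC framework would serialize method_name + args here,
            # send them over the wire, and deserialize the response.
            return getattr(self._remote, method_name)(*args, **kwargs)
        return call

client = RpcProxy(OrderService())
result = client.post_an_order("book", 2)  # reads like a local function call
```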
GraphQL
- Data is structured as a graph: vertices (entities) and edges (relationships)
- Good: ideal for customer-facing apps; you get exactly what you ask for; no more backend routing to fetch and modify information
- Bad: less friendly for generating documentation than REST; not suitable for aggregate queries
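The "you get what you ask for" idea can be sketched with a toy resolver that walks a graph of entities and returns only the requested fields. The data and query shapes below are illustrative, not real GraphQL syntax:

```python
data = {
    "user": {
        "name": "Ada",
        "email": "ada@example.com",
        "orders": {"count": 3, "last_item": "book"},  # edge to another entity
    }
}

def resolve(node, selection):
    """Walk the graph, returning only the fields the client asked for."""
    result = {}
    for field, sub in selection.items():
        value = node[field]
        result[field] = resolve(value, sub) if sub else value
    return result

# "Query": the user's name plus the order count -- email is never fetched.
query = {"user": {"name": None, "orders": {"count": None}}}
response = resolve(data, query)
```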
2) Databases (SQL vs NoSQL)
SQL
- composed of tables made up of rows and columns
- strong ACID (emphasis: strong consistency)
- support powerful queries
- bad: writes are slow due to B-Trees splitting/merging pages/blocks.
NoSQL
- nested key-value store
- multiple writes can be easily handled
- emphasis: eventual consistency
- bad: reads might be stale for a couple of seconds (due to log-structured merge-tree)
Other types
- document-type (JSON)
- columnar-type (good for queries that aggregate the same column across many rows)
- graph-type
3) Scaling (horizontal vs vertical)
Database scaling
- utilize read replicas first, then shard into separate databases. Sharding uses a hash function to distribute and retrieve entries evenly.
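Hash-based sharding can be sketched as follows (the shard count and keys are illustrative):

```python
import hashlib

# The same key always hashes to the same shard, so writes spread evenly
# and reads know exactly where to look.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate databases

def shard_for(key: str) -> int:
    # A stable hash (not Python's randomized hash()) so routing survives restarts.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
```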
Compute Scaling
divide the processing into pieces and designate each piece as a job in a queue so that multiple computers can work together in parallel.
Both approaches may introduce some latency between calls/requests.
Replicas ensure the reliability of a system by avoiding a single point of failure.
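The job-queue approach to compute scaling can be sketched with worker threads standing in for separate machines (the numbers and the "expensive" step are illustrative):

```python
import queue
import threading

jobs = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        n = jobs.get()
        if n is None:          # sentinel: no more work for this worker
            break
        results.put(n * n)     # the "expensive" piece of processing

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):            # divide the processing into 10 jobs
    jobs.put(n)
for _ in workers:              # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(10))  # sum of squares 0..9
```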
4) CAP Theorem
- In the real world, it's impossible to achieve all three at once
- one of the key fundamentals of distributed system design
Consistency
- every node in the network has access to the same data
Availability
- even if one or more nodes are down, any client making a data request receives a response
Partition Tolerance (necessary for modern systems)
- In case of a fault in a network or communication, the system will continue to work
5) Web Authentication and Basic Security
- It's all about the trade-offs between total safety and total convenience
- Authentication (JWT, session tokens/cookies) is about verifying identity, whereas authorization is about allowing actions.
- For instance, user passwords can be secured with hashing and salting.
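Hashing and salting can be sketched with a standard key-derivation function (the iteration count is an illustrative choice; real systems tune it):

```python
import hashlib
import hmac
import os

def hash_password(password: str):
    salt = os.urandom(16)                       # unique random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest                         # store both, never the password

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
```

The salt defeats precomputed rainbow tables, and the slow key-derivation function makes brute-forcing a stolen digest expensive.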
6) Load Balancers
- Used to distribute traffic across machines (and to add or remove servers in case of a failure).
- 3 common techniques: round-robin, least connections/response time, consistent hashing.
Round-Robin
- sends request to servers one by one
- can overload a server
- ideal when servers are stable and loads are random
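Round-robin can be sketched in a few lines (server names illustrative):

```python
import itertools

# Requests go to servers one by one in a fixed cycle.
servers = ["server-a", "server-b", "server-c"]
rotation = itertools.cycle(servers)

assigned = [next(rotation) for _ in range(5)]
# the first five requests land on a, b, c, then wrap back to a, b
```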
Least Connections/Response Time
- ideal when servers have similar compute power and requests have varying connection times
Consistent Hashing
- place N virtual nodes on the hash ring for each server, so that load is distributed as evenly as possible and only part of the ring is affected when a server is added or removed.
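Consistent hashing with virtual nodes can be sketched as follows (the virtual-node count and server names are illustrative):

```python
import bisect
import hashlib

VNODES = 100  # virtual nodes per server: more nodes -> smoother distribution

def _hash(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server appears VNODES times on the ring under distinct labels.
        self.ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(VNODES)
        )

    def server_for(self, key: str) -> str:
        # Route the key to the first virtual node clockwise from its hash.
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)  # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
owner = ring.server_for("user:42")
```

When a server is added or removed, only the keys that fall between its virtual nodes and their ring neighbors move; everything else stays put.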
7) Caching
- Reduces the latency of expensive computations, network calls, database queries, and asset fetching.
- Popular caching patterns: cache-aside, and write-through/write-back.
Cache-aside
- fetch from cache first, if not found, fetch from database, then cache it.
- data can become stale in the cache if there are frequent writes to the database. A time-to-live (TTL) can mitigate this.
- Checking cache first might introduce extra latency.
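Cache-aside with a TTL can be sketched as follows (the "database" and TTL value are illustrative):

```python
import time

database = {"user:1": "Ada"}   # stand-in for the real database
cache = {}                     # key -> (value, expires_at)
TTL_SECONDS = 60

def get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():   # fresh cache hit
        return entry[0]
    value = database.get(key)                   # miss or stale: go to the DB
    cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

first = get("user:1")    # misses the cache, reads the database
second = get("user:1")   # served from the cache
```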
Write-through and write-back
- Application writes data directly to the cache: asynchronously (write-back) or synchronously (write-through)
Write-back
- data goes into a queue and is written back to the database asynchronously.
Write-through
opposite of write-back. Because the workflow is synchronous, it can slow down the whole write path.
common cache-eviction strategy: Least Recently Used (LRU)
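LRU eviction can be sketched with an ordered map (the capacity is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry when the cache is full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now the most recently used
cache.put("c", 3)     # over capacity: evicts "b"
```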
8) Message Queues (Pub/Sub)
- beneficial when a spike of traffic could bring a server or a database down.
- queues can send requests to multiple servers/systems instead of clients sending the same request to multiple servers/systems.
- queues decouple the client from the server by eliminating the need to know the server address.
Common properties (based on implementations)
- guaranteed delivery
- no duplicate messages are delivered
- ensure that the order of messages is maintained
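The decoupling idea can be sketched with a toy in-memory broker that fans a message out to every subscriber of a topic; publishers never need a server address, just the topic name (all names illustrative):

```python
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(dict)   # topic -> {subscriber: deque}

    def subscribe(self, topic, subscriber):
        self.queues[topic][subscriber] = deque()

    def publish(self, topic, message):
        for q in self.queues[topic].values():
            q.append(message)             # fan out: each subscriber gets a copy

    def poll(self, topic, subscriber):
        q = self.queues[topic][subscriber]
        return q.popleft() if q else None # FIFO preserves message order

broker = Broker()
broker.subscribe("orders", "billing")
broker.subscribe("orders", "shipping")
broker.publish("orders", {"id": 1, "item": "book"})
```

Because messages wait in the queue, a traffic spike is absorbed by the broker instead of overwhelming the consumers.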
9) Indexing
- great for locating the block of data on the hard disk that holds a record, so only that block is fetched into primary memory
- can be multi-levelled
- B-tree (self-balancing; keeps pages in sorted order)
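The core idea of an index can be sketched as a lookup structure mapping a key to a record's position, so a query avoids a full scan (the record layout is illustrative):

```python
records = [                       # stand-in for a table on disk
    {"id": 10, "name": "Ada"},
    {"id": 25, "name": "Grace"},
    {"id": 42, "name": "Edsger"},
]

# Build the index once: id -> position of the record in the "table".
index = {record["id"]: pos for pos, record in enumerate(records)}

def find_by_id(record_id):
    pos = index.get(record_id)    # O(1) lookup instead of scanning every row
    return records[pos] if pos is not None else None

row = find_by_id(42)
```

A real database index (e.g. a B-tree) additionally keeps keys sorted so that range queries and multi-level lookups stay efficient as the table grows.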
10) Failover (active-passive or leader-follower)
- replicas are used to avoid a single point of failure. They also help a system serve global users across geographical regions and increase throughput.
leaders
- machine that handles write requests to the data-store
followers
- replicas of the leader that handle read requests
synchronous replication
- the leader waits for the followers to acknowledge a write before confirming it. This slows writes down, but ensures guaranteed delivery.
asynchronous replication
- opposite of synchronous replication: the leader does not wait for follower acknowledgments.
Less time-consuming, but with no guarantee of delivery.
most common types of replication systems: single-leader and multi-leader (multiple machines can handle writes, but each needs to catch up with writes on the other machines for consistency)
to resolve concurrent write conflicts:
- keep the update with the largest client timestamp (last-write-wins)
- sticky routing: writes from the same client go to the same leader
- keep all concurrent updates and return them all to the client to resolve
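The last-write-wins strategy can be sketched in a few lines (timestamps and values are illustrative):

```python
# When two replicas hold concurrent updates for the same key, keep the
# one with the largest client timestamp and discard the rest.

def resolve(updates):
    """updates: list of (timestamp, value) for one key, from all replicas."""
    return max(updates, key=lambda u: u[0])[1]

concurrent = [
    (1700000001.5, "blue"),    # written via leader A
    (1700000003.2, "green"),   # written via leader B, later client clock
]
winner = resolve(concurrent)
```

Note the trade-off: last-write-wins silently drops the losing update, and client clocks can skew, which is why some systems instead keep all versions and let the reader resolve them.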