Originally posted on my blog harrisgeo.me
This blog post aims to put everything I have been asked in past system design interviews in one place. Before we start, let's pick a topic and use it for examples so that things make more sense.
Imagine it is the Olympics and you are asked to make a system for people to see all the available games, see details of the games they like and buy tickets to go and see them.
We can create a web app to show all this information to the user. I personally feel more comfortable with React but we can use whatever we want. The web app should have AT LEAST the following pages.
- all the available games
- details of each game
- buy ticket for a game
To display all the games and allow users to buy tickets we need to store our data in a Database. We can use a SQL database, and MySQL is a good candidate to get started with. We need to design 2 tables. Let’s not talk about storing users yet.
# Games
id
name
...

# Tickets
id
gameId
price
...
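To make the two tables a bit more concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for MySQL, so it runs anywhere. Only id, name, gameId and price come from the outline above; the column types, the sample rows and the `tickets_for_game` helper are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE games (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE tickets (
        id     INTEGER PRIMARY KEY,
        gameId INTEGER NOT NULL REFERENCES games(id),
        price  INTEGER NOT NULL  -- whole currency units, to keep the sketch simple
    );
    """
)

# Hypothetical sample data, only to show how the tables relate.
conn.execute("INSERT INTO games (id, name) VALUES (1, '100m final')")
conn.execute("INSERT INTO tickets (gameId, price) VALUES (1, 120)")
conn.execute("INSERT INTO tickets (gameId, price) VALUES (1, 80)")


def tickets_for_game(game_id):
    """Return (ticket id, price) rows for one game, cheapest first."""
    rows = conn.execute(
        "SELECT id, price FROM tickets WHERE gameId = ? ORDER BY price",
        (game_id,),
    )
    return rows.fetchall()
```

The gameId column is what links a ticket back to its game, which is all the "details" and "buy" pages need at this stage.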
We can create RESTful APIs for reading / writing data into the DB. Using serverless is a good and modern approach which will also make it easier for us to optimise the performance of these APIs at a later stage. The endpoints are going to be:
- /api/games which returns all of the games
- /api/games/:id to show the specific game the user has selected
- /api/games/:id/buy if we want to buy a new ticket
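As a rough sketch of how those three endpoints could be wired up, here is a tiny dispatcher in plain Python. The HTTP methods, the status codes, the in-memory `GAMES` data and the handler names are all assumptions made for the example; a real setup would sit behind Lambda and talk to the Database.

```python
import re

# Hypothetical in-memory data standing in for the Database.
GAMES = {1: {"id": 1, "name": "100m final"}, 2: {"id": 2, "name": "Marathon"}}
SOLD = []  # game ids of tickets bought so far


def list_games():
    return 200, list(GAMES.values())


def get_game(game_id):
    game = GAMES.get(int(game_id))
    return (200, game) if game else (404, {"error": "game not found"})


def buy_ticket(game_id):
    if int(game_id) not in GAMES:
        return 404, {"error": "game not found"}
    SOLD.append(int(game_id))
    return 201, {"gameId": int(game_id)}


# (HTTP method, path pattern, handler) for the three endpoints.
ROUTES = [
    ("GET", re.compile(r"/api/games"), list_games),
    ("GET", re.compile(r"/api/games/(\d+)"), get_game),
    ("POST", re.compile(r"/api/games/(\d+)/buy"), buy_ticket),
]


def handle(method, path):
    """Dispatch to the first route whose method and full path both match."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.fullmatch(path)
        if route_method == method and match:
            return handler(*match.groups())
    return 404, {"error": "not found"}
```

Calling `handle("GET", "/api/games/2")` returns the status code and the game, while an unknown path falls through to a 404.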
It will be much easier if we use a cloud provider to host our systems. AWS is a good fit for that, as it provides us with a number of the tools that we need. We can host our environment on EC2, then use Lambda for the serverless APIs and RDS for the Database.
We will use GitHub as our version control of choice in order to store the code. GitHub can also provide us with GitHub Actions that can help us with part of the
Our app is up and running. With the current setup and given the fact that we are talking about a system that is planned to be used in the Olympics, it will crash in no time. How do we prevent that from happening and optimise our system to handle more load?
This site will receive A LOT of traffic. Sending all of it straight to a single server will overwhelm it. That is why we need to introduce a load balancer using NGINX or AWS's ELB. Load balancers are also a first line of defence against DDoS attacks.
In order for the load balancer to work, we need to add a few more instances of our web app. It will then split our traffic evenly amongst the extra instances we just added. The simplest way of doing that is called round robin: requests are handed to each available instance in turn, so over time every server receives roughly the same share. Round robin does not look at how busy each instance currently is. Strategies that send each request to the instance with the least amount of traffic are usually called least connections.
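Here is a small sketch of the two ideas side by side. Round robin simply rotates through the instances, whereas picking the instance with the least traffic is a separate strategy, commonly called least connections. The instance names and connection counts are made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out instances in a fixed rotation, ignoring their current load."""

    def __init__(self, instances):
        self._rotation = cycle(instances)

    def next_instance(self):
        return next(self._rotation)


def least_connections(open_connections):
    """Pick the instance that currently has the fewest open connections."""
    return min(open_connections, key=open_connections.get)


# Hypothetical instance names.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
picks = [balancer.next_instance() for _ in range(6)]
# picks == ["web-1", "web-2", "web-3", "web-1", "web-2", "web-3"]

busiest_aware_pick = least_connections({"web-1": 12, "web-2": 3, "web-3": 7})
# busiest_aware_pick == "web-2"
```

Round robin is cheap and works well when requests cost roughly the same. When some requests are much heavier than others, a load-aware strategy tends to spread the work more fairly.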
To optimise even further we can enable AWS's autoscaling. If we want more control we can scale vertically, where we increase the power of the hardware used in our instances. We can also scale horizontally with multiple machines. As we discussed in the load balancer section, more instances working in parallel will boost the amount of traffic the web app can handle.
Depending on our finances we can do either, or even a combination of these approaches to ensure our system can handle all these new users. At this stage we can introduce something like Kubernetes in order to orchestrate all these different instances. That will give us better control over how they all run together and how we manage them.
CDN stands for Content Delivery Network and is responsible for prefetching the
When a connection is open it means that there is a thread waiting for this request to be completed. That leads to the server being limited on what else it can do during that delay which is slowing down everything else. In order to avoid that we can make use of a CDN like
Cloudflare or AWS's
A CDN usually keeps copies of its data in different locations around the world. That makes it faster for users to load their data, because each request is served from the location that is closest to the user, which minimises latency.
Our Database is very likely to be the first server that will crash once the traffic gets out of hand. We can do several things to prevent that but the first one is to introduce a caching layer. Caching saves us from repeating the same requests over and over again. For that we can use
I personally prefer Redis as it is easy to set up and use. Redis stores data in memory (RAM). Caching pays off when users request information that has recently been requested by others or will soon be requested by even more. That way when a new user wants to see data that has already been fetched by others, the system retrieves it from Redis, which is much faster than requesting it straight from the Database server.
Think about how many times users will request to see all the available games. This is information that is not likely to change often. Now with Redis we don't even need to touch the Database server in order to access that data.
The real challenge with caching though comes when we have to update or delete the cached information. One approach is to set up our code to automatically update the cache whenever the user updates or adds information.
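That first approach can be sketched as follows: every write goes to the Database and then refreshes (or evicts) the cached copy in the same code path. Plain dicts stand in for both the Database and Redis here, and the keys and field names are hypothetical.

```python
# Plain dicts stand in for the Database and for Redis.
database = {"game:1": {"name": "100m final", "tickets_left": 500}}
cache = {}


def read_game(key):
    """Serve from the cache when possible, otherwise load and remember."""
    if key not in cache:
        cache[key] = dict(database[key])
    return cache[key]


def update_game(key, **changes):
    """Write to the Database first, then refresh the cached copy."""
    database[key].update(changes)
    cache[key] = dict(database[key])


def delete_game(key):
    """Remove the record and evict it so stale data is never served."""
    database.pop(key, None)
    cache.pop(key, None)
```

The important part is the order: the Database is the source of truth, so it is written first and the cache only mirrors it.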
Another approach is based on studies of how long users stay online on specific websites: the cache is only expired after they leave the site. In most cases, caching is very specific to a company's problems and there is no single solution that works for everyone.
However, the cache is not always going to be there for us. We will still end up touching the Database server, so optimising it is what comes next.
Having one Database server for everything is a bottleneck. We need to follow a similar approach to the parallel instances we talked about earlier. In the Database world this is referred to as replication.
We can use the master / slave replication pattern (I will refer to it as primary / secondary), which splits our Database into 2 replicas. The primary replica is only used to write data, whereas the secondary is only used to read from. To keep data in sync, the primary keeps cloning itself into the secondary.
That means that there might be a small delay (1-2 seconds) between a write being made and the new data being readable. It is common in systems to have a small gap between these 2 actions. In most cases the user is redirected to another UI or URL which covers this small delay. Even if that is not the case and users don't see the updated data instantly, they can refresh the page.
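To see where that delay comes from, here is a toy model of the pattern. Writes only touch the primary, reads only touch the secondary, and an explicit `replicate()` call plays the role of the continuous cloning. The class and key names are made up for the example, and real replication is handled by the Database itself rather than by application code.

```python
class ReplicatedStore:
    """Writes go to the primary; reads come from a secondary copy."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, key, value):
        self.primary[key] = value  # the secondary has not seen this yet

    def read(self, key):
        return self.secondary.get(key)

    def replicate(self):
        """Copy the primary's state to the secondary (normally continuous)."""
        self.secondary = dict(self.primary)


store = ReplicatedStore()
store.write("ticket:42", "sold")
stale = store.read("ticket:42")  # None: replication has not run yet
store.replicate()
fresh = store.read("ticket:42")  # "sold"
```

The gap between `stale` and `fresh` is exactly the short window in which a user might not see their own update.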
In that pattern it is common to have more than one server in the secondary replica set so that read traffic is split equally amongst them. If the traffic gets even higher we can replicate the primary as well. We can optimise even further with sharding.
In Databases it is common to see certain tables being much busier than others. To optimise that we can apply sharding to our primary Database, which splits the Database into smaller components spread amongst different servers. These components are called shards and are faster and easier to manage.
We can apply our sharding vertically, where we just separate the tables into different instances. Otherwise we can do it horizontally, where we split the rows of the required tables. A common way of sharding horizontally is to take the id of the user and mod it by the number of machines. By doing that we end up with the number of the machine the user will be stored in.
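The id-modulo rule described above is short enough to sketch directly. The function name and the choice of 4 machines are assumptions for the example.

```python
def shard_for(user_id, shard_count):
    """Map a numeric user id to a shard index in the range 0..shard_count-1."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return user_id % shard_count


# With 4 machines, users 1..8 land on shards 1, 2, 3, 0, 1, 2, 3, 0.
placement = [shard_for(user_id, 4) for user_id in range(1, 9)]
```

One thing to keep in mind with this rule is that changing the number of machines changes the result for most ids, so adding a shard means moving a lot of existing rows around.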
We need to ensure that our code is not going to break the rest of the system when we deploy it. To do that we can first introduce several levels of testing.
Unit tests are good for testing small pieces of code like functions, utils and React components. The next level is integration tests, which test bigger parts like whether forms are submitted. The final level is end to end tests, which test entire journeys like whether the user can fully register and then log in.
We can also have contract tests for making sure the client side API sends and receives the appropriate schema to and from the server side API. In case either side changes, the contract test will let us know about it so that we can take action. Finally we can have smoke tests, which are quick checks to make sure that nothing major has been broken.
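As an illustration of the contract test idea, here is a bare-bones sketch that checks a response against the fields and types the client expects. The contract itself, the helper name and the sample response are hypothetical; in practice a dedicated contract testing or schema validation tool would do this job.

```python
# Hypothetical contract: the fields and types the client expects for a game.
GAME_CONTRACT = {"id": int, "name": str}


def contract_violations(payload, contract):
    """Return a list of readable problems; an empty list means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def test_game_response_matches_contract():
    response = {"id": 1, "name": "100m final"}  # would come from the API
    assert contract_violations(response, GAME_CONTRACT) == []
```

If the server side renames `name` or starts sending `id` as a string, this test fails before the change reaches the client.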
The good thing about automated tests is that if even a single one fails, it will prevent us from moving on. On the other hand, if the test suite is not set up properly this can cause delays, so showing some love to the tests is really important for us developers.
CLI tools can prove really useful for catching the small details that can cause our code to fail at a later stage in the pipeline. Such tools can be commit message linters and, most importantly, checks that expect all tests to pass.
A combination of these scripts can run when we are about to commit or push our code, and if any of them fails, that should prevent the action from happening. CLI tools can be valuable when it comes to gatekeeping our code. They can ensure that new feature candidates meet the required quality standards before users get that code in production.
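A commit message linter can be as simple as the sketch below. The convention it enforces (a `<type>: <description>` subject of at most 72 characters, with the listed types) is an assumption picked for the example, not a rule from this post.

```python
import re

# Assumed convention: "<type>: <description>", subject of at most 72 characters.
COMMIT_PATTERN = re.compile(r"(feat|fix|docs|test|chore): \S.*")


def lint_commit_message(message):
    """Return a list of problems with the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not COMMIT_PATTERN.fullmatch(subject):
        problems.append("subject must look like '<type>: <description>'")
    if len(subject) > 72:
        problems.append("subject is longer than 72 characters")
    return problems
```

A commit hook would run a check like this and abort the commit whenever the returned list is not empty.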
This is the stage that will allow us to make good use of everything that we mentioned in CLI tools. Our deployment pipeline usually consists of running our tests and further checks in several preproduction stages, aka environments. Only if the code is verified and running on these environments do we ship it to the final stage, which is the production environment, and make a new release.
Depending on how the organisation works, deploying to the final stage either has to be approved by a senior member of the team, which is Continuous delivery, or is automated even further so that the code is deployed to production automatically, which is Continuous deployment. That minimises the chances of errors even further.
Continuous deployment makes releases "less risky" as we deploy more often (usually a few times per day). If we think about it, introducing a small chunk of code like a bug fix is not that impactful to the system. At least not as much as releasing once every one or two weeks, which introduces a lot more code at once and makes things more complex to verify.
Our systems should be designed to handle failure. Our APIs can fail for a number of reasons which sometimes are unknown and not directly related to bad code. We need to have ways of preventing white pages or ugly web framework error pages from being shown to our users.
There are a couple of things we can do to prevent these from happening. One way is to introduce API retries, which in many cases solve the problem without anyone noticing that something went wrong. If that does not work we need to have custom made error pages that inform the user that something went wrong and that they need to either try again shortly or contact the company’s support. We have all seen and love GitHub's "This is not the web page you are looking for" error page. The users will feel that we have things under control (even if we do not 🔥).
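Here is a minimal sketch of the retry idea with exponential backoff. The number of attempts, the delays and the `flaky_fetch_games` function are assumptions for the example; `sleep` is passed in so the waiting can be switched off when trying it out.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller show an error page
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky API call that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch_games():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["100m final", "Marathon"]


games = call_with_retries(flaky_fetch_games, sleep=lambda seconds: None)
```

The user only sees the games list. Only when every attempt fails does the exception bubble up, and that is the point where the custom error page takes over.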
No matter how much we test our code, errors can still occur. The first thing to be aware of is whether our servers are out of service. Healthchecks are the first thing to check in order to find out if any of our servers is down.
Healthchecks can be ping endpoints which are very light and simple for the server to process. My favourite one is the Ping? Pong! approach in which we have a /ping endpoint. After calling it, it will respond with pong, which tells us that the server is up and running. We can also calculate the time between the request and the response to get a rough idea of how healthy the server is.
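The Ping? Pong! check, including the timing part, could be sketched like this. In a real system `call_ping` would make an HTTP request to /ping; here it is just a function so the example stays self-contained, and the half second threshold is an arbitrary assumption.

```python
import time


def ping():
    """The /ping handler: does no work beyond proving the server is alive."""
    return "pong"


def check_health(call_ping, slow_after_seconds=0.5):
    """Call a ping function and report whether it answered, and how fast."""
    started = time.perf_counter()
    try:
        healthy = call_ping() == "pong"
    except Exception:
        healthy = False
    elapsed = time.perf_counter() - started
    return {"healthy": healthy, "slow": elapsed > slow_after_seconds, "seconds": elapsed}
```

A server that answers but takes longer and longer to do so is often an early warning sign, which is why the elapsed time is returned alongside the simple up or down flag.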
This is where we talk about metrics and issues that occur in bigger scale systems. Our system is currently quite complex and it won't come as a surprise if users are facing several errors, or even if a service is down and we only find out about it after a few hours.
We need tools that will help us identify that something has gone wrong as soon as it happens and also to know what this something is. A common pattern is the LMA stack, which stands for Logging, Monitoring and Alerting.
Logs are a major help when it comes to finding where errors originate from. I am a big fan of adding a trail of info logs, which makes it easier for us to spot where the error occurred. We can follow the trail until the point where it stops, which gives us a more accurate idea of where the error probably originated.
A common issue in systems is that errors occur and even after reading what the error message says, we still have no idea what that is about. It is the equivalent of a developer asking for help by saying that x doesn't work without providing more context. 😂
Detailed errors like descriptive error messages are our allies. When we see an error we should instantly at least understand what it is about, and such messages really point us in the right direction.
HTTP error codes are another approach, which gives us an instant idea about the nature of the error without even looking at the error message. If the error is something the client "did wrong" then the error code should be a 4xx. On the other hand if the error is something that went wrong on the server then it should be a 5xx.
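The rule of thumb is that 4xx codes point at the client and 5xx codes point at the server. A tiny helper makes that explicit; the function name and the labels it returns are made up for the example.

```python
def blame_for(status_code):
    """Classify an HTTP status code by where the error most likely lies."""
    if 400 <= status_code <= 499:
        return "client"
    if 500 <= status_code <= 599:
        return "server"
    return "no error"
```

So a 404 for a game that does not exist is on the client side of the line, while a 503 from an overloaded ticket service is on ours.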
Another issue that may arise here is that different systems may store their logs in different locations. It can be quite painful to have to search 10 different locations to find our logs because the 10 services we currently have up and running have different setups. Tools like AWS CloudWatch are great for gathering all the logs in one place and saving ourselves from further headaches.
Using a combination of the logging points made above will definitely help us troubleshoot and assist us with finding the error "without much pain". However, the problem is that logging is useful when we know that something went wrong. How do we find out about what went wrong in the first place?
Alerting is a mechanism that informs us every time our system throws an error. That can be anything, and this is probably the only way that we can detect edge case errors, e.g. an error that was introduced in the latest version of Opera and is interfering with one of our pages on a second generation iPad.
While it is not always possible to instantly fix such errors, it is very important to be aware of them. Alerting can easily become too noisy, which makes it difficult for the engineer who is on call to know what they should pay attention to. Tools like DataDog are commonly used for that.
So the real challenge here is to be able to tell which errors are "worth investigating". Issues like availability (SLIs), which relate to what the user experiences, should be given attention. When a message is sent to the engineer, they can go to the monitoring tools for more information.
One important thing to be aware of is that 100% availability of a system is possible in theory but really really expensive. Even big companies out there have 97% or even less availability. What if there is planned maintenance? This is a big topic, so maybe we can dig deeper into it in a different blog post.
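To get a feel for what those percentages mean in practice, we can turn availability into allowed downtime per year. This is simple arithmetic; the list of targets below is just a set of commonly quoted values chosen for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (97, 99, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

At 97% that works out to roughly 263 hours a year, which is close to 11 days, while 99.9% leaves a little under 9 hours. Every extra nine shrinks the budget by a factor of ten, and that is where the cost comes from.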
Server metrics like network activity and disk space are sent to our monitoring tool of choice. Monitoring visualisation tools like Grafana and other dashboards can help us see all that information in graphs, which are easier and faster to read when we want to find out what is going on.
In many cases monitoring tools will provide CLI commands which automatically search for common causes, and as long as they are able to take action by themselves they will save us from waking up a human to go and fix it.
There are sooo many concepts to go through that it is almost impossible to cover all of them in one blog post! I even believe that there are many more whose existence I'm not even aware of today. Hopefully this guide gives you a clearer understanding of the big picture behind designing systems.
The thing I learned the hard way in interviews is that giving an overview of my approach and then asking which topic to go deeper into works quite well, as it makes the conversation more interactive.
At least it works better than going deep into topics from the very beginning without explaining what our whole plan is. In general, being involved in architecture or full stack development can really complicate things.
This blog post originally started as notes to check before interviews but soon enough I realised that this information is worth sharing with the community to help others. Who knows? Maybe there is going to be a part 2 as there are quite a few topics we did not talk about.