Originally posted on my blog harrisgeo.me
Before we begin
This blog post aims to put everything I have been asked in past system design interviews in one place. Before we dive in, let's grab a topic and use it for examples so that things make more sense.
Enjoy!
The Question!!!
It's the Olympics and you are asked to build a system for people to see all the available games, see details of the games they like, and buy tickets to go and see them.
Answer Part 1: the overview
Web App
We can create a web app to show all this information to the user. I personally feel more comfortable with React but we can use whatever we want. The web app should have AT LEAST the following pages:
- all the available games: /games
- details of each game: /games/{gameId}
- buy a ticket for a game: /games/{gameId}/buy
Database
To display all the games and allow users to buy tickets we need to store our data in a Database. We can use a SQL database; MySQL is a good candidate to get started with. We need to design 2 tables. Let's not talk about storing users yet.
# Games
id
name
...
# Tickets
id
gameId
price
...
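To make the shape of that data a bit more concrete, here is a minimal TypeScript sketch of the two tables. The extra columns hinted at in the comments are illustrative assumptions, not part of the design above.

```typescript
// Hypothetical row shapes for the two tables sketched above
interface Game {
  id: number;
  name: string;
  // ...any other columns we decide on later, e.g. venue or startTime
}

interface Ticket {
  id: number;
  gameId: number; // references Game.id
  price: number;
  // ...any other columns, e.g. seat or purchasedAt
}
```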
API
We can create RESTful APIs for reading / writing data into the DB. Using serverless is a good and modern approach which will also make it easier for us to optimise the performance of these APIs at a later stage. The endpoints are going to be:
- GET /api/games which returns all of the games
- GET /api/games/:id to show the specific game the user has selected
- POST /api/games/:id/buy if we want to buy a new ticket
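As a rough idea of what one of these serverless endpoints could look like, here is a minimal sketch of the GET /api/games handler as a Lambda function. The `db` helper is a hypothetical stand-in for whatever data-access layer we end up building.

```typescript
import type { APIGatewayProxyHandler } from "aws-lambda";
import { db } from "./db"; // hypothetical helper wrapping our MySQL connection

// GET /api/games: return every game so the web app can list them
export const getGames: APIGatewayProxyHandler = async () => {
  const games = await db.query("SELECT id, name FROM games");
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(games),
  };
};
```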
Hosting
It will be much easier if we use a cloud provider to host our systems. AWS is perfect for that, as it provides us with a number of tools that we need. We can host our environment on EC2, then use Lambda for the serverless APIs and RDS for the Database.
Version control
We will use GitHub as our version control platform of choice in order to store the code. GitHub also provides us with GitHub Actions, which can help us with part of the CI/CD process.
Our app is up and running. However, given that we are talking about a system meant to be used during the Olympics, with the current setup it will crash in no time. How do we prevent that from happening and optimise our system to handle more load?
Answer Part 2: the optimisation
Load balancer
This site will receive A LOT of traffic, and hitting a single server directly will quickly overwhelm it. That is why we need to introduce a load balancer using NGINX or AWS's ELB. A load balancer is also the first line of defence against DDoS attacks.
In order for the load balancer to work, we need to add a few more instances of our web app. The load balancer will then split our traffic evenly amongst those instances. The simplest way of doing this is called round robin: each new request is sent to the next instance in the list, cycling back to the first one, so that traffic ends up spread evenly across all of our servers. (Other strategies, such as least connections, instead pick the instance that is currently handling the least traffic.)
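As a rough sketch of the round robin idea (real load balancers like NGINX or ELB do this for us; the addresses below are placeholders):

```typescript
// Cycle through the available instances in order, one request at a time
const instances = ["10.0.0.1:3000", "10.0.0.2:3000", "10.0.0.3:3000"]; // placeholder addresses
let next = 0;

function pickInstance(): string {
  const instance = instances[next];
  next = (next + 1) % instances.length; // wrap back around to the first instance
  return instance;
}
```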
Scaling instances
To optimise even further we can enable AWS's autoscaling. If we want more control we can scale vertically, where we increase the power of the hardware used in our instances, or scale horizontally, by adding multiple machines. As we discussed in the load balancer section, more instances working in parallel will boost the amount of traffic the web app can handle.
Depending on our finances we can do either, or even a combination of these approaches, to ensure our system can handle all these new users. At this stage we can introduce something like Kubernetes in order to orchestrate all these different instances. That will give us better control over how they all run together and make them easier to manage.
Use a CDN
CDN stands for Content Delivery Network, and it is responsible for serving the static assets of our web app. Such assets can be images, fonts, CSS files, JavaScript files and more. Serving these files ourselves leads to connections staying open for long periods of time.
When a connection is open, a thread is waiting for that request to complete. That limits what else the server can do in the meantime, which slows everything else down. To avoid that we can make use of a CDN like Cloudflare or AWS's CloudFront.
A CDN usually keeps copies of its data in different locations around the world. Each user is then served from the edge location closest to their geographic location, which minimises latency and makes the site load faster.
Database
Caching
Our Database is very likely to be the first server that will crash once the traffic gets out of hand. We can do several things to prevent that, but the first one is to introduce a caching layer. Caching saves us from having to repeat the same requests over and over again. For that we can use Redis, Memcached or Cassandra.
I personally prefer Redis as it is easy to set up and use. Redis stores data in memory (RAM). Caching pays off when users request information that has recently been requested by others, or that will soon be requested by even more. That way, when a new user wants to see data that has already been fetched by others, the system retrieves it from Redis, which is much faster than requesting it straight from the Database server.
Think about how many times users will request to see all the available games. This is information that is not likely to change often. With Redis we don't even need to touch the Database server in order to access that data.
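Here is a minimal cache-aside sketch for that games list using the node-redis client. The `fetchGamesFromDb` helper and the 60-second expiry are assumptions for illustration, not a recommendation.

```typescript
import { createClient } from "redis";

const cache = createClient();
await cache.connect();

// Hypothetical stand-in for a real SQL query against the games table
async function fetchGamesFromDb(): Promise<{ id: number; name: string }[]> {
  return [{ id: 1, name: "100m final" }];
}

async function getGames() {
  const cached = await cache.get("games:all");
  if (cached) return JSON.parse(cached); // cache hit: the Database is never touched

  const games = await fetchGamesFromDb(); // cache miss: hit the Database once
  await cache.set("games:all", JSON.stringify(games), { EX: 60 }); // keep it for 60 seconds
  return games;
}
```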
The real challenge with caching, though, comes when we have to update or delete the cached information. One approach is to set up our code to automatically update the cache once the user updates or adds information. Another approach is based on studying how long users stay online on specific websites, and only expiring their cache after they have left the site. In most cases, caching is very specific to a company's problems and there is no "single solution that works for all" way to deal with it.
However, the cache is not always going to be there for us. We will indeed end up hitting the Database server, so optimising it is what comes next.
Replication
Having one Database server handle everything is asking for trouble. We need to follow a similar approach to what we talked about earlier with parallel instances. In the Database world this is referred to as replication.
We can use the master / slave replication pattern (also referred to as primary / secondary), which splits our Database into two replicas. The primary replica is the only one we write data to, whereas the secondary is only used to read from. To keep data in sync, the primary keeps cloning itself into the secondary.
That means there might be a small delay (1-2 seconds) after the write is made before the data appears on the secondary. A small gap between these two actions is common in systems. In most cases the user is redirected to another UI or URL, which covers this small delay. Even if that is not the case and users don't see the updated data instantly, they can simply refresh the page.
In this pattern it is common to have more than one server in the secondary replica so that read traffic is split equally amongst them. If traffic grows even higher we can also replicate the primary. We can optimise even further with sharding.
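As a rough sketch of what that read / write split looks like in application code, assuming one primary and one read replica (the hostnames below are placeholders):

```typescript
import mysql from "mysql2/promise";

// Two separate connection pools: one for the primary, one for the replica
const primary = mysql.createPool({ host: "primary.db.internal", database: "olympics" });
const replica = mysql.createPool({ host: "replica.db.internal", database: "olympics" });

// Writes always go to the primary...
async function buyTicket(gameId: number, price: number) {
  await primary.execute("INSERT INTO tickets (gameId, price) VALUES (?, ?)", [gameId, price]);
}

// ...while reads are served by the replica, which may lag a second or two behind
async function getTickets(gameId: number) {
  const [rows] = await replica.execute("SELECT * FROM tickets WHERE gameId = ?", [gameId]);
  return rows;
}
```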
Sharding
In Databases it is common to see certain tables being much busier than others. To optimise for that we can apply sharding to our primary database, which splits the Database into smaller components spread amongst different servers. These components are also called shards and are faster and easier to manage.
We can shard vertically, where we separate the tables onto different instances, or horizontally, where we split the rows of the busy tables across instances. A common way of sharding horizontally is to take the id of the user and mod it by the number of machines. The result is the number of the machine that user's data will be stored on.
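A minimal sketch of that mod-based shard selection (the shard hostnames are placeholder names):

```typescript
const shards = ["shard-0.db.internal", "shard-1.db.internal", "shard-2.db.internal"];

// The user's id modulo the number of shards tells us which machine holds their data
function shardFor(userId: number): string {
  return shards[userId % shards.length];
}

// e.g. with 3 shards, user 42 lands on shard-0 because 42 % 3 === 0
```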
Quality control
Testing
We need to ensure that our code is not going to break the rest of the system when we deploy it. To do that we can first introduce several levels of testing. Unit tests are good for testing small pieces of code like functions, utils and React components. The next level is integration tests, which test bigger parts, such as whether forms are submitted correctly. The final level is end to end tests, which test entire journeys, like whether the user can fully register and then log in.
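To give a taste of the lowest level, here is a minimal unit test in Jest syntax for a hypothetical price helper (the helper is made up for illustration):

```typescript
// Hypothetical helper: total cost of buying several tickets for a game
function totalPrice(ticketPrice: number, quantity: number): number {
  return ticketPrice * quantity;
}

test("calculates the total price for several tickets", () => {
  expect(totalPrice(50, 3)).toBe(150);
});
```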
We can also have contract tests for making sure the client-side API sends and receives the appropriate schema to and from the server-side API. In case either side changes, the contract test will let us know about it so that we can take action. Finally we can have smoke tests, which are quick checks to make sure that nothing major has been broken.
The good thing about automated tests is that even if a single one fails, they will prevent us from moving on. On the other hand, if the test suite is not set up properly this can cause delays, so showing some love to the tests is really important for us developers.
CLI tools
CLI tools can prove really useful for catching the small details that would cause our code to fail at a later stage in the pipeline. Such tools can be linters, build tools, prettier, pre-commit hooks, commit message linters and, most importantly, expecting all tests to pass.
A combination of these scripts can run when we are about to commit or push our code, and if any of them fails, that should prevent the action from happening. CLI tools can be valuable when it comes to gatekeeping our code. They can ensure that new feature candidates meet the required quality standards before that code reaches users in production.
CI / CD
This is the stage that allows us to make good use of everything we mentioned in testing and CLI tools. Our deployment pipeline usually consists of running our tests and further checks in several preproduction stages, aka environments. Only once the code is verified and running on these environments do we ship it to the final stage, the production environment, and make a new release.
Depending on how organisations work, deploying to the final stage either has to be approved by a senior member of the team (continuous delivery) or is automated even further so that changes are deployed to production automatically as well (continuous deployment). That minimises the chances of errors even further.
Continuous deployment makes releases "less risky" as we deploy more often (usually a few times per day). If we think about it, introducing a small chunk of code, like a bug fix, is not that impactful to the system. At least not as much as releasing once every one or two weeks, which introduces a lot more code to push and makes things more complex to verify.
API failures
Our systems should be designed to handle failure. Our APIs can fail for a number of reasons, which are sometimes unknown and not directly related to bad code. We need ways to prevent white pages or ugly web framework error pages from being shown to our users.
There are a couple of things we can do to prevent these from happening. One way is to introduce API retries, which in many cases solve the problem without anyone noticing that something went wrong. If that does not work, we need custom-made error pages that inform the user that something went wrong and that they should either try again shortly or contact the company's support. We have all seen and love GitHub's "this is not the web page you're looking for" error page. It makes users feel that we have things under control (even if we do not).
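Here is a minimal retry sketch on the client side, assuming the endpoint is safe to call again (idempotent); the attempt count and delays are arbitrary.

```typescript
async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res; // success: nobody noticed anything went wrong
    } catch {
      // network error: fall through and try again
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, 500 * (i + 1))); // back off a little
    }
  }
  throw new Error(`Request to ${url} failed after ${attempts} attempts`);
}
```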
Healthchecks
No matter how much we test our code, errors can still occur. The first thing to be aware of is whether our servers are out of service. Healthchecks are the first thing to check in order to find out if any of our servers are down.
Healthchecks can be ping endpoints, which are very light and simple for the server to process. My favourite one is the Ping? Pong! approach, in which we have a /ping endpoint. After calling it, it responds with pong, and if the server responded then we know it is up and running. We can also calculate the time it took between the request and the response to get a brief idea of how healthy the server is.
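A minimal /ping endpoint sketch using Express (any framework would do):

```typescript
import express from "express";

const app = express();

// Very light endpoint: if it answers with "pong", the server is up
app.get("/ping", (_req, res) => {
  res.send("pong");
});

app.listen(3000);
```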
Answer Part 3: monitoring tools
This is where we talk about metrics and issues that occur in bigger scale systems. Our system is by now quite complex, and it won't come as a surprise if users are facing several errors, or a service is even down, and we only find out about it after a few hours.
We need tools that will help us identify that something has gone wrong as soon as it happens, and also tell us what that something is. A common pattern is the LMA stack, which stands for Logging, Monitoring and Alerting.
Logging
Having logs is a major help when it comes to finding where errors originate from. I am a big fan of adding a trail of info logs, which makes it easier for us to spot where an error occurred. We can follow the trail until the point where it stops, which gives us a more accurate idea of where the error probably originated from.
A common issue in systems is that errors occur and, even after reading what the error message says, we still have no idea what it is about. It is the equivalent of a developer asking for help by saying that "x doesn't work" without providing more context.
Detailed, descriptive error messages are our allies. When we see an error we should instantly understand at least what it is about, and such messages really point us in the right direction.
Using HTTP error codes is another approach which gives us an instant idea about the nature of the error without even looking at the error message. If the error is something the client "did wrong" then the code should be a 4xx. On the other hand, if something went wrong on the server then it should be a 5xx.
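Here is a minimal Express-style sketch of pairing descriptive messages with the right status codes; the `db` helper is a hypothetical stand-in for our data-access layer.

```typescript
import express from "express";

const app = express();

// Hypothetical data-access helper
const db = {
  async findGame(id: number) {
    return id === 1 ? { id, name: "100m final" } : null;
  },
};

app.get("/api/games/:id", async (req, res) => {
  const gameId = Number(req.params.id);
  if (Number.isNaN(gameId)) {
    // The client sent something wrong -> 4xx with a descriptive message
    return res.status(400).json({ error: `Invalid game id "${req.params.id}"` });
  }
  try {
    const game = await db.findGame(gameId);
    if (!game) return res.status(404).json({ error: `Game ${gameId} not found` });
    return res.json(game);
  } catch (err) {
    console.error("GET /api/games/:id failed", err); // log with context for later debugging
    // Something went wrong on our side -> 5xx
    return res.status(500).json({ error: "Something went wrong, please try again shortly" });
  }
});
```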
Another issue that may arise here is that different systems may store their logs in different locations. It can be quite painful to search 10 different places to find our logs because the 10 services we currently have up and running all have different setups. Tools like Datadog or AWS CloudWatch are perfect for gathering all the logs in one place and saving ourselves from further headaches.
Using a combination of the logging points made above will definitely help us troubleshoot and find the error "without much pain". However, logging is only useful once we know that something went wrong. How do we find out that something went wrong in the first place?
Alerting
Alerting is a mechanism that informs us every time our system throws an error. That can be anything, and it is probably the only way we can detect edge case errors, e.g. an error introduced in the latest version of Opera that interferes with one of our pages on a second generation iPad.
While it is not always possible to fix such errors instantly, it is very important to be aware of them. Alerting can easily become too noisy, which makes it difficult for the engineer who is on call to know what they should pay attention to. Tools like PagerDuty or Datadog are commonly used for that.
So the real challenge here is to be able to monitor which errors are "worth investigating". Issues like latency and availability (SLIs) are directly related to what the user experiences and should be given attention. When a message is sent to the engineer, they can go to the monitoring tools for more information.
One important thing to be aware of is that 100% availability of a system is possible but really, really expensive. Even the big companies out there fall short of it. And what about planned maintenance? This is a big topic; maybe we can dig deeper into it in a different blog post.
Monitoring visualization tools
Server metrics like CPU, memory, network activity and disk space are sent to our monitoring tool of choice. Monitoring visualisation tools like Grafana and other dashboards help us see all that information in graphs that are easier and faster to read, so we can find out what is going on.
In many cases monitoring tools will also provide commands that automatically search for common causes and, as long as they can take action by themselves, save us from waking up a human to go and fix it.
Are we done yet?
There are sooo many concepts to go through that it is almost impossible to cover all of them in one blog post! I am sure there are many more whose existence I am not even aware of today. Still, this guide will definitely give you a much clearer understanding of the big picture behind designing systems.
The thing I learned the hard way in interviews is that giving an overview of my approach and then asking which topic to go deeper into works quite well, as it makes the conversation more interactive. It is certainly better than going deep into topics from the very beginning without explaining what the whole plan is. In general, getting involved in architecture or full stack development can really complicate things.
This blog post originally started as notes to go through before interviews, but soon enough I realised that this information was worth sharing with the community to help others. Who knows? Maybe there will be a part 2, as there are quite a few topics we did not talk about.