Harris Geo 👨🏻‍💻

Posted on Aug 24, 2020

The system design interview: An introduction to system architecture

#interview #systemdesign #career #growth

Originally posted on my blog harrisgeo.me

Before we begin

This blogpost is aimed to put all of the things I have been asked in past system design interviews in one place. Before we start talking about it, let's grab a topic and use it to give some examples so that things make more sense.

Enjoy 🍿

The Question!!!

It's the olympics and you are asked to make a system for people to see all the available games, see details of the games they like and buy tickets to go and see them.

Answer Part 1 the overview

Web App

We can create a web app to show all this information to the user. I personally feel more comfortable with React but we can use whatever we want. The web app should have AT LEAST the following pages.

all the available games /games
details of each game /games/{gameId}
buy ticket for a game /games/{gameId}/buy

Database

To display all the games and allow users to buy tickets we need to store our data into a Database. We can use a SQL database. MySQL is a good candidate to get started with. We need to design 2 tables. Let’s not talk about storing users yet.

# GAMES
id
name
...

# Tickets
id
gameId
price
...

API

We can create RESTful apis for reading / writing data into the DB. Using serverless is a good and modern approach which will also make it easier for us to optimise the performance of these apis at a later stage. The endpoints are going to be

GET /api/games which returns all of the games
GET /api/games/:id to show the specific game the user has selected
POST /api/games/:id/buy if we want to buy a new ticket

Hosting

It will be much easier if we use a cloud provider to host our systems. AWS is perfect for that, as it provides us with a number of tools that we need to use. We can host our environment on EC2, then use Lambda for the serverless apis and RDS for the Database.

Version control

We will use Github as our version control of choice in order to store the code. Github can also provide us with github actions that can help us with part of the CI/CD process.

Our app is up and running. With the current setup and given the fact that we are talking about a system that is planned to be used in the Olympics, it will crash in no time. How do we prevent that from happening and optimise our system to handle more load?

Part 2 the optimisation

Load balancer

This site will receive A LOT of traffic. Hitting the server directly will be an overkill. That is why we need to introduce a load balancer using NGINX or AWS's ELB . Using load balancers is also the first row of defence against DDOS attacks.

In order for the load balancer to work, we need to add a few more instances of our web app. It will then split our traffic evenly amongst the extra instances we just added. This approach is called round robin. Round robin is used on the fly to determine which one of the available instances has the least amount of traffic so that it is split evenly amongst all of our servers.

Scaling instances

To optimise even further we can enable AWS's autoscaling. If we want more control we can scale vertically where we increase the power of the hardware used in our instances. We can also scale horizontally with multiple machines. As we talked in the load balancer section, more instances that work in parallel will boost amount of traffic the web app can handle.

Depending on our finances we can do either, or even a combination of these approaches to ensure our system can handle all these new users. At this stage we can introduce something like Kubernetes in order to orchestrate all these different instances. That will gives us better control of how they all run together and manage them.

Use a CDN

CND stands for Content Delivery Network and is responsible for prefetching the static assets of our web app. Such assets can be images, fonts, css files, Javascript files and more. Using these files leads to having connections open for long periods of time.

When a connection is open it means that there is a thread waiting for this request to be completed. That leads to the server being limited on what else it can do during that delay which is slowing down everything else. In order to avoid that we can make use of a CDN like Cloudflare or AWS's CloudFront.

A CDN usually keeps copies of its data in different locations around the world. That makes it faster for users to load their data. Another benefit of a CDN is that data is taken from the availability zone that is closer to their geographic location which minimises latency.

Database

Caching

Our Database is very likely to be the first server that will crash once the traffic gets out of hand. We can do several things to prevent that but the first one is to introduce a caching layer. Caching helps with not having to repeat the same requests over and over again. For that we can use Redis, Memcache or Cassandra.

I personally prefer Redis as it is easy to setup and use. Redis will store data in memory (RAM). Caching usually occurs when users request information that has already recently been requested by others or will soon be requested by even more. That way when a new user wants to see some data that have already been fetched by others, the system will retrieve this information from Redis which will be quite faster than requesting them straight from the Database server.

Think about how many times users will request to see all the available games. This is information that is not likely to change often. Now with Redis we don't even need to touch the Database server in order to access that data.

The real challenge with caching though comes when we have to update or delete the cached information. One approach used for that is to setup our code to automatically update the cache once the user updates / adds information.

Another approach is based on studies of how long the users stay online on specific websites. After they leave the site to only then expire their cache. In most cases, caching is very specific to a company's problems and there's no "single solution that works for all" way to deal with it.

However, cache is not always going to be there for us. We would indeed end up touching the Database server, so optimising it is what is coming next.

Replication

Having one Database server for everything is an overkill. We need to somehow follow a similar approach as what we talked about earlier with parallel instances. In the Database world this is referred to as replication.

We can use the master / slave replication pattern (refer to it primary / secondary) which will split our Database into 2 replications. The primary replication will only be used to write data into it, whereas the secondary only to read from it. To keep data in sync, the primary replication keeps cloning itself into the secondary.

That means that there might be a small delay (1-2 seconds) after the write is made. It is common in systems to have a small gap between these 2 actions. In most cases the user is redirected to another UI or URL which will cover this small delay. Even if that is not the case and users don't see the updated data instantly, they refresh the page.

In that pattern it is common to have more than one servers in the secondary replication so that traffic is equally split amongst each one of them. Even if the traffic gets that much higher we can replicate the primary replication. We can optimise even further with sharding.

Sharding

In Databases it is common to see certain tables being much busier than others. In order to optimise that we can use sharding into our primary database replication which will split the Database into smaller components amongst different servers. These components are also called shards and are faster and easier to manage.

We can apply our sharding horizontally where we just separate the tables into different instances. Otherwise we can do it vertically where we split the rows of the required tables. A common way of sharding vertically is by taking the id of the user and mod it by the number of machines. By doing that we end up with number the machine the user will be stored in.

Quality control

Testing

We need to ensure that our code is not going to break the rest of the system when we deploy it. To do that we can first introduce several levels of testing. Unit tests are good for testing the small pieces of code like functions, utils and React components. The next level is integration tests which test bigger parts like if forms are submitted. The final level is to include end to end tests which test entire journeys like if the user can fully register and then login etc.

We can also have contract tests for making sure the client side api sends and receives the appropriate schema to and from the server side api. In case either side changes, the contract test will let us know about it so that we take action. Finally we can have smoke tests which are quick checks to make sure that nothing major has been broken.

The good thing about automated tests is that even if a single one fails, they will prevent us from moving on. On the other hand though, if the test suit is not setup properly this can cause delays, so showing some love to the tests is really important for us developers.

CLI tools

CLI tools can be proven to really useful for catching the small details that can cause our code to fail in a later stage in the pipeline. Such tools can be linters, build tools, prettier, pre-commit hooks, commit message linters and most importantly, expecting all tests to pass.

A combination of these scripts can be included when we are about to commit or push our code and if any of them fails, that should prevent us that action from happening. CLI tools can be valueable when it comes to gatekeeping our code. They can ensure that new feature candidates meet the required quality standards before allowing users to use that code in production.

CI / CD

This is the stage that will allow us to make good use of everything that we mentioned in testing and CLI tools. Our deployment pipeline usually consists of running our tests and further checks in several preproduction stages aka environments. If the code is verified and running on these environments, only then we ship it into the final stage which is the production environment and make a new release.

Depending on how organisations work deploying to the final stage has to either be approved by a senior member of the team with Continuous delivery or automate that even further and have it be automatically be deployed to production as well with Continuous deployment. That minimises the chances of errors even further.

Continuous deployment makes releases "less risky" as we deploy more often (usually a few times per day). If we think about it, introducing a small chunks of code like a bug fix, is not that impactful to the system. At least not as much as releasing once every one or two weeks, which will introduce a lot more code to push and make things more complex to verify.

API failures

Our systems should be designed to handle failure. Our APIs can fail for a number of reasons which sometimes are unknown and not directly related to bad code. We need to have ways of preventing white pages or ugly web framework error pages from being shown to our users.

There are a couple of things we can do to prevent these from happening. One way is introduce API retries which in many cases solves the problem without anyone noticing that something went wrong. If that does not work we need to have custom made error pages that inform the user that something went wrong and they need to either try again shortly or contact the company’s support. We have all seen and love Github's "this is not the web page you're looking" error page. The users will feel that we have things under control. (even if we do not 🔥)

Healthchecks

No matter how much we test our code, errors can still occur. The first step to be aware of, is if our servers are out of service. Healthchecks can the first thing to check in order to find out if any of our servers is down.

Healthchecks can be ping endpoints which are very light and simple for the server to process. My favourite one is the Ping? Pong! approach in which we have a /ping endpoint. After calling it, it will respond with pong and that tells us that if the server responded, then it is up and running. We can also call calculate the time it took between the request and the response to get a brief idea of how healthy the server is.

Answer part 3 monitoring tools

This is where we talk about metrics and issues that occur in bigger scale systems. Our system is currently quite complex and it won't come as a surprise if there are several errors the users are facing or even a service being down and we only find we out about it after a few hours.

We need tools that will help us identify that something has gone wrong as soon as it happens and also to know what this something is. A common pattern is the the LMA stack which stands for Logging, Monitoring and Alerting.

Logging

Having logs is a major help when it comes to finding where errors originate from. I am a big fan of adding a trail of info logs which makes it easier for us to spot where the error occurred. We can do that by following the trails until the point they stop which will give us a more accurate idea of where the error probably originated from.

A common issue in systems is that errors occur and even after reading what the error message says, we still have no idea what that is about. It is the equivalent of a developer asking for help by saying that x doesn't work without providing more context. 😂

Detailed errors like descriptive error messages are our allies. When we see an error we should instantly at least understand what it is about and such messages really point us to the right direction.

Using HTTP error codes is also another approach which gives us an instant idea about the nature of the error without even looking at the error message. If the error is something the client "did wrong" then the error code should be a 4xx. On the other hand if the error is something that went wrong on the server then it should be 5xx .

Another issue that may arise here is that different systems may have different locations where their logs are stored. It can be quite painful to have to search 10 different locations to find our logs because of the 10 services we currently have up and running have different setups. Tools like Datadog or AWS Cloudwatch are perfect for gathering all the logs in one place and save ourselves from further headaches.

Using a combination of the logging points made above will definitely help us troubleshoot and assist us with finding the error "without much pain". However, the problem is that logging is useful when we know that something went wrong. How do we find out about what went wrong in the first place?

Alerting

Alerting is a mechanism that informs us every time our system throws an error. That can be anything and this is probably the only way that we can detect edge cases errors like e.g. an error that was introduced on the latest version of Opera that is interfering with one of our pages on an second generation iPad.

While it is not always possible to instantly fix such errors, it is very important to be aware of them. Alerting can easily become too noisy which will make it difficult for the engineer who is on call to know what should they pay attention to. Tools like PagerDuty or DataDog are commonly used for that.

So the real challenge here is to be able to monitor which errors are "worth investigating". Issues like latency and availability (SLIs) are stuff related to what the user experiences should be given attention. When a message is sent to the engineer, they can go to the monitoring tools for more information.

One important thing to be aware of is that 100% availability of a system is possible but really really expensive. Even big companies out there have 97% or even less availability. What if there is a planned maintenance? This is a big topic maybe we can dig deeper into it in a different blogpost.

Monitoring visualization tools

Server metrics like CPU, memory, network activity and disk space are sent to our monitoring tool of choice. Monitoring visualisation tools like Graphana and other dashboards can help us see all that information in graphs etc that is easier and faster to read in order to find out what is going on.

In many cases monitoring tools will provide CLI commands which will automatically search for common causes and as long as they are able to take action by themselves they will prevent from waking up a human in order to go and fix it.

Are we done yet? 🥵

There are sooo many concepts to go through where it is almost impossible to cover all of them in one blog post! I even believe that there are so many more that I'm not even aware about of their existence today. This guide will definitely give you a much clearer understanding of the big picture behind designing systems.

The thing I learned the hard way in interviews is that giving an overview of my approach and then asking which topic to go deeper into works much quite well as it makes the conversation more interactive.

At least better than going deep into topics from the very beginning without explaining what is our whole plan. In general, being involved into architecture or full stack development can really complicate things.

This blog post originally started as notes to check before interviews but soon enough I realised that this information is worth sharing with the community and help others. Who knows? Maybe there is going to be a part 2 as there are quite a few topics we did not talk about.

DEV Community

The system design interview: An introduction to system architecture

Before we begin

The Question!!!

Answer Part 1 the overview

Web App

Database

API

Hosting

Version control

Part 2 the optimisation

Load balancer

Scaling instances

Use a CDN

Database

Caching

Replication

Sharding

Quality control

Testing

CLI tools

CI / CD

API failures

Healthchecks

Answer part 3 monitoring tools

Logging

Alerting

Monitoring visualization tools

Are we done yet? 🥵

Top comments (0)

Read next

Understanding the CAP Theorem in Distributed Systems

𝗛𝗼𝘄 𝗕𝗮𝗰𝗸𝗲𝗻𝗱 𝗔𝗰𝗰𝗲𝗽𝘁 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻𝘀.

Understanding Quorum-Based Approaches in Distributed Systems - Jaimin Bariya

7 Ways to Boost Your Cybersecurity Career with CISSP Certification