Scalability is the ability of an application to handle a growing number of users, customers, or clients. It is also closely tied to how maintainable the application is over time. In this article, we examine ways in which a production application can break and how to fix them. We take a hypothetical application and scale it from a small number of users to a significant number, looking at the issues that are likely to come up in production along the way.
To get the most out of this piece, you should have some experience deploying applications on a cloud provider such as Digital Ocean, AWS, or Azure.
Let's assume we are building an application that lets users create, read, update, and delete todo tasks. It is a simple React app that talks to a Node.js backend, with data stored in a database such as Postgres. At some point the app is ready, and it is time to deploy it to a cloud provider like Digital Ocean so that people can use it.
Let's say we provision a droplet on Digital Ocean, which is what Digital Ocean calls its virtual private servers (VPS). Our VPS runs a Linux distribution with 1 GB of RAM and 1 vCPU, and both our database and our application run on this single server.
This server is sufficient to handle 100 users per day, and everything works fine. Then one day we start getting complaints from our users that the website is unavailable. We look at the server and find that the instance we provisioned has been deleted due to an issue on the Digital Ocean platform, and we no longer have access to our database or application. We go ahead and set everything up again and redeploy the website. But then users complain that they can't log in. We look at the database and find that all our data went down with the server, and we have no backup. So we apologize to our users and ask them to create new accounts and start from scratch.
So we implement a backup strategy. We write a cron job that runs twice a day to take a backup of the database. We can't keep the backup on the same server, because if the server crashes the backup goes with it and we can't restore anything. Instead, we push each backup to Amazon S3 so that we still have access to it even if our server is gone. We also verify the backups from time to time to make sure they are restorable. Now that we have database backups, our users are happy and confident using our application.
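As a rough illustration of the bookkeeping around such a backup job, the sketch below names each backup with a sortable timestamp and works out which old backups could be pruned. The key prefix, the `.sql.gz` extension, and the idea of keeping only the most recent N backups are assumptions made for this example; the upload to S3 itself is left out.

```typescript
// Hypothetical prefix for backup objects in the bucket.
const BACKUP_PREFIX = "todo-db-backups/";

// Build a sortable object key from the UTC time of the backup,
// e.g. "todo-db-backups/2024-01-05T12-00-00Z.sql.gz".
function backupKey(takenAt: Date): string {
  const stamp = takenAt
    .toISOString()              // "2024-01-05T12:00:00.000Z"
    .replace(/\.\d{3}Z$/, "Z")  // drop the milliseconds
    .replace(/:/g, "-");        // avoid ":" in object keys
  return `${BACKUP_PREFIX}${stamp}.sql.gz`;
}

// Given the keys currently stored, return the ones that could be deleted
// while keeping the `keep` most recent backups. Relies on the keys sorting
// chronologically, which the timestamp format above guarantees.
function keysToPrune(keys: string[], keep: number): string[] {
  const newestFirst = keys.slice().sort().reverse();
  return newestFirst.slice(keep);
}
```

With two backups a day, calling `keysToPrune(keys, 14)` would keep roughly a week of history; the right number depends on how far back we expect to need to restore.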
After some time, our application is down again and our users are getting 500 and 404 errors. This time the droplet is still there, unlike the first case where it was deleted, so we SSH into the server to look at the logs and find out what is wrong. We don't find any logs, because the app only writes to standard output and nothing is persisted. So we have to guess: since users are getting 404 errors, it may mean that the app is down, i.e. the Node server is not running. We restart the server and everything works again. To avoid guessing next time, we start writing logs to files so that we can read them whenever the app goes down. We also add auto-restart, so that whenever the application goes down it is restarted automatically. We achieve this with a process manager such as PM2 or systemd, and everything is back to normal.
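To make the auto-restart idea concrete, here is a small simulation of what a supervisor does: rerun the app after each crash, waiting a little longer every time, and give up after a fixed number of restarts. This is only an illustration of the concept, not how PM2 or systemd are implemented; the function names, the doubling delay, and the limits are all assumptions chosen for the example.

```typescript
// Illustrative only: in practice PM2 or systemd does this for us.
type Outcome = "exited" | "crashed";

// Delay before the n-th consecutive restart (1-based):
// doubles each time and is capped at maxMs.
function restartDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt - 1), maxMs);
}

// Run the app, restarting after each crash until it exits cleanly or the
// restart budget is used up. Instead of sleeping, this simulation records
// the delays a real supervisor would wait.
function supervise(runApp: () => Outcome, maxRestarts: number): { delays: number[]; gaveUp: boolean } {
  const delays: number[] = [];
  while (runApp() === "crashed") {
    if (delays.length >= maxRestarts) {
      return { delays, gaveUp: true };
    }
    delays.push(restartDelayMs(delays.length + 1));
  }
  return { delays, gaveUp: false };
}
```

Backing off between restarts keeps a crash loop from hammering the server, and a restart limit turns an endless loop into a visible failure that someone can investigate.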
As time goes by, our users start getting 500 errors even though the application is up. Since we now have logging in place, we can look at the server logs to find out what is wrong. The logs show that our database connection pool is saturated: the app cannot get a connection to the database, so requests fail.
A connection pool is a strategy for keeping database connections open and reusing them. Opening a database connection is expensive, so the pool keeps a set of open connections; when a client is done with a connection, it releases it back to the pool so that another client can reuse it without opening a new one. In our case, it may be that some clients never released their connections, so all the connections were used up and new requests could not get one, which is why we were getting 500 errors. We were also running with the default configuration and had never tuned it. So we revisit the database connection settings and increase the pool size. Our application is fixed and back to normal.
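The following toy pool shows the acquire/release cycle described above and how a connection that is never released leads to exhaustion. It is a sketch under simplifying assumptions: the "connections" are just numbered slots rather than real database sockets, the class and helper names are made up for this example, and an empty pool throws immediately, whereas many real pools make the caller wait and time out instead.

```typescript
// A toy pool: connections are numbered slots, not real database sockets.
class ConnectionPool {
  private idle: number[] = [];
  private checkedOut: number[] = [];

  constructor(size: number) {
    for (let id = 1; id <= size; id++) {
      this.idle.push(id);
    }
  }

  // Hand out an idle connection, or fail when every connection is in use.
  acquire(): number {
    const id = this.idle.pop();
    if (id === undefined) {
      throw new Error("connection pool exhausted");
    }
    this.checkedOut.push(id);
    return id;
  }

  // Return a connection to the pool so the next caller can reuse it.
  release(id: number): void {
    const index = this.checkedOut.indexOf(id);
    if (index === -1) {
      return; // not checked out (or already released): nothing to do
    }
    this.checkedOut.splice(index, 1);
    this.idle.push(id);
  }
}

// Run some work with a connection and always release it afterwards,
// even if the work throws. Forgetting this release is what leaks connections.
function withConnection<T>(pool: ConnectionPool, work: (conn: number) => T): T {
  const conn = pool.acquire();
  try {
    return work(conn);
  } finally {
    pool.release(conn);
  }
}
```

Routing every query through a helper like `withConnection` addresses the leak itself. Increasing the pool size only buys time if connections are still being lost.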
Now our app is stable, but it is slow for users in the United States. Our servers are in South Africa while we are targeting a global market, so requests from far away have to cross many network hops before they reach the server. To fix this, we add caching on both the client side and the server side to speed up requests, and we change our deployment strategy by:
- deploying static assets to Amazon S3
- caching dynamic content with Redis
With these changes, our application is faster and our users are happy.
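One common way to cache dynamic content is the cache-aside pattern: look in the cache first, and only go to the database on a miss. The sketch below uses a small in-memory class as a stand-in for Redis so that it runs on its own. The class, the `todos:<userId>` key format, and the `loadFromDb` callback are assumptions made for this example, not part of any Redis client API.

```typescript
// In-memory stand-in for Redis; a real deployment would call a Redis client.
class TtlCache<V> {
  private entries: { [key: string]: { value: V; expiresAt: number } } = {};
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so that expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries[key];
    if (entry === undefined) {
      return undefined;
    }
    if (entry.expiresAt <= this.now()) {
      delete this.entries[key];
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries[key] = { value, expiresAt: this.now() + this.ttlMs };
  }
}

// Cache-aside read: try the cache first, fall back to the database on a miss,
// then store the result so the next read is served from the cache.
function getTodos(
  userId: string,
  cache: TtlCache<string[]>,
  loadFromDb: (userId: string) => string[]
): string[] {
  const key = `todos:${userId}`;
  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached;
  }
  const fresh = loadFromDb(userId);
  cache.set(key, fresh);
  return fresh;
}
```

The expiry time is the trade-off to tune: a longer TTL means fewer database reads but a longer window in which users may see stale todo lists, unless the cache entry is also cleared whenever a todo is written.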
Next, we find that our storage usage keeps growing. Logs are taking up most of the disk, and the lack of free space is causing our processes to slow down. To fix this, we implement log rotation.
Log rotation is an automated process used in system administration in which log files are compressed, moved (archived), renamed, or deleted once they are too old or too big (other criteria can apply as well). In our case, we split the logs by date and delete each log after some time, which means that logs no longer pile up on the server.
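A date-based rotation policy like ours boils down to two small decisions: which file today's logs go into, and which older files can be deleted. The sketch below shows both, assuming a hypothetical `app-YYYY-MM-DD.log` naming scheme and a retention period in days. In practice a tool such as logrotate or a logging library would usually handle this.

```typescript
// Hypothetical naming scheme: one file per UTC day, e.g. "app-2024-01-05.log".
function logFileName(day: Date): string {
  return `app-${day.toISOString().slice(0, 10)}.log`;
}

// Return the log files that are older than `maxAgeDays` and can be deleted.
// Files that do not match the naming scheme are never selected.
function expiredLogs(files: string[], today: Date, maxAgeDays: number): string[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return files.filter((name) => {
    const match = /^app-(\d{4}-\d{2}-\d{2})\.log$/.exec(name);
    if (match === null) {
      return false; // not one of our log files: leave it alone
    }
    const fileDay = Date.parse(match[1] + "T00:00:00Z");
    const ageDays = (today.getTime() - fileDay) / msPerDay;
    return ageDays > maxAgeDays;
  });
}
```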
Again, we notice that our memory and CPU consumption is increasing. This is because both our database and our application run on the same server and compete for its resources. To fix this, we move the database and the application onto separate servers. Each now has its own memory and CPU, they no longer compete for resources, and our application can serve more users.
We keep gaining traction, which means our database has more load to handle. Our backup cron jobs still run twice a day, but there is now so much data to process that the application slows down whenever the database is busy with a backup. We also notice that our application performs far more reads than writes, because users retrieve todo items more often than they create them. So we decide to split the database into a primary and read replicas:
- writes go to the primary database
- reads go to the secondary (replica) database
With this in place, and since we know we have more reads than writes, we can keep adding more replicas to read from.
Data replication is the process of making multiple copies of data and storing them at different locations to improve their overall accessibility across a network. Database replication effectively reduces the load on the primary server by dispersing it among other nodes in the distributed system. This fixes our database issue and everything is back to normal.
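At the application level, the read/write split can be as simple as choosing a different database handle depending on the kind of query. The sketch below is a minimal illustration under assumed names (`Database`, `ReplicatedDb`); the replication of data from the primary to the replicas is handled by the database itself and is not shown.

```typescript
// Hypothetical handle for a database server; only the name matters here.
interface Database {
  name: string;
}

class ReplicatedDb {
  private primary: Database;
  private replicas: Database[];
  private nextReplica = 0;

  constructor(primary: Database, replicas: Database[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // All writes go to the primary.
  forWrite(): Database {
    return this.primary;
  }

  // Reads rotate across the replicas; with no replicas, fall back to the primary.
  forRead(): Database {
    if (this.replicas.length === 0) {
      return this.primary;
    }
    const db = this.replicas[this.nextReplica];
    this.nextReplica = (this.nextReplica + 1) % this.replicas.length;
    return db;
  }
}
```

One thing to keep in mind with this design is that replicas can lag slightly behind the primary, so a read issued immediately after a write may not see it yet.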
After some time, our application can no longer handle the load coming from our users. We look at the infrastructure and conclude that a single app server will not keep up, so we add more app servers and put a load balancer in front of them.
Load balancing is the distribution of workload across multiple computing resources. In the case of web applications, this would mean distributing web traffic across multiple server instances. To horizontally scale our application, we create a load balancing server, which balances traffic to all application servers running our application. The load balancer would receive incoming traffic to our site, and proxy this traffic to all application servers. This fixes our issue and our server can handle more users.
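The simplest balancing strategy is round robin: send each new request to the next server in the list. The sketch below shows that selection logic, plus a way to skip servers that have been marked as down. The class name, the server names, and the health marking are assumptions for this example; a real setup would normally rely on a load balancer such as Nginx, HAProxy, or a managed cloud offering rather than hand-written code.

```typescript
class LoadBalancer {
  private servers: string[];
  private cursor = 0;
  private down: { [server: string]: boolean } = {};

  constructor(servers: string[]) {
    this.servers = servers;
  }

  // Take a server out of rotation, e.g. after a failed health check.
  markDown(server: string): void {
    this.down[server] = true;
  }

  // Put a server back into rotation.
  markUp(server: string): void {
    delete this.down[server];
  }

  // Pick the next healthy server in round-robin order,
  // or null if no server is currently healthy.
  next(): string | null {
    for (let tried = 0; tried < this.servers.length; tried++) {
      const server = this.servers[this.cursor];
      this.cursor = (this.cursor + 1) % this.servers.length;
      if (!this.down[server]) {
        return server;
      }
    }
    return null;
  }
}
```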
Now we have a different problem. Because we have multiple servers, our logs are spread across these instances, and whenever there is an issue we have to SSH into every instance and go through the logs one by one to identify it. To fix this, we integrate the ELK stack.
ELK is an acronym for a stack that comprises three popular open-source projects: Elasticsearch, Logstash, and Kibana. Instead of writing logs only to files on each server, we send them to Logstash. From there the logs are indexed into Elasticsearch, and Kibana lets us search and visualize them in one place. With this setup, we can serve a significant number of users without going down.
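Centralized logging works best when every server emits logs in the same structured format and tags each entry with the host it came from. The sketch below formats log entries as one JSON object per line and merges lines from several hosts into a single timeline, which is roughly the view a central log store gives us. The field names are an assumption for this example and not a schema that Logstash or Elasticsearch requires.

```typescript
// Hypothetical log entry shape; the field names are our own choice.
interface LogEntry {
  timestamp: string; // ISO 8601, so string order equals time order
  level: "info" | "warn" | "error";
  host: string;
  message: string;
}

// Serialize one entry as a single JSON line.
function formatLogLine(level: LogEntry["level"], host: string, message: string, at: Date): string {
  const entry: LogEntry = { timestamp: at.toISOString(), level, host, message };
  return JSON.stringify(entry);
}

// Parse JSON lines collected from several servers and order them by time.
function mergeByTime(lines: string[]): LogEntry[] {
  const entries = lines.map((line) => JSON.parse(line) as LogEntry);
  return entries.sort((a, b) => {
    if (a.timestamp < b.timestamp) return -1;
    if (a.timestamp > b.timestamp) return 1;
    return 0;
  });
}
```

Because each entry carries its `host`, a single search for an error message shows which server produced it, without having to SSH into each instance.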
In summary, there is much more to system design than this, and there are many details I did not go into here.