We have been running a fairly successful venture, https://cookups.com.bd, for some time, and its user base is growing day by day.
Our startup, Cookups, is a kind of Uber for homemade food. It is one of the biggest portals for ordering homemade food in our country.
A few days ago, there was a huge surge in CPU and RAM usage in our system (which runs on AWS, of course). Our initial assumption was that, since our user base is growing exponentially, this was quite a nice headache to have: orders from users were piling up, and all we needed to do was choose a higher AWS tier (say, the m series instead of the t series) with more CPU power and RAM, and the problem would be resolved. Since it is a live system, we didn't want to take any chances and were about to go for the higher configuration before common sense prevailed.
Thankfully, we decided to first run a test on the staging server: we imported the production database, gave the server more CPU and RAM, and watched to see whether the problem went away. The next day, CPU and RAM usage surged again at a specific time, and although we had more RAM and CPU power, the problem did not subside.
Yes, more users are registering in our system day by day, but user growth alone should not make CPU and RAM usage peak at one specific time. Then we found the cause: a scheduler runs some scheduled tasks every day, and because there was some corrupted data in our database (left behind by an earlier coding error that we had since fixed), the database was getting locked. CPU and RAM usage climbed as the scheduler kept failing and retrying the scheduled tasks over and over again. So, we were about to throw hardware at a software problem.
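To illustrate the failure mode, here is a minimal Python sketch of a retry cap. It is not the scheduler described above, and nothing is assumed here about how that scheduler was configured; `run_with_retry_limit`, `max_attempts` and `base_delay` are hypothetical names chosen for the example. The idea is simply that a task which keeps failing should stop and report the error instead of retrying forever.

```python
import time


def run_with_retry_limit(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry on failure, but give up after max_attempts.

    All names here are hypothetical, for illustration only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                # Stop retrying and surface the error so someone investigates.
                raise RuntimeError(
                    f"task failed after {max_attempts} attempts"
                ) from exc
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With a cap like this, a job that trips over corrupted data fails loudly after a few attempts, which points you to the data rather than to the instance size.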
Since it's easy to upgrade RAM and CPU in AWS (at least that's what gets promoted, intentionally or unintentionally, all over the internet), the moment we see CPU or RAM usage increasing, there is an inclination to scale up vertically, although the problem may be elsewhere. That's why I call it the AWS trap.
So here are some tips for looking deeper into the matter before reaching for scaling:
1) Check the nature of the CPU and RAM usage.
If CPU or RAM usage increases drastically at a particular time of day, there is a high chance that the problem is in the code or the database.
2) See if the RAM or CPU usage graph looks like the following:
For a startup, this type of CPU or RAM graph is most likely not caused by a sudden increase in users. It is more likely due to a coding error.
3) Before actually scaling (vertically or horizontally), look for ways to identify the bottlenecks (most likely there are quite a few of them in the early stages, whether you admit it or not) and solve them first.
4) Review the existing code, improve SQL/ORM queries where possible, and make sure you really understand what's going on.
5) Consider scaling and premature optimisation as the last option for any new system.
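
Tips 1 and 2 can be partly automated. The following is a rough Python sketch, assuming you can export CPU (or RAM) samples from your monitoring tool as (timestamp, percent) pairs. The function name `recurring_spike_hours` and the `ratio` and `min_share` thresholds are made up for this example and would need tuning. It reports the hours of the day at which usage spikes on most days, which hints at a scheduled job rather than user growth.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def recurring_spike_hours(samples, ratio=2.0, min_share=0.8):
    """Return hours of the day (0-23) at which usage spikes on most days.

    samples: iterable of (datetime, percent) pairs.
    A sample is a spike when it exceeds `ratio` times the overall median.
    An hour is reported when it spikes on at least `min_share` of the days.
    """
    samples = list(samples)
    if not samples:
        return []
    baseline = median(value for _, value in samples)
    all_days = {ts.date() for ts, _ in samples}
    spike_days = defaultdict(set)  # hour -> dates on which that hour spiked
    for ts, value in samples:
        if value > ratio * baseline:
            spike_days[ts.hour].add(ts.date())
    return sorted(
        hour
        for hour, days in spike_days.items()
        if len(days) / len(all_days) >= min_share
    )


# Synthetic week of hourly samples: 20% all day, 95% at 02:00 every day.
start = datetime(2024, 1, 1)
samples = [
    (start + timedelta(days=day, hours=hour), 95.0 if hour == 2 else 20.0)
    for day in range(7)
    for hour in range(24)
]
print(recurring_spike_hours(samples))  # [2]
```

If the result is a short list of fixed hours, check what runs at those times (cron jobs, schedulers, backups) before changing the instance type.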