Garvit Gupta

Posted on • Originally published at blog.garvitgupta.in

Learnings from a 5-hour production downtime!

As with all incidents, it happened on a Friday evening!

In this article, I’ll walk through the causes of a recent 5-hour downtime in one of our critical production services, and why recovery took so long.

The affected service, a Node.js application, manages data transactions with PostgreSQL, sustaining peak loads of 250K requests per minute. Our server infrastructure is orchestrated via Kubernetes, with AWS RDS serving as the backend database.

The Beginning (5 PM)

The problem started around 5 PM, when the servers began receiving unusually high traffic, 3–4 times the normal load. Under this load the database server started degrading, and within 15 minutes it could barely process any queries.

First Response (5:20 PM)

We investigated possible causes for the traffic surge, such as a marketing campaign, but found nothing conclusive, and the traffic was still increasing. To shed load and give the database room to recover, we implemented temporary rate-limiting rules on our firewall. Traffic dropped and the database showed signs of recovery.
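
For illustration only, here is a minimal sketch of the kind of rate limiting we applied, expressed at the application layer with the express-rate-limit middleware rather than as firewall rules; the window and limit are placeholders, not our actual values:

```typescript
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();

// Cap each client IP at 300 requests per minute; excess requests get HTTP 429.
// These numbers are illustrative, not the rules we actually used.
app.use(
  rateLimit({
    windowMs: 60 * 1000,
    max: 300,
    standardHeaders: true, // report limits via RateLimit-* headers
    legacyHeaders: false,  // disable the legacy X-RateLimit-* headers
  })
);

app.get("/health", (_req, res) => res.send("ok"));
app.listen(3000);
```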

Second Attack (5:45 PM)

Just as we believed the incident had concluded, the RDS console flashed ‘Storage Full.’ The database had exhausted its storage capacity, rendering it unable to process any new requests. Knowing that AWS allows easy storage expansion, we promptly tried to increase the storage capacity. To our surprise, the request failed with an error saying the storage could not be increased. After multiple unsuccessful attempts, we found that in AWS, the storage of an RDS instance cannot be increased more than once in 6 hours (AWS reference):

Storage optimization can take several hours. You can’t make further storage modifications for either six (6) hours or until storage optimization has completed on the instance, whichever is longer.

But we recalled that we hadn’t increased the storage in the last 6 hours, so who did?
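
For context, increasing RDS storage is normally a single API call. A minimal sketch with the AWS SDK for JavaScript v3, where the instance identifier and target size are placeholders; during the six-hour cooldown this same call is simply rejected:

```typescript
import { RDSClient, ModifyDBInstanceCommand } from "@aws-sdk/client-rds";

const rds = new RDSClient({});

// Request a larger volume on the existing instance.
await rds.send(
  new ModifyDBInstanceCommand({
    DBInstanceIdentifier: "prod-db", // placeholder identifier
    AllocatedStorage: 500,           // target size in GiB (placeholder)
    ApplyImmediately: true,          // don't wait for the maintenance window
  })
);
// While the instance is still in the post-scale-up "storage optimization"
// cooldown, RDS rejects this modification instead of applying it.
```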

Hidden Attack (5:30 PM)

In AWS RDS, you can configure storage auto-scaling, which automatically increases storage when the instance nears capacity. Our database had auto-scaling configured. By 5:30 PM, the surge in traffic had already pushed the database storage to its scale-up threshold, triggering an automatic scale-up. This meant that we would not be able to increase the storage for the next 6 hours!
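
For reference, RDS storage auto-scaling is controlled by the MaxAllocatedStorage setting. A minimal sketch of enabling it, with a placeholder identifier and ceiling:

```typescript
import { RDSClient, ModifyDBInstanceCommand } from "@aws-sdk/client-rds";

const rds = new RDSClient({});

// Setting MaxAllocatedStorage above the current AllocatedStorage enables
// storage auto-scaling: RDS grows the volume when free space runs low,
// and each automatic growth starts the same six-hour cooldown.
await rds.send(
  new ModifyDBInstanceCommand({
    DBInstanceIdentifier: "prod-db", // placeholder identifier
    MaxAllocatedStorage: 1000,       // auto-scaling ceiling in GiB (placeholder)
    ApplyImmediately: true,
  })
);
```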

Darkness

We couldn’t afford to wait six hours to increase storage because the period between 5 and 10 PM sees the highest traffic. Given the critical nature of this service, any delay would severely impact user experience and business operations. We considered restoring a backup on a new RDS server and decommissioning the current one. However, since the last backup had been taken 3 hours earlier, this solution would mean losing 3 hours of data.

Ray of Hope (6:30 PM)

After consulting with the service owners, we concluded that losing 3 hours of data was acceptable: the nature of the service is such that once it is back online, any lost data will be recreated. So we started preparing for a point-in-time recovery of the database. We provisioned a new RDS server mirroring the current configuration, but with expanded storage, and initiated the backup restoration process. Based on previous experience, we estimated the restoration would take approximately 20–30 minutes.
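
A minimal sketch of kicking off such a point-in-time restore with the AWS SDK for JavaScript v3; the identifiers, timestamp, and instance class are placeholders, not our actual configuration:

```typescript
import {
  RDSClient,
  RestoreDBInstanceToPointInTimeCommand,
} from "@aws-sdk/client-rds";

const rds = new RDSClient({});

// Restore the source instance's state into a brand-new instance.
await rds.send(
  new RestoreDBInstanceToPointInTimeCommand({
    SourceDBInstanceIdentifier: "prod-db",         // placeholder
    TargetDBInstanceIdentifier: "prod-db-restore", // placeholder
    RestoreTime: new Date("2024-03-01T11:30:00Z"), // placeholder timestamp;
    // alternatively set UseLatestRestorableTime: true for the newest point.
    DBInstanceClass: "db.r6g.4xlarge",             // placeholder; the learnings
    // below suggest over-provisioning CPU here. Depending on API version,
    // storage and IOPS can also be set at restore time or modified afterwards.
  })
);
```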

Darkness once again (7:15 PM)

Even after 45 minutes, the restoration was not complete. We started investigating why it was taking so long (there is no progress bar while restoring, so we didn’t know whether it would be done in 10 minutes or take 10 more hours). We discovered that the server’s CPU usage was at almost 100%, which was likely slowing the restoration. However, increasing CPU capacity wasn’t feasible, as it required changing the RDS instance type, something that couldn’t be done while the restoration was in progress.

Back to square one (8:00 PM)

After waiting 45 more minutes, we decided to increase the CPU. The only way to do that was to create a new server and start the backup restoration again from scratch. We kept the current, CPU-bound server running and simultaneously started a restoration on a new server with 3 times the CPU, planning to use whichever finished first. On the new server, CPU was no longer a bottleneck, stabilising at 50–60%.

Still Not Done (9:00 PM)

Even after an hour, the backup process continued on both servers. Concerned about other potential bottlenecks, we began checking metrics for the new server. It turned out that this time IOPS was the bottleneck (IOPS is the number of disk I/O operations that can be performed per second; I/O requests beyond the provisioned threshold are throttled). We paled at the thought of having to restart the recovery process from scratch, once again!

The Last Stand

Fortunately, AWS allows increasing the provisioned IOPS even while a restore is in progress. Doubling the IOPS resolved the bottleneck. Finally, by 10 PM, the backup restoration completed and we updated the service configuration to connect to the new server. By 10:15 PM the service had stabilised and was handling traffic as usual. The next day, we scaled back all the resources we had over-provisioned during the restoration.
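
A minimal sketch of that IOPS bump, again with the AWS SDK for JavaScript v3; the identifier and value are placeholders:

```typescript
import { RDSClient, ModifyDBInstanceCommand } from "@aws-sdk/client-rds";

const rds = new RDSClient({});

// Raise provisioned IOPS on the restoring instance; in our case this
// modification was accepted without restarting the restore.
await rds.send(
  new ModifyDBInstanceCommand({
    DBInstanceIdentifier: "prod-db-restore", // placeholder identifier
    Iops: 12000,                             // placeholder target IOPS
    ApplyImmediately: true,
  })
);
```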


Learnings

  • Maintain adequate free storage on database servers. Before the incident, our database server was already running at 90% storage utilisation; proactively increasing the storage (for example, driven by an alert that fires well before storage runs out, as sketched after this list) would have avoided the storage bottleneck during the incident.
  • While restoring a backup, database server resources should be over-provisioned to avoid bottlenecks.
  • The rate limiting we implemented reactively during the incident should have been in place proactively, so that sudden traffic spikes are absorbed before they impact the servers.
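
As a concrete example of the first point, here is a minimal sketch of a CloudWatch alarm on the RDS FreeStorageSpace metric, in the spirit of the "alert when free storage drops below 30%" rule mentioned in the comments below; the names, threshold, and SNS topic ARN are placeholders:

```typescript
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Alarm when free storage stays below ~30% of a 500 GiB volume (~150 GiB)
// for 15 minutes, leaving time to grow storage manually before auto-scaling
// (and its six-hour cooldown) kicks in.
await cw.send(
  new PutMetricAlarmCommand({
    AlarmName: "rds-prod-db-low-free-storage", // placeholder name
    Namespace: "AWS/RDS",
    MetricName: "FreeStorageSpace",            // reported in bytes
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: "prod-db" }],
    Statistic: "Average",
    Period: 300,                               // 5-minute datapoints
    EvaluationPeriods: 3,
    Threshold: 150 * 1024 ** 3,                // ~150 GiB in bytes
    ComparisonOperator: "LessThanThreshold",
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:oncall"], // placeholder ARN
  })
);
```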

Thank you for reading. Until next time, happy reading!

Top comments (4)

Thomas More

Thank you for sharing your insights. Proactive measures are crucial in maintaining the stability and performance of our database servers.

Considering your points, it's clear that maintaining adequate free storage on our database servers is essential to avoid storage bottlenecks, especially during critical incidents. In hindsight, increasing storage capacity proactively could have mitigated the risk of encountering such bottlenecks.

Furthermore, your suggestion to over-provision resources during the restoration of backups is well noted. Over-provisioning resources can help ensure smoother operations and minimize the impact of potential bottlenecks during such critical processes.

Lastly, implementing rate limiting proactively to manage sudden traffic spikes is a sensible approach to prevent server overload and maintain optimal performance. By anticipating potential traffic spikes and implementing appropriate measures beforehand, we can better safeguard against disruptions and ensure the seamless functioning of our servers.

Moving forward, we must prioritize proactive measures to address potential challenges before they escalate into critical incidents. By doing so, we can enhance the resilience and reliability of our database infrastructure.

If you need further assistance or coursework help in implementing these proactive measures, please feel free to reach out.

John P. Rouillard

Have you disabled the automatic RDS storage scaling? Also, it sounds like you hadn't tested restoration of backup before. Is that true?

Garvit Gupta • Edited

Hi John, no, we haven't disabled automatic storage scale-up; we have added alarms for when remaining storage drops below 30%, so that we can increase the storage manually before the auto-scale-up threshold is hit.

Also, it sounds like you hadn't tested restoration of backup before

What makes you think so? We have restored backups earlier but unlike this time we never faced bottlenecks due to CPU or IOPS.

John P. Rouillard

Hello Garvit:

We have restored backups earlier
Sorry, bad assumption on my part.

but unlike this time we never faced bottlenecks due to CPU or IOPS.
Exactly this. How was/were your previous restore(s) different from this restore?
Why didn't you see bottlenecks before?

I assume the smaller system (CPU bound) should have been able to restore the 3-hour-old backup quickly based on prior experience. The service then would backfill all the data from the time of the backup to the time the restore was completed. When you started the restore this time, what was the expected completion time: minutes, hours?

Since you worked around the issues by provisioning more CPU and IOPS, the restore wasn't physically limited by the ongoing disk activity from the automatic storage migration. Hmm, maybe that's a bad assumption. Did provisioning more IOPS move you to a different storage subsystem (away from the one handling the migration)? Also, did you notice what happened to the CPU use when you increased the IOPS?

I've had to do/assist with a few DRs in my career (catastrophic storage failure, virus, bugs, etc.). But it's all been on hardware in the company's DCs. I've never had an issue where DR fell well outside the predicted and tested times. So I'm curious if this is a new issue for cloud services.

What are your thoughts?