Postmortem: Service Outage on MathEase Platform

#sre #webdev #devops #beginners

In a world not so far away, where APIs responded happily until they didn't... 😥

Issue Summary.

Duration - 1 hour 25 minutes (Start: 07-10-2023 10:21 GMT, End: 07-10-2023 11:46 GMT). It coincided with my coffee break, lol.
Impact - Unsuspecting, happy users were unable to get answers to their math questions; the page just kept loading without ever returning a response (like bro, what the...).
Root cause - Too many calls to the WolframAlpha API, due to an increase in traffic after our marketing campaign blew up. I guess WolframAlpha was like, bro, this is way more than you signed up for 😂.

The issue was detected at 07-10-2023 10:21 GMT by automated monitoring systems triggering an alert for external API failure and unusually high traffic.
Actions Taken - Initially assumed the issue was a performance regression introduced by a recent code deployment. Running tests against the codebase showed instead that the WolframAlpha API was not responding, so the issue went to the backend team for further investigation.
Misleading Investigation - Investigated recent code changes extensively, which diverted attention from the actual cause (if it had been the code, I would have completely roasted Folarin, 'cause he pushed the last deployment, lucky him 😐).
Escalated - The issue was escalated to the backend team.
Resolution - Upgraded to the premium tier of the WolframAlpha API to get a higher rate limit. Also implemented emergency scaling of server resources to handle the unexpected traffic.

Root cause - The surge in external API calls and overall traffic was caused by a successful marketing campaign that brought in significantly more users than anticipated.
Resolution - Upgraded the WolframAlpha external API service to a premium one. Reviewed and updated infrastructure capacity planning to accommodate sudden spikes in traffic.
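
Paying for a bigger quota fixed this outage, but the quota is still finite. One common guard (not something we shipped during the incident, just a sketch) is to cap our own outgoing calls with a token bucket, so users get a quick "try again" instead of an endless spinner. Everything named here (`TokenBucket`, `ask_math_api`, the `call_api` callback, the numbers) is hypothetical and stands in for whatever the real backend uses.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added back per second
        self.capacity = capacity    # maximum tokens we can hold
        self.clock = clock          # injectable so the logic is easy to test
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False instead of waiting."""
        now = self.clock()
        # Refill based on elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def ask_math_api(question, bucket, call_api):
    """Call the external API only while under our own limit; otherwise fail fast.

    `call_api` is a placeholder for the real WolframAlpha client call.
    """
    if not bucket.try_acquire():
        return {"ok": False, "error": "Busy right now, please retry shortly."}
    return {"ok": True, "answer": call_api(question)}
```

The `rate` and `capacity` values would have to come from whatever the API plan actually allows; I have not assumed any specific numbers here.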

Enhance monitoring to provide early warnings for external API failures.
Conduct regular capacity planning exercises to anticipate and handle increased user loads.
Tasks.
Enhance communication strategies for informing users during service outages.
Conduct a thorough review of the incident response process for better coordination.
Implement the early warning alert system on DataDog.
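
For the early warning task, the core idea is roughly "alert when too many recent external API calls fail". Below is a minimal sketch of that logic in plain Python. It is not DataDog configuration and uses no DataDog API; the class name, window size, and thresholds are all assumptions picked for illustration.

```python
from collections import deque


class FailureRateMonitor:
    """Track recent external API calls and flag when the failure rate is too high."""

    def __init__(self, window_seconds=60, threshold=0.5, min_calls=5):
        self.window_seconds = window_seconds  # how far back we look
        self.threshold = threshold            # failure ratio that triggers an alert
        self.min_calls = min_calls            # avoid alerting on a tiny sample
        self.events = deque()                 # (timestamp, succeeded) pairs

    def record(self, timestamp, succeeded):
        """Store one call result and drop anything that fell out of the window."""
        self.events.append((timestamp, succeeded))
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def should_alert(self):
        """Return True when enough calls were seen and too many of them failed."""
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.threshold
```

In a real setup the same rule would probably live in the monitoring tool itself, with the app only reporting success and failure counts.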