In a world not so far away, where APIs responded happily until they didn't... 😥
Issue Summary.
- Duration - 1 hour 25 minutes (Start: 07-10-2023 10:21 GMT, End: 07-10-2023 11:46 GMT). It coincided with my coffee break, lol.
- Impact - Unsuspecting, happy users were unable to get answers to math questions. Requests just kept loading without ever returning a response (like bro, what the...).
- Root cause - Too many calls to the WolframAlpha API, due to an increase in traffic after our marketing campaign blew up. I guess WolframAlpha was like, bro, this is way more than we signed up for 😂.
Timeline.
- Detection - The issue was detected at 07-10-2023 10:21 GMT by automated monitoring, which raised an alert for external API failures and unusually high traffic.
- Actions Taken - Initially assumed the issue was related to a recent code deployment that had introduced a performance regression, but tests on the codebase showed that the WolframAlpha API was not responding. Took the issue to the backend team for further investigation.
- Misleading Investigation - Investigated recent code changes extensively, which diverted attention from the actual cause (if it had been the code, I would have completely roasted Folarin, since he pushed the last deployment. Lucky him 😐).
- Escalated - The issue was escalated to the backend team.
- Resolution - Increased our rate limit and paid for the premium tier of the WolframAlpha API. Also implemented emergency scaling of server resources to handle the unexpected traffic.
Root Cause and Resolution.
- Root cause - The surge in external API calls and overall traffic was caused by a successful marketing campaign that brought in significantly more users than anticipated.
- Resolution - Upgraded our WolframAlpha API plan to the premium tier. Reviewed and updated infrastructure capacity planning to accommodate sudden spikes in traffic.
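For the curious, here is a minimal sketch of one way to keep outgoing calls under an external API quota on the client side: a token bucket. It is an illustration only, not part of the fix described above. The `TokenBucket` class, its parameters, and the numbers in the usage note are all hypothetical, and the real WolframAlpha quota is not modeled here.

```python
import time


class TokenBucket:
    """Hypothetical client-side limiter: about `rate` calls per second,
    with bursts of up to `capacity` calls."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if a call may go out now, False if it should wait."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With something like `bucket = TokenBucket(rate=5, capacity=10)`, a request handler could check `bucket.allow()` before each external call and return a quick "try again shortly" message when it comes back `False`, instead of leaving the user staring at a spinner.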
Corrective and Preventive Measures.
- Enhance monitoring to provide early warnings for external API failures.
- Conduct regular capacity planning exercises to anticipate and handle increased user loads.
Tasks.
- Enhance communication strategies for informing users during service outages.
- Conduct a thorough review of the incident response process for better coordination.
- Implement the early warning alert system on Datadog.
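To give a rough idea of what an early warning for external API failures could look like, here is a small sketch of a sliding-window failure-rate check. It does not use the Datadog API. The `FailureRateMonitor` class, the window size, and the threshold are assumptions made up for illustration, and the real alert would be configured in the monitoring tool itself.

```python
import time
from collections import deque


class FailureRateMonitor:
    """Hypothetical early-warning check: flags when the share of failed
    external calls in the last `window_seconds` reaches `threshold`."""

    def __init__(self, window_seconds, threshold, min_calls=1,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.threshold = threshold    # e.g. 0.5 means 50% failures
        self.min_calls = min_calls    # ignore windows with too few calls
        self.clock = clock            # injectable so the check can be tested
        self.events = deque()         # (timestamp, succeeded) pairs

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def record(self, succeeded):
        """Record the outcome of one external API call."""
        now = self.clock()
        self.events.append((now, succeeded))
        self._evict(now)

    def should_alert(self):
        """Return True when the recent failure rate is at or above threshold."""
        self._evict(self.clock())
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.threshold
```

The `min_calls` guard is there so a single failed request during a quiet period does not page anyone. The idea is that the wrapper around the external call would invoke `record(True)` or `record(False)` after each attempt, and a periodic job would check `should_alert()`.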