DEV Community

Hameed Osilaja

Postmortem: Service Outage on MathEase Platform

In a world not so far away, where APIs responded happily until they didn't... 😥

Rest in peace APIs

Issue Summary.

  • Duration - 1 hour 25 minutes (Start: 07-10-2023 10:21 GMT, End: 07-10-2023 11:46 GMT), which coincided with my coffee break, lol.
  • Impact - unsuspecting and happy users were unable to get answers to their math questions; the page just kept loading without ever returning a response (like bro, what the...).
  • Root cause - too many calls to the WolframAlpha API, due to an increase in traffic after our marketing campaign blew up. I guess WolframAlpha was like, bro, this is way more than we signed up for 😂.
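
The endless loading described above is what a request looks like when the upstream call has no deadline. As a minimal sketch (not what MathEase actually runs), the call can be wrapped so the user gets a fallback message instead of a spinner. The names `answer_with_deadline`, `ask_upstream`, and `FALLBACK_MESSAGE`, as well as the 5-second default, are hypothetical and only for illustration.

```python
import concurrent.futures

# Hypothetical message shown to the user when the upstream API is slow or failing.
FALLBACK_MESSAGE = "The math engine is busy right now, please try again shortly."

# Shared worker pool: a stuck upstream call ties up a worker, not the caller.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def answer_with_deadline(ask_upstream, question, timeout_s=5.0):
    """Return the upstream answer, or a fallback if it is too slow or raises.

    `ask_upstream` is any callable that takes the question and returns an
    answer string (for example, a function that calls the external API).
    """
    future = _pool.submit(ask_upstream, question)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only has an effect if the call has not started running yet.
        future.cancel()
        return FALLBACK_MESSAGE
    except Exception:
        # Any upstream error (rate limit, network failure) degrades gracefully.
        return FALLBACK_MESSAGE
```

A timed-out call still occupies a worker thread until it returns, so in practice the HTTP client used inside `ask_upstream` should set its own request timeout as well.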

Timeline.

  • The issue was detected at 07-10-2023 10:21 GMT by automated monitoring systems triggering an alert for external API failure and unusually high traffic.
  • Actions Taken - Initially assumed the issue was related to a recent code deployment that introduced a performance regression, but tests on the codebase showed that the WolframAlpha API was not responding. Took the issue to the backend team for further investigation.
  • Misleading Investigation - Investigated recent code changes extensively, which diverted attention from the actual cause (if that had been the cause, I would have completely roasted Folarin, 'cause he pushed the last code to deployment, lucky him 😐).
  • Escalated - The issue was escalated to the backend team.
  • Resolution - Increased the rate limit and paid for the premium service of the WolframAlpha API. Also implemented emergency scaling of server resources to handle the unexpected traffic.

Root Cause and Resolution.

  • Root cause - The surge in traffic and in the number of external API calls was caused by a successful marketing campaign that brought in significantly more users than anticipated.
  • Resolution - Upgraded the WolframAlpha external API service to a premium one. Reviewed and updated infrastructure capacity planning to accommodate sudden spikes in traffic.
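
Since the root cause was sending more calls than the external API would accept, one common guard is a client-side budget that refuses calls before the provider does. Below is a minimal token-bucket sketch. It is an illustration under assumptions, not the fix that was deployed: the class name `TokenBucket` and the example numbers are hypothetical, and they do not reflect WolframAlpha's real quota.

```python
import time


class TokenBucket:
    """Allow at most `capacity` calls in a burst, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock  # injectable so the bucket can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the budget is spent."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# Placeholder numbers: 2 calls per second on average, bursts of up to 3.
bucket = TokenBucket(rate_per_s=2, capacity=3)
```

When `try_acquire()` returns False, the service can serve a cached answer or a "try again shortly" message instead of forwarding the call and burning through the quota.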

Corrective and Preventive Measures.

  • Enhance monitoring to provide early warnings for external API failures.
  • Conduct regular capacity planning exercises to anticipate and handle increased user loads.

Tasks.

  • Enhance communication strategies for informing users during service outages.
  • Conduct a thorough review of the incident response process for better coordination.
  • Implement the early warning alert system on DataDog.
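
To make "early warning for external API failures" concrete, here is a rough sketch of the kind of rule such an alert could encode: track recent call outcomes in a sliding window and fire once the failure rate crosses a threshold. It is plain Python and does not use any DataDog API. The class name `FailureRateMonitor` and the default window, threshold, and minimum-call values are assumptions to tune, not settings from our setup.

```python
import time
from collections import deque


class FailureRateMonitor:
    """Track external API call outcomes over a sliding time window."""

    def __init__(self, window_s=60.0, threshold=0.2, min_calls=20,
                 clock=time.monotonic):
        self.window_s = window_s      # how far back to look, in seconds
        self.threshold = threshold    # alert when the failure rate exceeds this
        self.min_calls = min_calls    # ignore windows with too little traffic
        self._clock = clock
        self._events = deque()        # (timestamp, succeeded) pairs, oldest first

    def _evict(self, now):
        # Drop outcomes that have fallen out of the window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def record(self, succeeded):
        """Record the outcome of one external API call."""
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def failure_rate(self):
        """Fraction of calls in the window that failed (0.0 if there were none)."""
        self._evict(self._clock())
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self):
        """True when there is enough traffic and too many failures."""
        rate = self.failure_rate()
        return len(self._events) >= self.min_calls and rate > self.threshold
```

The `min_calls` floor is there so that a single failed request during a quiet period does not page anyone.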
