The following is the incident report for the Nginx connection failure that occurred on April 3, 2023. We understand this service issue has impacted our valued developers and users, and we apologize to everyone who was affected.
Issue Summary
From 6:30 PM to 7:00 PM WAT, curl requests to our local server on port 80 failed with "connection refused" errors. At its peak, the issue affected 100% of traffic to this server. The incident delayed the processing of critical business operations for some of our users.
Timeline (all times in West Africa Time)
- 6:25 PM: Configuration change is made by Engineer Daniel without peer review
- 6:29 PM: Nginx is reloaded
- 6:30 PM: Curl request fails, and issue begins
- 6:30 PM: Engineer Pelumi notices connection failure
- 6:50 PM: Engineer Pelumi investigates the issue and identifies the root cause
- 6:55 PM: Successful configuration change rollback by Engineer Pelumi
- 6:59 PM: Nginx restarts begin
- 7:00 PM: 100% of traffic back online
Root Cause and Resolution
At 6:25 PM, Engineer Daniel edited the Nginx configuration file that specifies the ports the server listens on. The change made the default server listen only on port 8080. As a result, nothing was listening on port 80, so requests from our developers and users on that port were refused. The change was made without peer review, which contributed to the issue.
At 6:29 PM, Nginx was reloaded to apply the change, and a minute later curl requests on port 80 failed with connection refused. Engineer Pelumi, who made the curl request and noticed the failure, investigated and identified the cause at 6:50 PM. Engineer Pelumi then rolled back the configuration change so that the default server listened on port 80 again. After the web server was reloaded to apply the rollback, a further curl request succeeded and the server returned the expected output.
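The failure mode described above lends itself to a small check that could run before a reload. The following is a minimal Python sketch, not the tooling used during the incident: it scans configuration text for listen directives and reports whether a required port, here 80, is missing. The function names (listen_ports, missing_ports) and the sample configurations are hypothetical, and the parser assumes one directive per line with an explicit port.

```python
import re

# Matches a `listen` directive at the start of a line and captures its
# first argument (a port, or an address followed by a port).
LISTEN_RE = re.compile(r"^\s*listen\s+([^\s;]+)", re.MULTILINE)


def listen_ports(config_text):
    """Return the set of TCP ports named in `listen` directives."""
    ports = set()
    for match in LISTEN_RE.finditer(config_text):
        target = match.group(1)
        if target.startswith("unix:"):
            continue  # Unix domain sockets have no TCP port
        # Handles "80", "127.0.0.1:80", "*:80" and "[::]:80".
        port_text = target.rsplit(":", 1)[-1]
        if port_text.isdigit():
            ports.add(int(port_text))
        # Address-only forms (no explicit port) are skipped in this sketch.
    return ports


def missing_ports(config_text, required=(80,)):
    """Return the required ports that no `listen` directive covers."""
    found = listen_ports(config_text)
    return sorted(p for p in required if p not in found)


# Hypothetical configurations resembling the change and its rollback.
BROKEN = """
server {
    listen 8080 default_server;
    listen [::]:8080 default_server;
}
"""

FIXED = """
server {
    listen 80 default_server;
    listen [::]:80 default_server;
}
"""
```

With these samples, missing_ports(BROKEN) reports port 80 as uncovered, while missing_ports(FIXED) returns an empty list. A check of this kind complements, rather than replaces, peer review and the syntax test that Nginx itself provides.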
Corrective and Preventative Measures
We have taken the following corrective and preventative measures to ensure that similar incidents do not occur in the future:
- We have implemented a peer review process for all configuration changes made to production systems.
- We have added monitoring and alerting to detect configuration issues and notify the appropriate team members promptly.
- We have updated our incident response procedures to ensure that all team members are familiar with the steps to take in the event of a configuration issue.
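As an illustration of the monitoring measure, a probe along the following lines could detect a refused connection on port 80 within seconds. This is a minimal Python sketch under stated assumptions: the function name port_accepts_connections, the two-second timeout, and the idea of running it on a schedule are hypothetical and do not describe our actual alerting setup.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, a timeout, or any other socket error
    counts as a failure and yields False.
    """
    try:
        # create_connection performs the TCP handshake; the context
        # manager closes the socket again once the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and timeout errors are OSError subclasses.
        return False
```

Calling port_accepts_connections("127.0.0.1", 80) from a scheduled job and alerting when it returns False would have flagged this incident shortly after the reload. The probe only confirms that something is listening on the port, not that the server returns the expected content.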
We apologize once again for the inconvenience caused, and we remain committed to providing our users with the best possible service.