How do you deal with incidents?

#discuss

It's near the end of the workday and something is blowing up. People are raging in Slack. How do you manage the chaos? I'm no expert at emergencies, though I've been through many, and I'd like to know how you manage problems at scale. How do you keep a handle on issues?

Top comments (5)

Mitch Pomery (he/him) • Jan 14 '21

If your incidents are chaos, it sounds like people aren't prepared for them (unless something truly chaotic has happened like a lightning strike taking out your primary DC).

I really like the PagerDuty Incident Response docs for outlining what the roles are in an incident and who needs to do what (and who needs to stay out of it). My work doesn't have documentation similar to it, but we do have dedicated incident response managers who help co-ordinate incidents and get the right people involved.

When I'm personally in incidents, I make sure to speak about what has happened, not who did it (e.g. "The firewall rules have changed" instead of "name changed the firewall rules"), and to state explicitly what I am going to do instead of asking permission (e.g. "I am going to redeploy X" instead of "Can I redeploy X?"). I have found that both habits help keep only the necessary people involved and reduce the time to recovery.

As an example, my team was called at 8 AM with a major incident. The person responding asked the incident manager "Can I redeploy?", to which the incident manager replied "Can you? Who do we need to ask?" Soon there were 5 other teams involved in the incident, all asking "Can this be redeployed?" and all answering "I don't know". When our team changed from asking "Can we redeploy?" to stating "We are redeploying", the incident manager immediately agreed, and the incident was resolved soon after.

Hunter Peress • Jan 15 '21

Hi Alan! I have input, but I see you posted yesterday. Wondering if you resolved it and have some additional insight?

Alan Barr • Jan 15 '21

Thanks for checking in. At a medium/big company incidents feel constant. Luckily I'm not involved in all of them, and this one fortunately finished before the "end of the day". We're going on this cloud-native journey, and this app is an old-style application secret manager: a web app combined with a traditional database, so there are a couple of points of failure. We have alerting tools, but not everyone is aware that a process should send alerts to the supporting team; instead, they might just @here in a big Slack channel. It was super minor compared to other stories I have, like one where a specific technology failed big time for a couple of weeks, the vendor could do nothing to help us, everything was migrated off it shortly after, and trust was wrecked for six months to a year. That one was a big to-do starting on April Fools' Day, shortly after a previous huge outage caused by a reboot storm on a suddenly failing ESX host (imagine if your AWS EC2 VMs suddenly stopped, started again in another region, and then stole all the resources from all your other VMs for a few hours, slowing everything to a crawl). I like hard challenges, and cleaning up messes is kind of a joy for me.

Hunter Peress • Jan 15 '21

Wow!! Never dealt with anything on the two-week scale... scary! Most of my incidents were solvable in a day. I can remember 3 all-nighters I had to handle over 5 years. Def slept in after those 😂😉 But I'm a fan of continually improving, making the system more resilient, improving communication, and getting to root causes. Glad you like messes!!

Alan Barr • Jan 15 '21

Yeah, it was only one all-nighter, fortunately, plus rotating shifts with many different roles. There was lots of frustration because many people depended on it, even if it wasn't the best at what it does. The problem hasn't reappeared, but there wasn't a clear root cause either, beyond "make sure we have xyz VM storage settings just in case", because this tech has a particular storage and processing story. I'm excited for this new Kubernetes world because resiliency and observability are more accessible, but I'm concerned about new, strange problems.