DEV Community

Discussion on: How do you deal with incidents?

Hunter Peress

Hi Alan! I have input, but I see you posted yesterday. Wondering if you resolved it and have some additional insight?

Alan Barr

Thanks for checking in. At a medium/big company, incidents feel constant. Luckily I'm not involved in all of them, and we were fortunate to wrap this one up before the "end of the day". We're going on this cloud-native journey, and this app is an old-style application secret manager: a web app combined with a traditional database, so there are a couple of points of failure. We have alerting tools, but not everyone is aware that a process should send alerts to the supporting team; instead they might just @here in a big Slack channel. This was super minor compared to other stories I have, where a specific technology failed big time for a couple of weeks, the vendor could do nothing to help us, and everything was migrated off it shortly after. That wrecked trust for six months to a year. It was a big to-do starting on April Fools' Day, shortly after a previous huge outage caused by a reboot storm on a suddenly failing ESX host (imagine if your AWS EC2 VMs suddenly stopped, then started again in another region, but then stole all the resources from all your other VMs for a few hours, slowing everything to a crawl). I like hard challenges, and cleaning up messes is kind of a joy for me.

Hunter Peress

Wow!! I've never dealt with anything on the two-week scale... scary! Most of my incidents were solvable in a day. I can remember 3 all-nighters I had to handle over 5 years. Def slept in after those 😂😉 But I'm a fan of continually improving, making the system more resilient, improving communication, and getting to root causes. Glad you like messes!!

Alan Barr

Yeah, fortunately it was only one all-nighter, with rotating shifts across many different roles. There was lots of frustration because many people depended on it, even if it wasn't the best at what it does. The problem hasn't reappeared, but there was no clear root cause either, beyond "make sure we have xyz VM storage settings just in case", because this tech has a certain storage and processing story. I'm excited for this new Kubernetes world because resiliency and observability are more accessible, but I'm concerned about new strange problems.