Recently I made a post about how we moved from Elixir to Go because we needed more dev power and Elixir's strengths were not a panacea for us. It made the community very upset, which I hated. But now I would like to talk about something I love.
Outages.
I love outages.
When PagerDuty calls, or someone posts 'hey, something is wrong' in #tech-escalation, my adrenaline starts pumping and I feel so alive!
This is the list of things we do when an outage starts.
1. Make a Hangouts call
Before anyone knows what is going on, just make a call. Don't discuss it on Slack, don't wait. Make the call and ask for help, even if it turns out to be a false alarm.
2. Pick a leader
It is super important to say who leads. It is almost always the person with the most context, but sometimes they are busy doing reconnaissance; in that case it is usually me, or I pick someone I know can lead.
The call leader resolves any deadlocks, makes sure people are not duplicating work, as every second is essential, and finds additional support resources if needed.
3. Divide and Conquer
The leader starts assigning tasks: one person needs to identify the surface area of the bleeding, one has to preemptively roll back any service that is even remotely related, and one has to start investigating the impact and communicate with Customer Success.
Always roll back first, think later. This is also why it is very important to keep rollouts and rollbacks reasonably fast. The faster you can roll out, the faster you can roll back when you see something is wrong.
The leader must also call in other people if needed, and notify Ronen (our CEO) about the status.
4. Cleanup
After the issue is fixed, dedicate a small team to work with CS on damage control: talk with the affected users and do as much as possible to mitigate the damage.
5. RFO
Write a small document identifying what happened, what we learned, and what we have to work on in order to do better.
Talk especially about:
a) Alerting and Monitoring
Did we find out about the issue ourselves, or did a customer have to tell us? It is of utmost importance to be able to find out about issues before they impact users; sometimes this means we missed something in our end-to-end tests and it has to be fixed ASAP.
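To make this concrete, here is a minimal sketch of the kind of synthetic end-to-end probe I have in mind. The endpoint and the alerting hook are placeholders, not our actual setup; the point is just that something automated notices before a customer does.

```go
// probe.go: a tiny synthetic check that hits a hypothetical health endpoint
// and pages somebody if it fails, so we hear about the outage from our own
// alerting instead of from a customer.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// alert stands in for whatever actually pages the on-call person
// (PagerDuty, a Slack webhook, ...).
func alert(msg string) {
	fmt.Fprintln(os.Stderr, "ALERT:", msg)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get("https://example.com/health") // placeholder endpoint
	if err != nil {
		alert(fmt.Sprintf("health check failed: %v", err))
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		alert(fmt.Sprintf("health check returned HTTP %d", resp.StatusCode))
		os.Exit(1)
	}
}
```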
b) Rough timeline
We must write a rough timeline so we can investigate whether we can speed up the process somehow: for example, how much time passed between the first error message and the creation of the Hangouts call, how long it took to assign the tasks, and so on.
c) What was the impact
Just a rough estimate of how many users were affected.
d) What was the fix, and how can we avoid having this class of issues.
Can we fix this with better linting? Can we tweak our process a bit?
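As a made-up illustration of the kind of bug better linting can catch before it becomes an outage: an ignored error return in Go, which a linter like errcheck turns into a build failure instead of a production surprise.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Config struct {
	TimeoutSeconds int `json:"timeout_seconds"`
}

func main() {
	var cfg Config

	// Bug: the error from Unmarshal is silently dropped, so a malformed
	// config leaves cfg at its zero value and the problem only shows up
	// in production. A linter such as errcheck flags this line.
	json.Unmarshal([]byte(`{"timeout_seconds": "oops"}`), &cfg)

	fmt.Println("timeout:", cfg.TimeoutSeconds) // prints 0, and nobody notices until it hurts
}
```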
The best part about an outage is that it makes me feel part of a team. Of course, working with my team every day is also nice, but it's the difference between camping with friends and camping with friends while being attacked by a grizzly bear. Outages are just exciting. The atmosphere is so nice, everyone has everyone's back. There is zero blame. Everybody is trying their best to help.
Now, during COVID, I think outages are indispensable in bringing our remote team together.
Good vibes!
PS: pls, if you are the kind of person who looks for someone to point to when shit hits the fan, don't apply.