The sudden, unexpected shift to remote working introduces a new set of problems for incident response. While each organization needs to take its own circumstances into account, this post outlines practices that help keep operations both productive and proactive.
With COVID-19 cases increasing around the globe, many countries have publicly announced lockdowns. Some companies have found this fairly easy to manage, but many are struggling to define appropriate Work From Home (WFH) policies.
This has hit operations-heavy companies harder than most. Operations involve a massive amount of coordination, communication and responsiveness, all of which are tricky to accomplish when you suddenly have to work with your team remotely.
One clear result of being quarantined and maintaining social distancing is how overwhelmingly dependent we have become on the digital world. So, operations teams and on-call folks are under added pressure to keep IT infrastructure and applications in top shape. As a result, it becomes really important that we stay connected through multiple modes of communication. After all, incident response is all about getting to the right person at the right time and communicating effectively, not just within the on-call team but also to external stakeholders.
Incident alerting and management tools are accessible to people irrespective of where they work from. But they’re only useful if your incident management practices are sound and complement the tool.
Handling on-call on a normal day at work can be stressful, but if you are working remotely it becomes all the more crucial to ensure good communication protocol. It’s never too late to tweak your incident response processes to make it easier for Incident Management teams to be on the same page and ensure that your systems and services are always reliable. Here are some ways you can set yourself up for success.
As a company that focuses on streamlining incident response, we follow a few practices to always be remote-ready. They fall into three areas:
- Incident Communication
- Incident Response
- Incident Resolution
The cornerstone of any good incident response process is communication.
Document more: To reduce the risk of misinformation or communication gaps, write more and write better. It’s always better to have a record of information and associated activity to go back to, if necessary. When in doubt, throw in a few more details.
Use a central Slack Channel: If you love chatops or rely heavily on Slack for incident management, use a dedicated channel to bring in all your incidents. You might still have to create separate channels for discussing specific issues and outages. But a central channel can act as an index, and prevent the chaos of hunting for a specific incident and its status.
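To make the "central channel as an index" idea concrete, here is a minimal Python sketch that formats one line per open incident and wraps it in a JSON body with a single `text` field, the simple message shape that Slack incoming webhooks accept. The incident fields (`id`, `severity`, `title`, `status`, `channel`) and the function names are assumptions made up for this illustration, and actually sending the payload to your webhook URL is deliberately left out.

```python
import json


def build_index_message(incidents):
    """Format one line per incident so the central channel reads like an index."""
    lines = [f"Open incidents: {len(incidents)}"]
    for inc in incidents:
        lines.append(
            f"[{inc['severity']}] {inc['id']} {inc['title']} "
            f"| status: {inc['status']} | discussion: #{inc['channel']}"
        )
    return "\n".join(lines)


def build_webhook_payload(incidents):
    """Wrap the index text in a JSON body with a single "text" field."""
    return json.dumps({"text": build_index_message(incidents)})


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        {
            "id": "INC-42",
            "severity": "SEV1",
            "title": "Checkout API errors",
            "status": "investigating",
            "channel": "inc-42-checkout",
        }
    ]
    print(build_index_message(sample))
```

Each line points to the incident-specific channel, so the central channel stays short while the detailed discussion happens elsewhere.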
Virtual War Room: It goes without saying that collaboration is key to reducing your MTTR. You can mimic a traditional war room huddle by using a video conferencing tool or chat platform with the incident response team. With Squadcast, you can use our virtual war room to chat and bring in other members of your team, SMEs, stakeholders and business-facing teams to ensure that all your goals are aligned.
Publish Meeting link: You can create a virtual meeting room for just the fire-fighting and keep that open throughout your on-call rotation. You can add the meeting ID along with the incident details or pin the details in the Slack channel or any other communication tool you use.
We typically use Zoom to keep an open meeting room that anyone with the meeting ID can join in. You should be able to do it with other tools as well.
Talking is faster than typing, so it can be tempting to call someone with every doubt. However, call only when the situation demands it. No one likes to be constantly interrupted.
Be transparent: Communication can take a major hit with work from home teams. This happens simply because you may think you’re communicating all the available information you know but may miss out on some prerequisites needed to comprehend the information better.
To avoid this, add all of the relevant teams while dealing with an incident. Also, remember to update your status page with any new information as soon as the incident is resolved.
This opens them up to all the activity taking place, the severity of the issue and acts as a single platform to discuss and share. With everyone always informed, you don’t have to struggle with context switching for just drafting the right message to send to external teams or customers.
Once you get the communication processes right, incident response gets simpler. You can focus on firefighting without having to worry about anything else.
Assign Roles: You are quickest when you know what you need to do. This kind of clarity can be achieved by simply assigning roles to your incident response team. This also helps distribute work that would otherwise fall on just the one person figuring out a fix.
For teams of just one or two people, a simple checklist of things to do when an incident hits can go a long way. It clears the mind of any doubts about pending work.
Timeline of Incident Activity: Usually, a scribe is expected to maintain a record of all incident-related activities. It’s always a good idea not to trust your memory in a high-stress situation. A written record gives you all the information necessary to analyze better, write better postmortems and create an effective playbook as a pre-emptive measure. With Squadcast, we use our automated timelines to understand the resolution activities while conducting postmortems.
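If you don’t have automated timelines, even a tiny script can play the scribe. The sketch below is only an illustration and not how Squadcast’s timelines work: the `IncidentTimeline` class and its `record` and `render` methods are hypothetical names, and it simply keeps timestamped entries in memory and prints them in chronological order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTimeline:
    """Hypothetical in-memory scribe for a single incident."""

    incident_id: str
    entries: list = field(default_factory=list)

    def record(self, actor, action, at=None):
        # Default to "now" in UTC so remote teammates share one clock.
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, actor, action))

    def render(self):
        # Sort by timestamp so late-entered notes still appear in order.
        lines = [f"Timeline for {self.incident_id}"]
        for at, actor, action in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{at:%Y-%m-%d %H:%M:%S} UTC  {actor}: {action}")
        return "\n".join(lines)


if __name__ == "__main__":
    timeline = IncidentTimeline("INC-42")
    timeline.record("asha", "acknowledged alert")
    timeline.record("ben", "rolled back latest deploy")
    print(timeline.render())
```

The rendered text can be pasted straight into a postmortem draft.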
Set up an automated on-call rotation: If you haven’t already set this up, expect motivation to slide among the engineers who cover on-call today. Without a rotation, it is highly likely that the stress of incidents falls on just one or a few people.
It’s a load off your mind to know beforehand when you’ll have to go on-call. Rotations also help you assign an appropriate load to everyone on the team.
Remember, being on-call is everyone’s responsibility.
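As a rough illustration of what an automated rotation does, the following Python sketch builds a simple round-robin schedule and looks up who is on call on a given day. The names, the weekly shift length and the helper functions `build_rotation` and `who_is_on_call` are all assumptions for this example; a real scheduling tool would also handle overrides, time zones and escalation.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Return (shift_start, shift_end, engineer) tuples in round-robin order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    people = cycle(engineers)
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, next(people)))
    return schedule


def who_is_on_call(schedule, day):
    """Return the engineer whose shift covers `day`, or None if no shift does."""
    for shift_start, shift_end, engineer in schedule:
        if shift_start <= day <= shift_end:
            return engineer
    return None


if __name__ == "__main__":
    # Hypothetical team and start date, for illustration only.
    rota = build_rotation(["asha", "ben", "chen"], date(2020, 4, 6), shifts=6)
    for shift_start, shift_end, engineer in rota:
        print(f"{shift_start} to {shift_end}: {engineer}")
```

Because the schedule is generated ahead of time, everyone can see their upcoming shifts, and the load is spread evenly across the team.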
Always Create Runbooks (with fallback options): It’s useful to create a knowledge base of all incident resolution information that one can refer to when similar incidents hit your service. This way, you don’t have to spend time figuring out the incident all over again.
Runbooks are especially useful to folks who are new to on-call or newer in your organization. It’s always good to have more information when you’re new.
Blameless Postmortems: Another great source of information is postmortems and post-incident reviews. Not many organizations follow through and finish a postmortem, simply because it’s a long, tedious and sometimes stressful process. But the best way to ensure that an incident doesn’t occur again is to analyze why it happened in the first place and then make this information available to the entire team. In Squadcast, you can create postmortems of incidents from within the app, and anyone on your team can view them.
The unexpected and sudden shift to remote working introduces new risks. And while each organization needs to take its own unique circumstances into account, the practices above are a step in the right direction toward keeping operations both productive and proactive.
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.