An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident Response is an organization’s effort to detect, analyze and correct the hazards caused due to an incident. In the most common cases, when an incident response is mentioned, it usually relates to security incidents. Sometimes incident response and incident management are more or less used interchangeably.
However, an incident can be of any nature, it doesn’t have to be tied to security, for example:
- Physical damage to hardware or systems (fire, flooding)
- Human error (misconfigurations, accidental deletion of data)
- Malicious actors (denial of service attacks, malware, ransomware)
Every incident is different and may require a different response. The incident response consists of steps taken by an organization to address the outage and reinstate services to their normal operation, often in real-time. For example, treating an outage is referred to as an incident response.
A good incident response plan can help your company respond quickly and effectively when an outage occurs. Keep in mind that incident response is not just a technical function to be done by a specific team. Instead, it is more of a corporate process that involves all areas of the business.
The biggest change in the world of incident response was the widespread adoption of automation.
Traditionally, the incident response was a highly manual process. Everything from creating a ticket to patching a server required human interaction. It was effective until the world experienced the internet boom.
Easy internet access has certainly opened up opportunities for people and businesses alike. According to IDC, 60% or more organizations have spent more on technology to embrace the digital future.
The rise in use of digital platforms has resulted in complex infrastructures with multiple application dependencies. Hence, downtime and system failures for even a few minutes can incur huge monetary losses (in some cases, even millions).
In order to avoid such events, organizations have resorted to dealing with incidents using teams that are on-call 24/7. This puts a lot of pressure on incident response teams as they are required to manually monitor systems, keep track of alerts and avoid fatigue. Hence automating some or most of the incident response processes can help get rid of repetitive work. It helps response teams be more effective with less effort.
That's not to say that people are no longer involved with incident response. People are still involved in triage, troubleshooting, and postmortem analysis. It's just that those tasks are much less frequent than they were before automation became the norm.
Incident Response used to be about reacting to what happened with a solution for an immediate ‘bleed stop’. Nowadays, it is more about being proactive and trying to prevent incidents altogether by understanding and gaining intelligence about why something has happened.
Incident response and management have become more of a DevOps-based activity. Where operational issues are addressed through code and automation, rather than manual intervention.
In the SRE (Site Reliability Engineering) realm, the incident response can be divided into following steps:
- Resolution and Recovery
Let’s expand on those and understand how incidents were responded to in the past, and how they are now.
This step is where you detect an issue or determine if there has been a breach. A breach or incident could originate from different sources.
Traditional: Primary source of detection would most likely be calls or emails from the impacted users. Monitoring and alerting tools weren’t as ubiquitous as today.
Modern: An issue will usually be caught through monitoring and alerting on metrics, or in another case by people noticing something strange while they're doing their work. With alerting tools and right schedules in place it is easier to detect such issues so they can be dealt with due process.
This is the step where you analyze the issue at hand and take a call on whether to contain the damage or terminate the concerned services.
Traditional: The limitations of technology made it difficult to connect globally. Cross-functional localized teams would come together to figure out the issue. It often led to forcing resources to quit the work at hand and focus on solving the issue. This chopping and changing would particularly impact developers the most.
Modern: Modern-day teams analyze the metrics and logs to determine how bad the outage is. Is it a brief spike in errors? Are a few nodes going offline? Or is it a full-on service disruption? This step involves analyzing metrics and logs before responding further. This is where your colleagues from other sectors would collaborate for help. Using modern ChatOps tools like Slack, Microsoft Teams helps in effective collaboration. This keeps the right people connected even globally if needed.
Once you've analyzed and pinpointed the root cause, you need to resolve the issue and ensure the system has recovered, with the affected systems and devices up and running again.
Traditional: The process was unstructured. There was a lack of coordination between people, which led to support people tripping over and duplicating efforts. The aim of recovery was to get the system up and running, and nothing much followed. Getting to the root cause was rarely an objective until the same issue occurred repeatedly.
This changed with time as processes were put into place. But lack of automation meant that the on-call schedules were still not very efficient and there was a lot of manual work.
Modern: These days, various tools and techniques are used to deal with issues. The decision is based on the issue that is being dealt with and the team's capabilities. For example, if you're experiencing network issues and your team has access to network engineering resources, they may be able to resolve the issue quickly by adjusting settings on routers or switches.
Recovery is usually coordinated by the on-call incident handler, who is responsible for implementing a solution and making sure it does not fail. The SRE team then follows up with the manager to make sure the fix works as intended and, if necessary, to mitigate any damage caused by the outage. Another goal is to prevent such incidents from happening again
A postmortem is written after the issue is resolved, and everything has calmed down. Once the postmortem write-up is ready, a meeting occurs and is led by an SRE manager or incident handler who distributes the postmortem notes to relevant parties within the organization.
The goal of this meeting is to review what happened during the outage, why it happened, what was done to stop it, and how it could have been prevented in the future. The postmortem then becomes part of an organization's operational history, allowing teams to learn from past mistakes and improve their overall reliability going forward.
Traditional: Traditional postmortems were either internal reports that were never seen outside the company or formal reports submitted to external auditors.
Both constraints made it difficult to share detailed information about what happened and why it happened. Traditional postmortems are typically tactical documents that focus on how IT personnel responds to an incident.
Modern: The practice of postmortems is an established part of modern incident response and is generally written as after-the-fact documentation.
Modern digital postmortems are more inclusive of all teams involved, including the stakeholders. And should be viewed as strategic points that focus on lessons learned by the entire organization.
They can be used for training purposes since they document case studies from completed investigations. They allow you to:
- Analyze past issues
- Find trends and make predictions about future risks
- Help you learn from mistakes
- Prevent a recurrence.
An excellent example for a postmortem template, and what should be included, can be found in the first SRE book by Google.
This brings us to the end of this blog. We have successfully explored what incident response is and how it has evolved with time.
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.