Many people consider it to be Incident Response. In this case, an incident refers to something not operating correctly in a cloud environment. The issue could be small, say a performance impact that doesn’t really affect successful operations. Alternatively, the impact could be massive: an outage that has ceased all operations, with data loss, revenue loss, and a damaged reputation (consider the recent Facebook outage or Roblox outage).
When an incident occurs, organizations will typically be alerted by a monitoring tool. These tools have parameters and ranges of acceptable values (think too many HTTP 500 errors, or responses that take too long). When a value falls outside its range, the tool will likely trigger an alert, which could come from a separate alerting tool. The alerting tool could be simple (page everyone), or it could have sophisticated rules to page just the on-call team or the subject matter experts for the type of incident. Next comes the much harder part: fixing the problem.
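As a rough illustration of that monitoring-then-alerting flow, here is a minimal Python sketch. The thresholds, metric names, and team names are all hypothetical and chosen only for the example; real monitoring and paging tools expose their own configuration for this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    total_requests: int
    server_errors: int      # count of HTTP 5xx responses
    p95_latency_ms: float   # 95th percentile response time


# Hypothetical acceptable ranges; real values depend on the service.
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_MS = 800.0

# Hypothetical mapping from violation type to subject matter experts.
EXPERTS = {"error_rate": "backend-experts", "latency": "performance-experts"}


def find_violations(m: Metrics) -> List[str]:
    """Return the names of the thresholds that the metrics exceed."""
    violations = []
    if m.total_requests > 0 and m.server_errors / m.total_requests > MAX_ERROR_RATE:
        violations.append("error_rate")
    if m.p95_latency_ms > MAX_P95_LATENCY_MS:
        violations.append("latency")
    return violations


def route_alert(violations: List[str], page_everyone: bool = False) -> List[str]:
    """Decide who gets paged: everyone, or on-call plus the relevant experts."""
    if not violations:
        return []
    if page_everyone:
        return ["everyone"]
    recipients = ["on-call"]
    for violation in violations:
        team = EXPERTS.get(violation)
        if team and team not in recipients:
            recipients.append(team)
    return recipients


sample = Metrics(total_requests=1000, server_errors=80, p95_latency_ms=300.0)
paged = route_alert(find_violations(sample))
print(paged)
```

With the sample metrics, the error rate is 8%, above the assumed 5% limit, so the on-call team and the hypothetical backend experts are paged while the latency check stays quiet.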
These three terms all refer to the same basic result: the incident is fixed and operations are back to normal. Depending on the incident, the fix could be very simple, or it could take hours or days. The average amount of time taken to repair an incident is called mean time to repair (MTTR). Obviously, you want your MTTR to be as low as possible, and you may want to consider more advanced tools and methodologies to achieve that. Reducing MTTR is one of the key objectives of a site reliability engineer (SRE).
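MTTR is usually read as an average across incidents (the "mean" in the acronym). A minimal sketch of the arithmetic, assuming each incident is recorded as a hypothetical pair of detection and resolution timestamps:

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_repair(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all recorded incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


# Made-up incident records, for illustration only.
incidents = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 30)),     # 30 minutes
    (datetime(2022, 1, 10, 14, 0), datetime(2022, 1, 10, 16, 0)),  # 2 hours
    (datetime(2022, 1, 21, 22, 0), datetime(2022, 1, 21, 23, 30)), # 90 minutes
]
mttr = mean_time_to_repair(incidents)
print(mttr)
```

Here the three repairs total 4 hours, so the MTTR is 1 hour 20 minutes. Shortening any single repair lowers the average, which is why automating slow steps pays off.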
Savvy organizations start by creating runbooks. A runbook is basically an instruction manual on what to do, and in what order, to remediate the incident. Simple incidents could be handled by level 1 support personnel, while multi-day outages will be all hands on deck.
Runbooks can have many steps in them, but a typical set of high-level steps is as follows:
- Type of incident, what services are affected
- How to collect the data and logs to verify the incident
- What to do to correct the incident (this could run to several pages)
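The high-level steps above can be sketched as a small data structure in which some steps are automated and others are left to a person. This is only an illustration: the `Step` and `Runbook` classes, the incident type, the service names, and the placeholder actions are all hypothetical and do not represent any particular product's format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    # An automated step carries a callable; None means a person does it.
    action: Optional[Callable[[], str]] = None


@dataclass
class Runbook:
    incident_type: str
    affected_services: List[str]
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[str]:
        """Execute automated steps in order and flag manual ones."""
        log = []
        for number, step in enumerate(self.steps, start=1):
            if step.action is None:
                log.append(f"{number}. MANUAL: {step.description}")
            else:
                log.append(f"{number}. AUTO: {step.description} -> {step.action()}")
        return log


def collect_logs() -> str:
    # Placeholder: a real step would query a logging system.
    return "logs collected"


def restart_service() -> str:
    # Placeholder: a real step would call an orchestration API.
    return "service restarted"


runbook = Runbook(
    incident_type="high error rate",
    affected_services=["checkout-api"],
    steps=[
        Step("Collect data and logs to verify the incident", collect_logs),
        Step("Decide whether a restart is safe"),
        Step("Restart the affected service", restart_service),
    ],
)
for line in runbook.run():
    print(line)
```

Keeping the decision step manual while automating log collection and the restart mirrors the split between routine work and expert judgment.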
At Fylamynt, we call a runbook a workflow. Fylamynt has built the world’s first enterprise-ready low-code platform for building, running, and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of the runbook, freeing them to make decisions where their expertise is needed and removing mistakes and simple errors.