194 years of downtime: looking back on incident data from 2018

#outage #devops

Statuspage customers logged more than 194 years of collective incidents in 2018. That’s a whopping 87% increase from the 104 years logged in 2017, and we aren’t even through December yet.

Open incident communication is becoming more and more important to companies and their customers. This is underlined by the big names that set up a public Statuspage this year, such as GitHub, LinkedIn, and Yelp. With more focus on incident communication comes more focus on incident management in general. Companies are spending more time and resources preparing for downtime, as we learned from a handful of customers we profiled on how they prepare for high-traffic days.

We dug deeper into our 2018 data to get a better idea of when and how our customers communicated around downtime this year. The data represents all reported incidents – from small blips in service to large-scale outages – plus any planned downtime logged through scheduled maintenance.

What the numbers mean

Sure, the sharp increase in hours of incidents logged from 2017 to 2018 can in part be attributed to an increase in total number of Statuspage customers, but we also believe it reflects the increasingly cloud-first mentality of companies relying on SaaS products. Companies are choosing to communicate around these incidents, and customers have come to expect this type of transparency.

In addition to the jump in number of incidents logged this year, we also saw the average number of updates per incident nearly double. With an average of 4.4 updates per incident this year, we believe that companies are prioritizing frequent, transparent communication with their customers.

We were also surprised to see that nearly half of our customers (45% to be exact) have opted into some form of page automation by integrating with an alerting or monitoring tool. While we advocate for always keeping a human element in your incident comms process, setting up some level of automation can definitely save time when it matters most. Many customers take this hybrid manual/automated approach to save time without risking a poor customer experience.

While incidents logged and updates posted are rising, there are still very few postmortems written – only 3% of incidents logged in Statuspage over 2018 had a postmortem attached. This isn’t too surprising, as not every incident requires a postmortem (and some companies write postmortems on a company blog instead), but we imagine this percentage rising over 2019 as customers come to expect this type of follow-up.
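For readers who want to check the headline math, here is a minimal sketch that recomputes the year-over-year increase from the two totals quoted above (104 and 194 years). The conversion to hours assumes 365-day years, and the function and constant names are just illustrative.

```python
# Totals quoted in this post, in collective years of incidents.
YEARS_2017 = 104
YEARS_2018 = 194

# Assumption for illustration: every year is a 365-day year.
HOURS_PER_YEAR = 365 * 24


def percent_increase(old: float, new: float) -> float:
    """Return the relative change from old to new, as a percentage."""
    return (new - old) / old * 100


increase = percent_increase(YEARS_2017, YEARS_2018)
hours_2018 = YEARS_2018 * HOURS_PER_YEAR

print(f"Increase 2017 -> 2018: {increase:.0f}%")
print(f"2018 incident time: {hours_2018:,} hours")
```

Rounded to a whole number, the increase matches the 87% figure, and 194 years works out to roughly 1.7 million hours under the 365-day assumption.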

Stand-out incidents

There are some days when downtime is more likely for certain companies or industries. Cyber Monday is one example – a day when e-commerce companies see a huge spike in traffic to their websites or apps. For Amazon, Prime Day (their biggest sale of the year) is that day – rivaling even the craziest Black Friday and Cyber Monday traffic. Though the retail giant still achieved a record year in sales, shoppers had trouble connecting to Amazon.com for over an hour, causing a lot of customer frustration and an estimated revenue loss of up to 100 million dollars. The silver lining was a flood of cute dog pictures on Twitter, showcasing the power of a great error page.

For Epic Games, their “prime” traffic days came as players flocked to play their very popular video game, Fortnite. They experienced periods where over 3 million gamers were playing concurrently, resulting in some big service interruptions. During an incident in June, players from all over the world headed to Epic Games’ status page to see what was going on, resulting in a peak of about 15,000 requests per second (our most highly trafficked incident to date). Major kudos to Epic Games for writing very thorough postmortems to close the loop on big incidents.

Some form of downtime is inevitable – especially with an extreme load like the one Fortnite experiences. Epic Games shows us that it’s how you handle that downtime and communicate with your customers that really matters.

And we can’t forget the IRS, which had an unusually stressful 2018 Tax Day when their website crashed on April 17th, the tax filing deadline. This was highly problematic, as approximately 10 million Americans wait to submit their taxes on the last day. They ended up extending the deadline to April 18th, but communication in the meantime wasn’t exactly ideal. The original IRS error message reported a planned downtime event from April 17th, 2018 to Dec 31st, 9999 – yikes.

Downtime happens to the best of us, but accurate and frequent updates go a long way. We wrote an open letter to the IRS offering some advice and a free Statuspage – offer valid until Tax Day 2019. We’re still waiting for them to take us up on it.

#HugOps for 2019

While there may have been more hours of downtime this year, there was also a lot more love and appreciation (#HugOps) shown to the companies who were open about the bad times – more than 7,000 tweets and retweets mentioning HugOps, in fact. We started sending actual HugOps posters to people who retweeted our digital HugOps posters, and have sent more than 70 this year. That means 1% of all HugOps tweeters are now proudly displaying a Statuspage HugOps poster in their office like the one below – hooray!

The latest in Atlassian for incident management

While incident communication is a large part of incident management, it’s only one piece of a bigger puzzle. At Atlassian we’ve doubled down on our investment in incident management tools and practices. Check out what we’ve been up to:

*Postmortems for Jira Ops:* One of the most important parts of the incident management process is the postmortem. This is where incident response teams can learn, improve, and collect all the returns for the time and investment made trying to resolve the incident. Unfortunately, the postmortem process is often neglected because it’s too time-consuming and difficult to manage. A key time-saver with Jira Ops postmortems is the incident timeline, which gathers all the key events from the incident in chronological order. Teams can analyze what happened, identify root causes, and create Jira Software issues directly from the postmortem to ensure actions are taken to improve from every incident. Learn more.

*Automation Actions for Opsgenie:* Incident responders often take predictable, repetitive actions in response to an alert. These actions might include gathering more info about a particular system, running network diagnostics, increasing cloud resources, or restarting a service. Automation Actions enable you to run automated scripts and playbooks via third-party platforms. Opsgenie now offers support for two automation integration methods: AWS Systems Manager and Generic REST Endpoint. Teams can integrate with these platforms to trigger the automated tasks right from the Opsgenie console or mobile app. This saves responders time, reduces the number of applications they need to use during incident response, and can positively impact MTTR. Learn more.
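To make the idea of "predictable, repetitive actions" concrete, here is a rough sketch of mapping alert types to a first automated response. It does not use the Opsgenie API or AWS Systems Manager. The alert names, targets, and functions are all hypothetical stand-ins for whatever scripts your team already runs.

```python
from typing import Callable, Dict


# Hypothetical actions; real ones would call your own scripts or playbooks.
def restart_service(target: str) -> str:
    return f"restart requested for {target}"


def run_network_diagnostics(target: str) -> str:
    return f"diagnostics started for {target}"


# Hypothetical mapping from alert type to a predictable first response.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "service_down": restart_service,
    "high_latency": run_network_diagnostics,
}


def handle_alert(alert_type: str, target: str) -> str:
    """Run the mapped action, or hand off to a human if none is mapped."""
    action = ACTIONS.get(alert_type)
    if action is None:
        return f"no automated action for {alert_type}; escalate to on-call"
    return action(target)


print(handle_alert("service_down", "checkout-api"))
print(handle_alert("disk_full", "db-01"))
```

Anything without a mapped action falls through to a person, which keeps a human in the loop for the cases automation was never meant to cover.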

Tweet this report, get a poster

Anyone who tweets will receive a free HugOps poster to display as a reminder that your team is supported when downtime strikes in 2019…

This article appeared first on Atlassian Blog.