Public Cloud Group for kreuzwerker

Posted on May 10

Incidents and Operational Resiliency - Why it Matters and What to Consider

#aws #cloud #devops #elasticsearch

Written by Thomas Hoffmann

Utilizing technology and new work methods to save money

Introduction

Are you prepared for an IT incident? If a core component of your infrastructure suddenly fails or data is lost, who would you contact? About two thirds of German mid-size companies would have no answer to these questions [1] - even though being prepared could save enormous costs.

Facts and Data

An average IT outage in a mid-sized company (200-5,000 employees) costs about 25,000 Euro per hour according to a recent study. On average, German companies experience up to four of these outages per year, each lasting about 3.8 hours - adding up to annual economic damage of around 380,000 Euro! [1]
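To see where the annual figure comes from, the arithmetic can be written down as a small Python sketch. The function name and parameters are made up for illustration; the numbers are the ones quoted above, and the calculation assumes the full four outages per year.

```python
def annual_outage_cost(cost_per_hour_eur: float,
                       outages_per_year: float,
                       hours_per_outage: float) -> float:
    """Estimated yearly damage: hourly cost x number of outages x duration."""
    return cost_per_hour_eur * outages_per_year * hours_per_outage


# Figures quoted from [1]: 25,000 Euro per hour, four outages, 3.8 hours each.
estimate = annual_outage_cost(25_000, 4, 3.8)
print(f"{estimate:,.0f} Euro per year")  # prints "380,000 Euro per year"
```

Plugging in your own hourly cost and outage history gives a rough first estimate of what an hour of downtime is worth to you.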

There are many reasons for the high level of damage: even though only one third of registered outages have any impact on customer operations [2], internal disruptions can still lead to widespread loss of productivity.

Causes and Mitigation

A process review can offer the greatest remedy: a full 20% of outages can be traced to poor process adherence [2]. At this point, the root cause must be carefully examined: the reason for the process deviation is an important clue as to what can be improved. In times of hybrid and remote work environments, the requirements for processes change as well. "People before process" is a helpful mantra to consider, keeping the focus on adjusting processes to the needs of your staff. This does not mean that "sensitivities" should dictate workflows, but that processes should support employees in their work as much as possible and not hinder them.

But technology can help as well: hyperscalers such as AWS in particular offer a plethora of ways to react to outages and errors - be it through smart monitoring and alerting or even automatic error resolution, for example by restarting a certain service or machine.
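The idea of automatic error resolution by restart can be illustrated with a minimal, provider-independent Python sketch. It is not an AWS API: `supervise`, `FlakyService` and their parameters are hypothetical names chosen for this example. A real setup would typically rely on the health checks and recovery features of the cloud provider or orchestrator instead.

```python
import time
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3,
              wait_seconds: float = 0.0) -> bool:
    """Check health and restart on failure.

    Returns True once the service reports healthy, or False if it is
    still unhealthy after max_restarts restarts (time to alert a human).
    """
    for attempt in range(max_restarts + 1):
        if is_healthy():
            return True
        if attempt == max_restarts:
            break  # restart budget used up, give up
        restart()
        time.sleep(wait_seconds)  # give the service time to come back
    return False


class FlakyService:
    """Hypothetical stand-in for a service that recovers after N restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1


service = FlakyService(restarts_needed=2)
recovered = supervise(service.is_healthy, service.restart)
print(recovered, service.restarts)  # prints "True 2"
```

Capping the number of restarts matters: a service that does not recover after a few attempts usually needs an alert and a person looking at it, not an endless restart loop.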

The choice of your cloud provider is therefore the first factor in a resilient infrastructure: AWS is one of the few cloud providers that has guaranteed the physical separation of its availability zones since its early years, thereby establishing geographical redundancy. Microsoft Azure only established mandatory physical separation in 2018 [3], and Google Cloud Platform still does not guarantee significant physical separation of its zones, although a 2023 incident of theirs famously demonstrated why such separation makes sense. [4]

The services and technologies used are also a key factor in achieving resiliency: smart monitoring and logging as well as properly configured autoscaling and established error management already go a long way.

Finally, a missing or unknown disaster recovery concept is another reason for long-lasting outages. While a prepared company can ideally restore basic operation with the push of a button, unprepared ones often have to take inventory first to see what actually needs to be worked on to restore operations.

A well-known scenario where you directly profit from being prepared is a ransomware attack. This not only affects your applications, but also renders all company data inaccessible. A well-architected cloud infrastructure with protected (tamper-proof) backups can save significant amounts of time and money in this case: affected applications can be quickly terminated and well-rehearsed data recovery operations can reduce any data loss to an acceptable level.

Conclusion

Incidents and outages are expensive, and they become more expensive as the number of employees and/or customers grows. Dealing with an outage ad hoc is error-prone and takes a lot of time. Being prepared by drawing on current data from industry and research and by establishing change, incident and recovery procedures will help to avoid incidents or at least keep them short. To prepare, all parts, processes, infrastructure and threat models of your production chain should be considered.

Make our Expertise Your Own!

As an AWS strategic partner, kreuzwerker can call on many years of experience to support you on your journey to resiliency: from best-practice and process reviews and optimization to expert knowledge on various AWS technologies, on observability and Elasticsearch as well as on orchestrating your microservice deployments on Kubernetes - we are happy to support you to the best of our ability.

Don't hesitate to get in touch if you wish to review your infrastructure and processes with regard to resilience - we look forward to working with and supporting you on your resiliency journey!

[1] https://digitalisationworld.com/news/27800/hp-studie-it-systemausf-auml-lle-kosten-deutsche-mittelst-auml-ndler-im-durchschnitt-fast-400000-euro-pro-jahr

[2] Uptime Institute, Annual Outage Analysis 2023

[3] https://azure.microsoft.com/en-us/blog/azure-availability-zones-now-available-for-the-most-comprehensive-resiliency-strategy/

[4] https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPfkY
