According to Merriam-Webster, resilience is defined as “an ability to recover from or adjust easily to misfortune or change.”
Toggle thought this topic was a good follow-up to last week’s conversation about productivity. We are all adjusting our expectations of what it means to be productive, and we’ve had to adapt quickly to new remote working situations. Systems are being pushed to their limits as large numbers of people move quickly to cloud-based solutions for meetings, social gatherings, and educating students. Adjusting well to these changes requires resiliency.
Questions we posed on resiliency:
- How do you define resilience?
- How do you build resiliency in your systems?
- How do you increase your own tolerance for disruption and failure?
- What value can we derive from critical events?
If you are looking for resilience, you have to look at the big picture. From a technology perspective, if you are striving for five-nines (99.999%) availability, you have to look not just at the technology but also at the people, the processes, and the organization as a whole.
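For a sense of scale, five-nines availability leaves only a few minutes of downtime per year. Here is a minimal sketch of that arithmetic, assuming a 365-day year. The function name `downtime_budget_minutes` is illustrative and not something from the discussion.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare three-, four-, and five-nines targets side by side.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes per year")
```

At 99.999% the budget works out to a little over five minutes a year, which leaves very little room for a slow human response. That is one reason the people and processes matter as much as the technology.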
Jennifer Davis: @dparzych Building resilience in systems is often around the people and processes and managing the software and systems. We don't have infinite resources, and we can easily introduce complexity in managing systems by trying to make them compensate for every single possibility. #ToggleTalk (15 Apr 2020)
Sociotechnical models do just this and help when it comes to resiliency. Sociotechnical theory looks at the interrelationships between the social and technical aspects of a system. Consider how people will use the software, who will be using it, and who will be supporting it. This can help you build resilience as you adapt to the changing social aspects.
When looking at the social aspects, remember that resources are finite. And people are not resources. People do not have an infinite ability to respond to and recover from failures. We use metrics to track the health of individual elements of our systems. We can also use metrics to track our own health and ability to respond to failures.
Rin Oliver is job hunting! (they/them): Another #ToggleTalk Q: How do you increase your own tolerance for disruption & failure?
A2: I increase it through practicing mindfulness, meditation, and mood tracking. I keep and use @sanvellohealth, @Headspace, and Daylio on my phone at all times, basically. (15 Apr 2020)
One aspect of resilience is sustained adaptability. This is where humans come in. People make decisions about what to build, how to build it, and how and when to change it. Systems will not adapt without humans. It isn’t possible to separate the human from the tech.
Rin Oliver is job hunting! (they/them): Aaaand yet another #ToggleTalk Q: What value can we derive from critical events?
A4: Everything is on fire = A great time to reflect on *what* your messaging is. Ask what you're *really* saying. Look at that messaging, who is saying it, and amplify what truly matters. People. (15 Apr 2020)
I love the framing of incidents as surprises. It takes away some of the stigma attached to incidents. If we frame incidents as surprise learning opportunities, it helps us figure out what the best response is.
@EmotionalAPI @OSMIhelp I love @johnksawers / @EmotionalAPI ’s take on the Emotional Retrospective, too.
What a great thought to look back and go “huh, what happened here? how did it make me feel? What can I learn from that?” We do it for our computer system surprises, why not our own? #ToggleTalk (15 Apr 2020)
The conversation about resilience and surprises seemed to naturally lead to a discussion of mental models. A mental model is someone’s internal explanation of how something works. Mental models help us understand and interpret the relationships between things. When we encounter an obstacle, we may have to update our mental models. The solution that worked previously may not work the second time around. Our ability to continually update our mental models is part of our resiliency.
I keep pushing to include the conversation about mental models in conversation about observability, resilience, experimentation and chaos engineering. It’s all about improving our understanding of how the technical side that lives “below the line” #ToggleTalk twitter.com/richburroughs/… (15 Apr 2020)
Rich Burroughs: @crayzeigh @dparzych Yeah true. I think that no matter how well someone understands a system, their mental model will be far from completely accurate. We're always going to be surprised by things. And we can't build something that will anticipate every possible failure. #ToggleTalk (15 Apr 2020)
During #ToggleTalk, we touched on all four concepts for resilience as outlined by David Woods (see article below):
- Ability to rebound
- Robustness
- Graceful extensibility
- Sustained adaptability
We need to look at technology from a sociotechnical perspective for true resiliency.
Thanks to everybody who joined in this week’s discussion on resilience. See you next week on #ToggleTalk!
There is an upcoming conference (next week on April 21st!) if you want to learn more about resilience engineering and the process for building systems that can withstand unexpected failures. You can register here: FailoverConf (free of charge!). We will be there, and so will one of our Developer Advocates, Heidi Waterhouse.
Or you can check out these recommended reads and talks:
Resilience is a Verb
OOPS! Learning from Surprise at Netflix