Page It to the Limit
Chaos Engineering With Bruce Wong
Creation of the term “Chaos Engineering”
Bruce tells us about how the term “chaos engineering” came to be and the mindset behind using the term.
“Let’s create a team strategy and vision around [Chaos Monkey and the practices around it] and let’s double down on what we already started. So in that fashion, we wrote a blog post that introduced the term ‘Chaos Engineering’ and introduced the term ‘Chaos Engineer.’”
What does Chaos Engineering really mean?
Bruce breaks down the pragmatic reasons this practice exists and why we should think about adopting it.
“It’s being proactive and getting a chance to validate our resilience design: finding out how well our systems are architected at 3pm instead of 3am.”
Chaos Engineering Thought Exercises
We discuss how tabletop thought exercises serve as a valuable tool to help you flesh out considerations long before touching any production systems.
“We call it ‘zero tech’ tabletops. I don’t want laptops. I don’t want distractions and excuses for why we can’t get started. And so I run these tabletop exercises, with a whiteboard, with a drawing of the architecture and we talk about our detection strategy, resilience, trade offs, and the parts that fail.”
But I’m not ready for Chaos Engineering!
A common response to the suggestion that a team adopt Chaos Engineering is that they’re simply not ready to get started. We discuss some ways to address these concerns.
“If we’re not ready for this, then are we really ready for production? Ready or not, failure is going to happen.”
Identifying big impact components to test
How do you prioritize which components of your stack to test? What are the considerations for figuring out where to start? Bruce gives practical advice on where in your stack to begin and how to spot opportune moments to seize.
“Cloud provider outages… are the best opportunities. They allow us to identify and be introspective about the things in our control that we can do about this.”
When should you start?
No, really. Big outage aside, when should we get started? Here’s where we see George’s managerial background kick in. Can we start today? Bruce provides some great practical wisdom around getting started as early as when new team members are being onboarded.
“When’s the time you want to start writing more resilient software?”
When the real outage happens
It’s important to celebrate wins. A Chaos Engineering win is when your team can relax while a failure happens.
“You’re celebrating because this thing failed exactly as we planned! It happens and there’s nothing for us to do. We’re just sitting back and watching the show.”
Capturing what we learn from Chaos Engineering
Building more resilient systems means taking the things we learn from Chaos Engineering exercises and ensuring that resulting action items make it into our work streams. How can teams do that successfully?
“The first time I did this, we did sprint planning and then the chaos engineering exercise. Nope. That’s the wrong order!”
Parting Advice
Bruce wraps up with practical tips for moving your teams in the right direction.
“You don’t need fancy tooling. You need 3 lines of code: if my user, fail this call.”
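As a rough illustration of the "3 lines" Bruce describes, here is a minimal sketch in Python. Everything named in it (`CHAOS_TEST_USERS`, `InjectedFailure`, `call_dependency`, and the `chaos-tester` account) is hypothetical and not something discussed in the episode; it only shows the general shape of failing a call for one opted-in user.

```python
class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


# Hypothetical: the accounts that have opted in to the experiment.
CHAOS_TEST_USERS = {"chaos-tester"}


def call_dependency(user_id, real_call):
    """Run real_call, unless the request belongs to an opted-in test user."""
    # The "3 lines": if it's my user, fail this call.
    if user_id in CHAOS_TEST_USERS:
        raise InjectedFailure(f"injected failure for {user_id}")
    return real_call()
```

Because only the listed account sees the failure, the experiment stays scoped to one user while you watch how the rest of the system detects and handles the error.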
Additional Resources
- PagerDuty Home Page
- Episode transcribed by Rev