As an engineering manager, I consider culture one of the most important aspects of my job. You want to create an environment where engineers can make a difference. To be truly innovative, you have to work at the edge of your capabilities, and that means you cannot be right 100% of the time. But how do you create a culture where failing is acceptable?
There are three key aspects to creating such a culture. Let's take a closer look at each of them.
The single most important one is psychological safety, which every high-performing team needs. People will never open up about their mistakes if they are afraid of being punished. Needless to say, support from higher-ups is key in creating this culture.
Managers should lead by example and share their own failures as well. Instead of brushing it off, share when something backfires completely. People need to know it is okay to make a mistake.
The behavior we want to see here is something I wrote about in a previous post: the Theory Y manager from the post on McGregor's theory is the type we need to build this culture.
What counts as failing? The range of failures is wide. On one end, a small bug slips through testing and has a minor user impact. On the other end, someone accidentally deletes the production database, and the department works overtime the entire weekend to recover.
What does failing fast mean? It means that you can quickly recover from failure. I've worked on projects where releasing took about 35 minutes. This meant that any time you released a significant bug or had an issue, it would take at least 35 minutes to recover. If that happens during peak hours and your customers are unable to use your product, that hurts. Now imagine a situation where you can fail fast. You have a blazing fast production pipeline that makes rollback super easy. No harm done: maybe a few users noticed before your alerting triggered and you rolled back.
Fail your way to success
To make it possible to fail fast, you need to make releasing and enabling features as easy and fast as possible. What has worked well for me in the past:
- Your releases need to be as fast as possible. Notice something went wrong? Roll back in 2 minutes. That would be a lot more painful if your release-to-production pipeline takes 35 minutes.
- You get bonus points if you have blue/green deploys and are able to switch back to a previous version.
- Feature flags can save you if your pipelines are slow. Instead of having to release your product again, you can turn off the feature that is throwing the errors.
- Alerting and monitoring are key. To fail fast, you need to spot issues fast as well.
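To make the feature flag and alerting points a bit more concrete, here is a minimal Python sketch of a kill switch. Everything in it is an assumption for illustration: the `FeatureFlags` and `ErrorRateMonitor` classes, the `new_checkout` flag name, and the 50% threshold are hypothetical, and a real setup would more likely use a flag service and a proper monitoring stack instead of in-memory objects.

```python
from collections import deque


class FeatureFlags:
    """In-memory flag store (hypothetical); real flags usually live in a config service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name):
        # Unknown flags default to off, so a missing flag never enables new code.
        return self._flags.get(name, False)


class ErrorRateMonitor:
    """Tracks the outcome of the last `window` calls and reports the error rate."""

    def __init__(self, window=20):
        self._outcomes = deque(maxlen=window)

    def record(self, ok):
        self._outcomes.append(ok)

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)


def new_checkout(order_total):
    # Hypothetical new feature with a bug: it fails on zero-value orders.
    if order_total == 0:
        raise ValueError("cannot process empty order")
    return f"new:{order_total}"


def old_checkout(order_total):
    # The existing, known-good code path.
    return f"old:{order_total}"


def checkout(order_total, flags, monitor, threshold=0.5):
    if flags.is_enabled("new_checkout"):
        try:
            result = new_checkout(order_total)
            monitor.record(True)
            return result
        except ValueError:
            monitor.record(False)
            # Kill switch: too many recent failures turns the feature off.
            if monitor.error_rate() >= threshold:
                flags.set("new_checkout", False)
    # Flag is off, or the new path just failed: fall back to the old path.
    return old_checkout(order_total)
```

With the flag enabled, a failing call falls back to the old path, and once the recent error rate reaches the threshold the flag is switched off without a new release. Recovery then takes as long as flipping a flag, not as long as your pipeline.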
The best Dutch footballer ever described learning from failures in the best way:
"I learn from the mistakes of others, so I won't have to make them myself."
Learning from failures is something you should celebrate. At work, we have often joked about giving out an award to whoever made the biggest mistake of the week, or to someone the first time they break production.
To get the maximum benefit from a culture where failing is acceptable, make sure you learn from the mistakes of others as well. What works well for me:
- Organize sessions where you analyze failures to learn from them. Our site was down? Great! How did it happen, what can we learn from it, and how can we prevent it next time?
- The root cause is key. Dig until you find it, and make a plan to prevent the failure from happening again.
The goal of learning from your mistakes is to not repeat them. Failing is acceptable; failing multiple times on the same thing is not.
How to create a culture where failing is acceptable? There are three key aspects:
- Without psychological safety, you cannot succeed. Let managers lead by example and share their failures.
- Create your environment in such a way that you can quickly recover if anything goes wrong. Fast pipelines and feature flags can save you.
- Organize in such a way that you discuss failures and learn from the root cause, so you won't make the same mistake again.
I hope you liked this post and have learned how to create a culture where failing is acceptable. Subscribe to the newsletter to receive other great articles right in your mailbox.