Beekey Cheung

Posted on Sep 23, 2017 • Originally published at blog.professorbeekums.com on Apr 16, 2017

The Problem With Heroes In Software Development

Imagine your web application goes down in the middle of the night. It’s 2 AM, but your business is global. You have users in every time zone. They’re angry. They’re unable to purchase things on your website or are canceling their subscriptions. Money is being lost every minute your web application is down.

Suddenly, one of your developers is on the case! This developer shows a little anxiety and curses a lot (it is 2 AM after all), but eventually the problems are resolved. The application is running again and money is flowing into your business. Despite this kind of situation happening from time to time, you are comforted knowing that you always have this developer to save the day. This developer is your hero.

No one in this situation should feel comfort though. It is an extremely risky position to be in for a company. It negatively impacts everyone who works at that company. And despite the immense feeling of importance, it hurts the hero developer as well.

Let’s start with the problems for the company. If there are known stability issues with an application being live, then they should be addressed. Crisis management almost never addresses the underlying issues. Crisis management is about making it through another day. The “fixes” are temporary. There is still the real problem that should be fixed.

Being comforted by having a hero on call makes those real problems seem less urgent than they really are. There is a strong temptation to put off fixing those problems in favor of working on new revenue generating activities.

But crisis management is not the same as crisis prevention.

An application that frequently goes down earns a bad reputation among users. This makes retention difficult and will eventually make it harder to gain new users. There is also the possibility that the hero is unavailable. There are many ways this can happen. What if they get sick? Or a relative gets sick? Or they go on vacation? Or they just decide they’re tired and quit?

In all those cases, your application is now down for a much longer period (hours instead of minutes or days instead of hours). You could argue that someone else on the team can take on the mantle of hero. That leads us to the problems for everyone else on the team.

What would you do if there was a crisis at your job and you couldn’t do anything about it?

Those in this situation can get a feeling of helplessness. This can damage a person’s confidence which often affects job performance. People can ask the hero to teach them to solve problems as well, but why would a company prioritize that education when it won’t prioritize fixing the real problems?

More importantly, people can become reliant on the hero as well. They decide that their silo is their own work and the hero is the one who handles crises. They don’t need to learn how to help out.

This is a dangerous attitude to have for developers. The best way to prevent crises, or at least make them more manageable, is to build software that makes it easy to prevent or manage crises. How can a developer know how to properly account for a crisis if they haven’t been in one?

They can’t.

Even if the hero explains the technical details after a crisis resolves, important details will be missed. The most important one is the emotional state the hero was in after waking up at 2 AM in response to the crisis.

Developers make dozens of tiny decisions every day in their code. Many of those decisions can save precious minutes, if not hours, in a crisis. Someone who just hears about something that will help in a crisis will not truly understand its importance as much as someone who experiences the result of a bad decision while in a crisis.

For example: many developers who have not experienced a crisis will tend to write poor error messages. Their code is littered with messages like “Error occurred.” Where did the error occur? Who did it occur for? Even something as small as “Error occurred for User 123 at url /home” makes a huge difference. But someone who has never had to fix critical issues will not understand how big a difference it makes. They would have never felt the emotional impact of these seemingly small changes in their code.
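To make the difference concrete, here is a minimal sketch in Python. The names `describe_failure`, `load_home_page`, and the `fetch` callable are hypothetical and only serve the illustration; the point is that the log line records what failed, for whom, and where.

```python
import logging

logger = logging.getLogger("webapp")  # hypothetical logger name


def describe_failure(action, user_id, url, exc):
    """Build a message that says what failed, for whom, and where."""
    return (
        f"{action} failed for user {user_id} at url {url}: "
        f"{type(exc).__name__}: {exc}"
    )


def load_home_page(user_id, url, fetch):
    """Call the hypothetical fetch(user_id); log context if it fails."""
    try:
        return fetch(user_id)
    except Exception as exc:
        # The vague alternative would be: logger.error("Error occurred.")
        logger.error(describe_failure("Loading page", user_id, url, exc))
        raise  # re-raise so callers still see the original failure
```

Someone reading the log at 2 AM can start from the affected user and URL instead of searching the whole codebase for the source of “Error occurred.”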

Writing code that handles well in a crisis is an essential skill for developers. When developers rely on a hero to solve crises, they are denying themselves the opportunity to develop the skill to write better code. That will impact the company in the short term and the developers’ careers in the long term.

Lastly, there are the issues for the hero. Having the skills to save the day makes the hero valuable in a way. But going from crisis to crisis will have the hero only develop the skill to resolve crises. The hero will not develop their ability to prevent crises. If the company can’t prioritize crisis prevention, the hero won’t have time to practice crisis prevention. This affects the hero’s career because their value is tied to a single company. They have less value to another company, which affects their ability to move on if they find they are unhappy.

And they will be unhappy.

Being the hero comes with a number of quality-of-life issues. Want to take a vacation? Sure, but always have the on-call phone ready and be prepared to take out a laptop at a moment’s notice. Want to build something new and interesting? Sure, but do it in between crises. Want to make dinner plans with friends or go on a date? Sure, but be prepared to cancel. Just in case.

I’ve been both the hero and the developer reliant on the hero to save the day. I can honestly say that it is worse being the hero. The praise and the adrenaline can feel great at first, but it doesn’t last. Eventually, there is only exhaustion, resignation, and anger.

“How could that break again?!”

Resolving crises without crisis prevention has a diminishing return on growing as a developer. Eventually you just end up solving the same kind of crisis over and over. There’s no learning in that. I also ended up in a situation where I needed a crisis even though I hated it. I was so used to resolving them that I didn’t know how to function when there wasn’t a crisis. How does one go about doing work uninterrupted? So strange!

So how can we get rid of the culture of hero developers?

The idea is simple, even if implementing it can be challenging. Treat the notion of having a hero as seriously as you would a crisis. When a crisis happens, resolve it. But also take at least one step in preventing something just like it from happening again. It isn’t a guarantee of prevention, but slow progress is still progress. You will eventually get there.

Also prioritize education for the rest of the team. Involve multiple people in every crisis. Maybe that’s just investigating an issue in parallel with the hero. Maybe that’s pairing up directly with the hero. But involve them. Everyone learns better by doing. It may feel like wasted time since the hero can do things faster, but having multiple people capable of resolving future crises is worth that cost.

Neither of these steps is easy to take. It’s never easy to think about the long term when you have an emergency. But for all the reasons stated above, these steps are important. They break the vicious cycle in which a single hero solves every crisis because the hero is the only one who has ever solved a crisis. It is worth the cost.


Top comments (20)

Rix • Apr 28 '17

This should be required reading for anyone involved in software. So glad to see someone else experienced this. Literally everything you've described happened to me in my last job.

When it happened to me I actually called myself a firefighter rather than a hero though. I was actually promoted partly because of my firefighting ability, which seemed odd to me. I totally agree that if you gain notoriety fighting fires then suddenly there's not much motivation to try preventing them, and when there's no fire to fight you feel useless. In the end I wasn't able to effect the kind of changes I wanted as a manager, as I didn't have the stomach to fight the CTO (who was also a very intelligent and stubborn firefighter), and I'll admit I enjoyed the feeling of being depended on as much as I hated the stress of the problems themselves.

Sorry for the essay just wanted to say how much this resonated with me and confess my sins : )

Boris Kozorovitzky • Aug 16 '17

I used to call it firefighter too!
One time at a stand-up meeting, when I once again had to tell everyone I was solving a crisis (putting out a fire), I asked the VP to buy me a firefighter hat so that I wouldn't have to come to these meetings during a crisis. I would wear the hat and everyone would know not to bother me :D

Beekey Cheung • Apr 28 '17

I love that we almost have the exact same story!

One issue I've found is convincing CTOs and VPs of engineering who love heroes that they should aim for a different engineering culture. It's something that's hard to argue away with logic and even harder to prove.

Nick Ma • Apr 29 '17

I think it takes your own time to prove. Once you are in this kind of environment, you are required to put in extra time outside of your work duties to bring it into the team. (ex-Amazon, startup) work.

After all, if you are "wasting time" fiddling with infrastructure, why are you not doing actual sprint work?

Kasey Speakman • Apr 28 '17

This is a great article and it resonates with me as well.

One additional aspect I would add: if you are looked on as the hero, management starts to expect everything to be solved as though it were a crisis. Even new features are now presented as crises. This lends itself to clever, but unmaintainable code. It's clever because you are smart. It's unmaintainable because you are in hurry-up-and-duct-tape-it mode instead of spending time to design.

Followed to its logical conclusion, the code you develop is a fragile ball of mud and the job is unbearably stressful. Worse, a new dev from a more-healthy environment hires in, and shines a light on the fact that your code is crap. You can't believe it because your hero experience leads you to the opposite logical conclusion about your skill.

It's a setup for failure of the software as well as personal failure for the hero.

Beekey Cheung • Apr 28 '17

That's an interesting perspective. I've usually seen heroes, including me, suffer professionally mostly because we spend so much time fighting fires that we don't have time to actually build new features. We end up letting that skillset rust. What you say makes a ton of sense though.

Ben Halpern • Apr 27 '17

This is so true. We are working hard with dev.to to massage out any hero need that currently exists and set ourselves up to avoid this situation as much as possible in the long run.

Nick Ma • Apr 29 '17

Awesome, this kind of change is definitely an institutional mindset. Cheers to happier devs in the long run.

It's always sad when you propose to set up proper monitoring and logging systems, only to be shut down by management in favor of spending time on more revenue-generating projects.

Samuel Nitsche • Jan 13 '18

Another tiny note from experience:
Being the hero won't make you as valuable as you might think. A management that doesn't invest in avoiding crises is obviously inexperienced or incompetent and will not hesitate to discipline or even fire you if your criticism of the sources of the crises gets more intense.

Matt Anderson • May 28 '17

Thanks for writing this excellent piece. As a product manager, I relate a lot to the persona for "the developer reliant on the hero to save the day." Being reliant on heroes is a common position for a product manager to be in.

Beekey, what can the work prioritizer (product manager, business analyst, scrum master, etc.) do to enable multiple crisis-solvers on the team? Say Bob is our hero and Julia is our mid-level dev. Should we just have Bob and Julia tag-team the crises, or should we attempt to give some of Bob's normal non-emergency work to Julia, too?

Beekey Cheung • May 28 '17

Definitely have them pair in the crisis. That's the best way to learn.

Also make sure to do post mortems of every crisis. Talk to Bob about letting Julia provide suggestions for improvement instead of speaking right away. That'll help Julia learn as well. Create action points, and possibly stories, for crisis prevention. Treat doing at least one of those tasks as important as handling the crisis itself. This will reduce the need for crisis management.

Let me know if there are more specific situations you'd like to discuss.

Riccardo Bernardini • Dec 5 '17

" If there are known stability issues with an application being live, then they should be addressed"

Indeed. This is the real issue. Your software should be fireproof, you should not rely on having a fire-fighter at hand.

Justin Riedyk • Apr 28 '17

Excellent read, I'll pass this one around the office for sure.

John Daniel • Apr 28 '17

Don't they just call this "DevOps" now?

Beekey Cheung • Apr 28 '17

You can end up with the same problem if you have one person doing DevOps. Everything with prevention in software development also applies to sysadmin work.

John Daniel • Apr 28 '17

Sorry. I was just making a statement about how this practice has been institutionalized. I don't consider System Administration to be the same as DevOps. Why else would it have a different name? Back in my day we called it "integration engineering". It was a process without heroes.

arunxjacob • Sep 15 '17

I inherited an organization that not only had heroes, but celebrated heroic efforts. One of the hardest things to change in the culture has been reliance on specific heroes, and the general sense of helplessness felt by everyone else without those heroes in the room. Fortunately, the heroes are tired of being heroic, and once given the space to implement the kind of systemic change that is required, have done so with enthusiasm. One of them sent me this article, as he was taking some time off to visit friends. We are still working on the powerlessness/helplessness in the rest of the org, which has dropped significantly in the teams that have started to own their quality and operational integrity. We have a ways to go, but I'll take any victory. It's not the kind of 'bottom line' progress that gets celebrated in the board room - it's the kind of progress that leads to sustainable bottom line success.