Even the Big Ones Mess Up

Daniel Starner ・ 2 min read

This morning, my feed blew up due to my friends and family complaining about some new Instagram overhaul. Apparently, their feeds were scrolling horizontally instead of the usual vertically. It turns out that this was just one big Oops on the part of Instagram's engineering team, as tweeted by their Head of Engineering.

When stuff like this happens, it's funny to see how quickly the reactions both for and against the change propagate across social media, but in another way, it makes me feel better about being a developer.

If They Can Mess Up, so Can You

Stories like watching Instagram accidentally over-scale their testing of the new UI, or Amazon's Alexa crashing on Christmas due to the influx of new devices, make me realize that no matter how big or powerful these companies are, they are still run by humans, and humans miscalculate and make mistakes.

So if Instagram or Amazon can make these mistakes, why do I give myself so much trouble for writing buggy code sometimes? No one can see all the use cases and outcomes of running their software, and mistakes do happen.

You, me, Amazon or Instagram...we will never write perfect software or always get things right, because there is no right way or perfect software. Whatever works for you, your team, or your company at the time is good enough until you have to make modifications for new user/edge cases.

If we as developers kept programming until we thought our code was "perfect", then either it wouldn't really be perfect, or we'd never finish it! Design and plan ahead as best as possible, but don't beat yourself up for writing buggy code, because it's natural and just happens. If there weren't any buggy code or systems to fix, a lot of engineers would be out of jobs 🙈

I will not make mistakes

No matter how much we prepare, we will make mistakes, that's just a part of life. What matters is how we face those mistakes and issues, and the tenacity we bring to making software better. These ideas scale, whether we are just solo developers, or are the big companies like Instagram.

So what are your thoughts? How much fuss should we give companies this big when they make mistakes? What are your thoughts on writing code that walks the line between being good and being good enough? What are your thoughts on the balance between tech debt and time to release?

Daniel Starner

@dstarner

I am a curious person who enjoys figuring out the building blocks of the world, and rearranging them to build something even better. My career is developing software, but my life is adventuring.

Discussion


Every time I encounter a 500 error in the wild I breathe a sigh of relief.

 

No one is saying that mistakes are meaningless or can be swept away without concern. Mistakes are action items and opportunities to grow as professionals and as individuals.

Consider this excerpt from @dstarner 's original post:

No matter how much we prepare, we will make mistakes, that's just a part of life. What matters is how we face those mistakes and issues, and the tenacity we bring to making software better.

Having lofty expectations for yourself is fine, but for a lot of people, failing to reach those lofty personal expectations leads to severe stress and anxiety. Learning to cope with your mistakes and grow is an important personal development skill. Many people can be paralyzed for fear of making a mistake, and this post (as I understood it) seeks to alleviate some of that paralysis.

 

I agree with this completely. I'm pretty sure you just summed up my thoughts better than my post did 😂

 

Modern medicine is built upon trial and error. LOTS of mistakes made. People died because of those mistakes.

Think about what a doctor does and what a software developer does. It is very much a debugging exercise. "What hurts? When did it start? What factors surrounded this happening? Does it hurt when you do this? Or only this?"

They're debugging you. And when pressed, they'll admit: they're doing what they can with the info they have and there is no 100% solution most likely. They are taking guesses; educated guesses of course. But guesses.

 

I don't think I have seen a better explanation of what a doctor does. Entirely on point, and to be honest, it makes me even more grateful people are willing to pick up the role to debug us, humans.

Fact is that it is impossible not to make mistakes, whatever it may be we set out to do. And the world as we know it is built by mistakes and takeaways from them. Without error, we would still be playing it safe in caves with flint fire.

We have to mention, though, that the level of responsibility is vastly higher for doctors than for software developers, or any other profession, really.

 

Lots of great questions at the end there and I think for a lot of cases "it depends" e.g. a mistake with software that puts someone on the moon vs Instagram swiping issues.

In general I think we ought to give each other a bit of a break and focus on collaborating more to make things better.

I have a good laugh every time I run into a bug or server issues out on the internet because I've been there. It happens.

Then you will be eternally frustrated.

Doctors were harder on themselves than patients were when it came to judging their ability to minimize the pain, discomfort, or disability caused by a condition. Only 37 percent of physicians thought they were "very" effective, though 60 percent more thought they were "somewhat" effective. But 79 percent of patients said their doctor helped to minimize their pain or discomfort. -- Consumer Reports

You're suffering from contempt born of familiarity. You know everything that's wrong in the software world, and nothing about how messed up the medical world is. How 30% of new doctors suffer from depression. And, speaking from personal experience, how freaking elitist an MD can be. Nor have you conversed with people who have chronic illnesses and have taken tons of different drugs with various side-effects and the doctors just move on to the next; I'm not sure if they publicly exist, or are kept behind closed doors, but "everybody makes mistakes" is obviously the norm in that world.

"The only doctor who never loses a patient is one who doesn't try to heal." That said, I don't want to be overly critical of the medical field; I'm not remotely familiar enough with everything going on there to give a truly informed critique beyond the surface level.

 

I'd be interested to see how long it took them to recover. If we are truly in the blessed days of DevOps and Business Agility, then their Mean-Time-To-Recover is all that really matters, yeah?

That being said, as long as you figure out the root issue and put a test in place to make sure that doesn't happen again as part of the deployment pipeline, then it was actually time well spent.
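For anyone who wants to put a number on that, here is a minimal Python sketch of how Mean-Time-To-Recover could be computed. The function name `mean_time_to_recover` and the incident timestamps are made up purely for illustration (nothing here comes from Instagram or Amazon), and it assumes "recovery time" means the gap between detecting an incident and resolving it.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up incidents for illustration only, not real outage data.
incidents = [
    (datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 1, 9, 30)),    # 30 minutes
    (datetime(2020, 1, 2, 14, 0), datetime(2020, 1, 2, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_recover(incidents)
print(mttr)
```

Tracking this per release or per quarter is one way to see whether recovery is actually getting faster, which is the part the comment above cares about.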

 

You should always make a big fuss about mistakes which happen in production. Of course mistakes happen, but why didn't this mistake happen in Test or Acceptance? What do you need to fix in the way you go to production so that this does not happen again next time?

Quite often production issues are the result of a "good enough" mentality. The quality should be good; it does not have to be perfect. Something is good when it works, and you have proof that it works. It might not be the best in performance or scalability, but you know where its limitations are. And here is where something that is good can still fail in production. For example, in the case of the Amazon issue they probably had an incorrect estimate of the surge of new devices. And that's fine. But if they did not even consider it and test for it, then they deserve all the fuss that should be made about it.

In production there should only be two cases of issues which are (kind of) acceptable.

  1. "Oh fuck!" Usually the result of somebody performing an explicit action, like deleting a file on the wrong server.
  2. "That's interesting." Something happens which defies the world as you have defined it. These are usually the result of a user performing a combination or series of actions which were not accounted for in the logic.

Both these issues are not really solvable. You can only reduce the number of occurrences. This is what defines your software/process maturity.
You can attempt to expose these problems by employing things like chaos engineering and fuzz testing. But that only gets you so far. In fuzz testing you generally only try to find the edge cases of a single unit. But for the "That's interesting" you probably need to invoke a whole series of edge cases.
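To make the difference between fuzzing a single unit and fuzzing a series of actions concrete, here is a rough Python sketch using only the standard library. The `Counter` class, its "stock never goes negative" invariant, and both fuzz functions are hypothetical and invented for this example; a real project would more likely reach for a dedicated fuzzing or property-based testing tool.

```python
import random


class Counter:
    """Hypothetical stateful unit: a stock counter that must never go negative."""

    def __init__(self):
        self.stock = 0

    def add(self, n):
        self.stock += n

    def remove(self, n):
        # Refuse removals that would drive the stock below zero.
        if n <= self.stock:
            self.stock -= n


def fuzz_single_call(rng, runs=1000):
    """Fuzz one unit in isolation: a single random call on a fresh object."""
    for _ in range(runs):
        c = Counter()
        c.remove(rng.randint(0, 100))
        assert c.stock >= 0
    return runs


def fuzz_sequences(rng, runs=200, max_len=20):
    """Fuzz random series of actions against the same object."""
    for _ in range(runs):
        c = Counter()
        for _ in range(rng.randint(1, max_len)):
            action = rng.choice(["add", "remove"])
            getattr(c, action)(rng.randint(0, 100))
            # The invariant is checked after every step of the sequence.
            assert c.stock >= 0, "invariant broken by a sequence of actions"
    return runs


print(fuzz_single_call(random.Random(0)), fuzz_sequences(random.Random(1)))
```

The first function only ever sees one call at a time, so it can only find edge cases of that one call. The second one walks through random combinations of actions, which is closer to what a "That's interesting" incident looks like, though it still only covers the action sequences it happens to generate.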

 

100% this. ^

I'm now changing incident report reasons to "Oh Fuck" and "That's interesting...".

So. Much. Yes.

 

This is one of the best reads to close the year with

 
 

You, me, Amazon or Instagram...we will never write perfect software or always get things right, because there is no right way or perfect software.

I don't want to discourage you (actually, I want to encourage you), but there are people who build perfect software. And that's not just the tutorial "Hello world!", but enterprise-level software with zero defects, built on time and on budget.

They just made lots more mistakes and improved their approach faster and cut things short to just one methodology. And it works.

Even better, these people can teach you how to do it too. And its essence is very simple, but not easy.

But I agree, all people mess up. And it's good to realize this fact.