A short reflection on the recent CrowdStrike IT disaster.
I feel like there's a lot to learn from the #CrowdStrike meltdown, where a bug in a software update caused havoc across the world. Here's what immediately comes to mind, both from the company's perspective and from ours as a society.
1. Don't put all your update eggs in one basket.
If you're a global service provider, unless you're shipping a critical security patch, do you really need a simultaneous global rollout, or can you release in batches? That way, if something goes wrong, you limit the fallout.
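To make that concrete, here's a minimal sketch of what a ring-based rollout gate could look like. The stage percentages, the host-ID hashing, and the function names are all made up for illustration; this isn't how CrowdStrike (or any particular vendor) actually does it.

```python
import hashlib

# Hypothetical rollout rings: each stage covers a cumulative percentage
# of the fleet. Real systems would also gate on health metrics between stages.
ROLLOUT_STAGES = [
    ("canary", 1),     # 1% of hosts get the update first
    ("early", 10),     # then 10%
    ("broad", 50),     # then half the fleet
    ("global", 100),   # finally everyone
]

def host_bucket(host_id: str) -> int:
    """Map a host to a stable bucket in [0, 100) using a hash of its ID."""
    digest = hashlib.sha256(host_id.encode()).hexdigest()
    return int(digest, 16) % 100

def should_receive_update(host_id: str, current_stage: str) -> bool:
    """Return True if this host falls inside the currently active rollout ring."""
    for stage, percent in ROLLOUT_STAGES:
        if stage == current_stage:
            return host_bucket(host_id) < percent
    return False

# Example: during the "canary" stage only ~1% of hosts are served the update.
print(should_receive_update("host-0042", "canary"))
```

The point isn't the specific numbers; it's that a bad update hits 1% of machines before it ever gets the chance to hit 100%.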
2. The importance of testing, and being responsible.
Why did the faulty code pass the CI/CD checks? As an avid software tester, I can't help but wonder how CrowdStrike's systems are set up. If you are a service provider for critical societal infrastructure like hospitals and aviation, you have a responsibility to run updates through solid testing pipelines before releasing them. Of course, things can still slip through, but the scale of this failure makes me question the robustness of their delivery pipelines.
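For illustration, here's the kind of pre-release sanity check I have in mind: refuse to ship a content file that is empty, all zeros, or missing its expected header. The file format, the EXPECTED_MAGIC constant, and the directory layout are entirely hypothetical; I have no insight into CrowdStrike's actual pipeline.

```python
from pathlib import Path

# Hypothetical magic header for an imaginary content-file format.
EXPECTED_MAGIC = b"CNT1"

def validate_content_file(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    data = path.read_bytes()

    if len(data) == 0:
        problems.append("file is empty")
    elif all(b == 0 for b in data):
        problems.append("file contains only null bytes")
    elif not data.startswith(EXPECTED_MAGIC):
        problems.append("missing expected magic header")

    return problems

def ci_gate(paths: list[Path]) -> None:
    """Fail the pipeline if any content file looks malformed."""
    failures = {p: validate_content_file(p) for p in paths}
    failures = {p: probs for p, probs in failures.items() if probs}
    if failures:
        for path, probs in failures.items():
            print(f"REJECTED {path}: {', '.join(probs)}")
        raise SystemExit(1)
    print("All content files passed the pre-release checks.")

if __name__ == "__main__":
    ci_gate(list(Path("release/").glob("*.bin")))
```

A check this simple is obviously not a full test suite, but it's the sort of cheap gate that should sit in front of any release that gets pushed to millions of machines.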
3. The single point of failure problem.
Why aren't we more concerned about relying on single points of failure? It's frightening, quite honestly, how much chaos a single mistake can cause, and the direction we're heading in is even more worrying. Our industry keeps moving more and more of its critical functions onto outside providers, and now everyone is paying the price.
Looking to the future.
This essay got dark. So let's try to end it on a more positive note. What can we as IT professionals do to prevent this from happening again? How can we make sure that we're not the next #CrowdStrike? I think that's a conversation worth having.
For now, I'm going to go back to my testing, and make sure that my code is as solid as it can be. I hope you do the same. And do let me know your thoughts on this whole situation. I'm curious to hear what you think and what you would do differently.
Top comments (19)
Thanks for the post
Glad you liked it! Feel free to share it if you think others would find it interesting :)
This is exactly what I have been telling people. This should have been caught in QA testing, and after the production deployment there should have been verification that it was working; if not, roll back immediately. If you're a provider of software for any company, you should be doing phased rollouts and automated testing. I'm wondering whether this test scenario was just missed? Seems like a big one :)
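That verify-then-rollback loop can be as simple as the sketch below; deploy, health_check, and rollback are just placeholders for whatever your deployment and monitoring tooling actually provides.

```python
import time

def deploy(version: str) -> None:
    """Placeholder: push the given version to the current rollout ring."""
    print(f"deploying {version}")

def health_check() -> bool:
    """Placeholder: query monitoring for error rates, crash reports, etc."""
    return True

def rollback(version: str) -> None:
    """Placeholder: revert the fleet to the last known-good version."""
    print(f"rolling back to {version}")

def deploy_with_verification(new_version: str, last_good: str,
                             checks: int = 5, interval_s: int = 60) -> bool:
    """Deploy, then watch health for a while; roll back on the first failure."""
    deploy(new_version)
    for _ in range(checks):
        time.sleep(interval_s)
        if not health_check():
            rollback(last_good)
            return False
    return True
```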
Devs are never perfect, so bugs happen. Testers doing QA are never perfect either, so bugs can remain unspotted and make it into production. IMHO, it's very hard to measure the quality of your tests. If you have no bugs, you can only measure how expensive your tests are to write and maintain. When you wreak havoc in production, it suggests there's a flaw in your QA. My point is: it's easier to measure test quality when it is low.
Thanks for sharing, I really appreciate the points you raised. They remind us that we need to be careful in a lot of areas: even with a robust delivery system and a great team, shit happens, and we need to be prepared to handle it in the best way possible.
Also. Sometimes the cure is worse than the problem.
I heard on some news channel that they couldn't test the update on every machine out there, since there are so many makes and models: Windows PCs, servers, workstations, and whatnot. Something to that effect. I'm not defending them, just stating what I heard.
Then again, I’m totally for the incremental delivery/rollout. Maybe they should’ve targeted a specific zone first.
Thanks for giving this context. I do disagree with them, though. They provide services to critical parts of society and evidently have the power to grind our lives to a halt. They have a responsibility to ensure they have adequate measures to prevent these things from happening. That's the bare minimum.
They are (were?) a multi-billion-dollar corporation. They should have a test fleet covering the hundreds, if not thousands, of the most common device configurations they provide services to, so they can actually test their software before releasing it.
As a tester, I had very similar questions! Thanks for the post.
The big problem here is Windows Update and how it is configured. In a corporate environment, Windows Update should be configured to go against a local update server, which in turn calls home for the various pieces of infrastructure and software updates. Then you can stage the update yourself across your fleet.
It's amazing how transparent CrowdStrike has been in their post-incident report. I'm glad to see they are now implementing a comprehensive QA process.
Feels a bit too late don't you think?
Don't get me wrong, the damage has been done and I'm sure we'll continue to see new updates about the fallout. I just see this as a positive because it sets some standards for publicly traded companies to reference.
Bro, what test? According to the analysis people posted on Twitter, the file was only a bunch of zeros. That makes no sense at all.
Turns out the null bytes were caused by the crash happening in the middle of the update; the content update they released wasn't actually full of null bytes. What happened was they released a configuration file that triggered an out-of-bounds memory read, setting off a chain of odd behavior on the machine.
Link?
The retro posted by CrowdStrike seems to indicate that wasn't the problem.