
What we can learn from the #CrowdStrike meltdown.

CodeWithCaen on July 19, 2024

A short reflection on the recent CrowdStrike IT disaster. I feel like there's a lot to learn from the #CrowdStrike meltdown where a bug ...
 
Ben Halpern

Thanks for the post

CodeWithCaen

Glad you liked it! Feel free to share it if you think others would find it interesting :)

Jenny Phan

This is exactly what I have been telling people. This should have been caught in QA testing, and after the production deployment there should have been verification that it was working; if not, roll back immediately. If you're a provider of software for any company, you should be doing phased rollouts and automated testing. I'm wondering: was this test scenario just missed? Seems like a big one :)
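As a rough sketch of that phased-rollout-plus-verification idea (the ring names, thresholds, and the deploy/telemetry/rollback helpers below are all hypothetical, not any vendor's real API):

```python
import random
import time

# Hypothetical sketch: push to a small ring first, verify health,
# and roll back automatically if the error rate spikes.
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 1.00)]  # fraction of fleet
ERROR_THRESHOLD = 0.02
SOAK_SECONDS = 1  # shortened for the example; hours in real life

def deploy(ring: str, fraction: float) -> None:
    print(f"deploying to {ring} ({fraction:.0%} of fleet)")

def health_error_rate(ring: str) -> float:
    return random.uniform(0.0, 0.03)  # stand-in for real telemetry

def rollback(ring: str) -> None:
    print(f"rolling back {ring}")

def phased_rollout() -> bool:
    for ring, fraction in RINGS:
        deploy(ring, fraction)
        time.sleep(SOAK_SECONDS)      # let telemetry accumulate
        if health_error_rate(ring) > ERROR_THRESHOLD:
            rollback(ring)            # stop before touching the next ring
            return False
    return True

if __name__ == "__main__":
    print("rollout succeeded" if phased_rollout() else "rollout halted")
```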

Jovian Hersemeule

Devs making software are never perfect, so bugs happen. Testers doing QA are never perfect either, so bugs can remain unspotted and make it into production. IMHO, it's very hard to measure the quality of your tests. If you have no bugs, you can only measure how expensive it is to write and maintain the tests. When you wreak havoc in production, it suggests there is a flaw in your QA. My point is: it's easier to measure test quality when it is low.

Eduardo Patrick

Thanks for sharing. I really appreciated the points mentioned; they remind us how careful we need to be in many aspects. Even with a robust delivery system and a great team, shit happens, and we need to be prepared to handle it in the best way possible.

Red Ochsenbein (he/him) • Edited

Also. Sometimes the cure is worse than the problem.

Charles F. Munat

Did you mean "worse"?

Red Ochsenbein (he/him)

Yes. Thanks.

S. Ben Ali

I heard on some news channel that they couldn't test the update for every machine out there, since there are so many makes and models: Windows PCs, servers, workstations, and whatnot. Something to that effect. I'm not defending them, just stating what was said.

Then again, I'm totally for incremental delivery/rollouts. Maybe they should have targeted a specific zone first.

CodeWithCaen

Thanks for giving this context; I do disagree with them, though. They provide services to critical parts of society and evidently have the power to grind our lives to a halt. They have a responsibility to ensure they have adequate measures in place to prevent these things from happening. That's the bare minimum.

They are (were?) a multi-billion-dollar corporation. They should have a fleet of hundreds, if not thousands, of the most common devices they deploy to, so they can actually test their software before releasing it.

Liz Wait

As a tester, I had very similar questions! Thanks for the post.

Matthew O. Persico

The big problem here is Windows Update and how it is configured. In a corporate environment, Windows Update should be configured to go against a local update server. The local update server then calls home for the various infrastructure and software updates. That way you can stage the update yourself across your fleet.
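For reference, a minimal sketch of what pointing clients at an internal update server looks like, assuming the standard WSUS policy registry values and a made-up internal server URL (in practice this is usually pushed via Group Policy rather than a script):

```python
import winreg

# Assumed internal WSUS server URL, for illustration only.
INTERNAL_SERVER = "http://wsus.internal.example:8530"

WU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"
AU_KEY = WU_KEY + r"\AU"

# Point Windows Update at the local server (requires admin rights).
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, WU_KEY) as key:
    winreg.SetValueEx(key, "WUServer", 0, winreg.REG_SZ, INTERNAL_SERVER)
    winreg.SetValueEx(key, "WUStatusServer", 0, winreg.REG_SZ, INTERNAL_SERVER)

# Tell the Automatic Updates client to actually use it.
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, AU_KEY) as key:
    winreg.SetValueEx(key, "UseWUServer", 0, winreg.REG_DWORD, 1)
```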

Lucas@QAComet

It's amazing how transparent CrowdStrike has been in their post-incident report. I'm glad to see they are now implementing a comprehensive QA process.

CodeWithCaen

Feels a bit too late, don't you think?

Lucas@QAComet

Don't get me wrong, the damage has been done and I'm sure we'll continue to see new updates about the fallout. I just see this as a positive because it sets some standards for publicly traded companies to reference.

Theo Oliveira • Edited

Bro, what test? From the analysis they did on Twitter, the file was only a bunch of zeros. That makes no sense at all.

Lucas@QAComet • Edited

It turns out the null bytes were caused by CrowdStrike's crash in the middle of the update; the content update they released wasn't actually full of null bytes. What happened was they released a configuration file that caused an out-of-bounds memory error, setting off a chain of odd behavior on the computer.
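To illustrate that failure mode in the abstract (this is not CrowdStrike's actual channel-file format or code, just a sketch of the kind of bounds check that turns a malformed update into a clean error instead of an out-of-bounds read):

```python
import struct

# Hypothetical format: a 4-byte little-endian entry count followed by
# fixed-size 8-byte entries. If the file claims more entries than it
# actually contains, naive code indexes past the end of the buffer,
# which in a kernel driver is a crash.
def parse_content_update(blob: bytes) -> list[bytes]:
    (count,) = struct.unpack_from("<I", blob, 0)
    entries, offset = [], 4
    for _ in range(count):
        if offset + 8 > len(blob):  # the bounds check the naive version skips
            raise ValueError("update declares more entries than it contains")
        entries.append(blob[offset:offset + 8])
        offset += 8
    return entries

# Well-formed: count=2 with two entries present.
good = struct.pack("<I", 2) + b"A" * 8 + b"B" * 8
print(parse_content_update(good))

# Malformed: claims 5 entries but ships only 1 -> rejected cleanly.
bad = struct.pack("<I", 5) + b"A" * 8
try:
    parse_content_update(bad)
except ValueError as err:
    print("rejected:", err)
```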

CodeWithCaen

Link?

Michael Mior

The retro posted by CrowdStrike seems to indicate that wasn't the problem.