July 19th was a stressful day for me. My parents were at the hospital dealing with an injury, and I was at home, waiting to see how it would go. Hearing that a massive global outage had left millions of computers unusable only added to my unease.
Thankfully, my parents returned home and the operation went successfully, but I was amongst the few that got a happy ending to their stories that day. Others were stuck at airports or hospitals, or couldn't get any work done because of their broken PCs.
The root cause of these problems was a piece of software called CrowdStrike Falcon, which pushed a faulty update that day, leaving many Windows computers in BSoD limbo. In this post, I want to analyse the situation more carefully and discuss the outage from multiple angles.
What caused the issue
CrowdStrike Falcon is marketed as the final form of an antivirus. It works as a kernel-mode driver (running in ring 0) and monitors everything that happens on the computer. Like any good antivirus, it needs malware definitions so it can differentiate between malware and legitimate software, and the C-00000291*.sys files we were instructed to delete were just that.
These malware definitions need to be updated periodically, to ensure that systems are protected as soon as a new threat is discovered. In the case of Falcon, these updates happen automatically, without much user intervention. It was one such update that caused an internal error, and therefore a crash.
The kicker is that because Falcon operates in kernel mode and loads with the system, its crash dragged down the entire operating system with it every time the system booted up. This is why you had to go into Safe Mode to delete the files, because only critical drivers are loaded there (although if you're using BitLocker and haven't backed up the recovery key, you're out of luck).
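To make the boot loop a bit more concrete, here is a toy model in Python of the explanation above. It is only a sketch: the `Driver` class, the `critical` flag and the driver names are made up for illustration, and it does not reflect how Windows actually decides which drivers to load.

```python
from dataclasses import dataclass
from typing import Callable, List


class KernelPanic(Exception):
    """Stands in for a bug check (BSoD) in this toy model."""


@dataclass
class Driver:
    name: str
    critical: bool              # hypothetical flag: required for the OS to boot at all
    init: Callable[[], None]    # code that runs when the driver is loaded


def ok() -> None:
    pass


def faulty() -> None:
    # Simulates a kernel-mode driver choking on a bad definitions file.
    raise KernelPanic("error while processing definitions")


def boot(drivers: List[Driver], safe_mode: bool = False) -> List[str]:
    """Load drivers in order. There is no isolation: one failure aborts the boot."""
    loaded = []
    for driver in drivers:
        if safe_mode and not driver.critical:
            continue            # Safe Mode skips everything that is not critical
        driver.init()           # an exception here takes the whole "OS" down
        loaded.append(driver.name)
    return loaded


DRIVERS = [
    Driver("disk", critical=True, init=ok),
    Driver("security_sensor", critical=False, init=faulty),
    Driver("display", critical=True, init=ok),
]


def try_boot(safe_mode: bool) -> str:
    try:
        return "booted with " + ", ".join(boot(DRIVERS, safe_mode))
    except KernelPanic as exc:
        return f"crashed: {exc}"


print(try_boot(safe_mode=False))  # crashes on every normal boot
print(try_boot(safe_mode=True))   # boots, so the bad files can be deleted
```

The point of the model is that nothing changes between normal boots, so the same crash repeats forever until someone removes the bad file by hand.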
Not a developer issue
Some people on Twitter are speculating that this problem was caused by memory mismanagement. While that may be true, in reality the developer team behind CrowdStrike isn't the one at fault. Instead, CrowdStrike claims it's a QA issue. In other words, the update slipped past the many tests they had implemented, which should have caught this issue. Quality assurance screwups are nothing new in the world of software, but for a company like CrowdStrike, whose software runs on millions of computers worldwide, small mistakes like this just can't be allowed to happen.
I'm probably going off on a tangent here, but this whole issue reminds me of Boeing, whose 737 MAX not only crashed twice, taking many lives with it, but also gave many people a near-death experience when a door flew off mid-flight. Of course, the two companies operate in completely different sectors, but what they have in common is that in these sectors even the smallest negligence can be catastrophic. It's these small mistakes that proved costly for both companies, whether we're talking about money, lost time or human lives.
The damages are estimated at more than $5 billion, with 8.5 million computers affected in total, and the company's response of $10 gift cards (which didn't work anyway) only rubbed salt in the wound.
At the moment, CrowdStrike claims that 97% of affected systems have recovered, but the remaining 3% of 8.5 million is still roughly 255,000 systems, and to think that so many are still inoperable is quite unsettling.
Fun fact about CrowdStrike
As an added bonus, it turns out that the current CEO of CrowdStrike, George Kurtz, was also McAfee's CTO in 2010, when that company was dealing with an outage of similar proportions that left many Windows XP computers broken.
Why Microsoft is also at fault
While the blame falls mostly on CrowdStrike, some are also pointing fingers at Microsoft, because they architected Windows in a way that allowed for such an outage. In response, Microsoft passed the blame on to the EU, which forced them to open up their OS as part of an antitrust settlement. The EU later denied any responsibility for the outage.
While it's understandable why the EU would force such a move on Microsoft, to prevent them from having an unfair advantage, there are many things Microsoft could have done better to prevent this problem. Here are a couple of examples off the top of my head (keep in mind I'm not a kernel/driver developer):
- Handling drivers preventing Windows from booting
- Better recovery from system crashes
- Ensuring drivers don't take the entire OS down with them, especially ones that are not critical to the system
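As a rough illustration of the first two points, here is a sketch in Python of a boot-loop guard. Everything in it is hypothetical (the `BootState` record, the `critical` flag, the `MAX_FAILED_BOOTS` threshold), and I'm not claiming Windows works or could easily work this way. The idea is simply to note which driver is being loaded before loading it, so that after a crash the next boot knows who the likely culprit was and can skip it once it has failed too many times.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_FAILED_BOOTS = 3  # hypothetical threshold before a driver gets skipped


class Crash(Exception):
    """Simulates the machine dying in the middle of a boot."""


@dataclass
class Driver:
    name: str
    critical: bool              # hypothetical flag: the OS cannot run without it
    init: Callable[[], None]


@dataclass
class BootState:
    """State that would have to survive reboots (kept in memory in this sketch)."""
    in_progress: Optional[str] = None
    failures: Dict[str, int] = field(default_factory=dict)


def attempt_boot(drivers: List[Driver], state: BootState) -> None:
    # A marker left over from the previous boot means that boot died mid-load.
    if state.in_progress is not None:
        culprit = state.in_progress
        state.failures[culprit] = state.failures.get(culprit, 0) + 1
        state.in_progress = None
    for driver in drivers:
        too_many = state.failures.get(driver.name, 0) >= MAX_FAILED_BOOTS
        if too_many and not driver.critical:
            continue                        # skip the repeat offender
        state.in_progress = driver.name     # record before loading
        driver.init()                       # may raise Crash
        state.in_progress = None


def boot_until_up(drivers: List[Driver], state: BootState, max_attempts: int = 10) -> int:
    """Reboot after every crash and return the attempt number that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_boot(drivers, state)
            return attempt
        except Crash:
            continue
    raise RuntimeError("still stuck in a boot loop")


def faulty() -> None:
    raise Crash()


drivers = [
    Driver("disk", True, lambda: None),
    Driver("security_sensor", False, faulty),
    Driver("display", True, lambda: None),
]
state = BootState()
attempts = boot_until_up(drivers, state)
print(f"system came up on boot attempt {attempts}")
```

Whether skipping a security driver automatically is acceptable is a separate question, since an attacker could try to crash it on purpose, so treat this as an illustration of the idea and nothing more.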
The ironic thing about this is that Microsoft DOES acknowledge that software should move away from kernel space and is actively working to make this happen, with features like VBS enclaves or Azure Attestation. So to me it looks like blaming the EU was just a way to move pressure away from themselves.
Anticheats are also problematic
Some users will point out that anticheats work in a similar way to CrowdStrike, in that they use kernel-mode drivers that start up with the system and monitor it for bad behaviour (cheating in this case). This raises the question of whether another outage, with gamers this time around, could happen, and I definitely see all the ingredients needed for such a disaster.
For one, game companies are focused more on keeping their game revenue up than on the safety of their software, and as a result these companies don't have all the experience and attention to detail needed to develop such low-level software. And indeed, anticheat software has caused issues before, like when Vanguard was first introduced into League of Legends and PCs were dropping like flies.
YouTubers like Low Level Learning and Gardiner Bryant have made videos going into more detail about this issue, and I recommend you check those out too. This post focuses on the CrowdStrike outage, but I thought this was relevant here.
What about other platforms?
The keen-eyed among you will point out that CrowdStrike Falcon is also available on Linux and macOS. Thankfully, those platforms were not affected that day, but that doesn't mean that they are immune to such problems. In fact, Falcon did cause some issues on Linux in April, though they were less severe than on Windows. Just imagine if all Linux systems were affected: International No-Internet Day wouldn't even begin to cover the gravity of the situation.
Closing thoughts
The CrowdStrike outage has affected many industries worldwide, and there are a few important things we learned. The most important one is that having one company like CrowdStrike control the security of most enterprises is quite a bad idea, and we just saw the implications of that. But also, having code that can access a system at such low levels is a disaster waiting to happen, and we should move away from software with this much access over one's system.