Unless you were living off-grid and enjoying the serenity of not being connected to the digital world, you would have heard of, or been impacted by, the largest IT disruption the world has experienced (so far…). The question is, how did this happen and what can we learn from it?
What happened?
At around 4am UTC on Friday 19 July 2024 (midnight in New York, 2pm in Sydney), Microsoft Windows-based computers started crashing. The now-infamous Blue Screen of Death (BSOD) started appearing on millions of computers around the world. Airport screens turned blue, ATMs and major retailers' point-of-sale systems froze, and critical systems stopped working. It was chaos. Yet it quickly became apparent that this was not a Microsoft Windows bug or a cyberattack, but something more mundane. The cybersecurity company CrowdStrike announced that an update had triggered a bug in its software, causing the glitch.
It Happened So Fast!
The speed at which systems started failing was incredible. Even though CrowdStrike removed the faulty update within 90 minutes, the damage was done: approximately 8.5 million systems had already received and applied it. Given the CrowdStrike Falcon sensor is designed to mitigate any new cybersecurity threat quickly, we have to acknowledge that the delivery system performed extremely well. If this had been a real threat that required an update to mitigate, the system would have done its designed job with astounding speed.
The problem was that this particular update was the threat. Ouch. 😖
How Do You Fix a Brick?
In software circles, when a computer system becomes inoperable due to a software update gone bad, it is described as being bricked: your fancy electronic device is about as useful as a brick. Quite apt in this case. Normally, if an update fails, your computer might abruptly restart, or the software simply stops working. In this case, the software that failed was classed as critical to the operation of the system, meaning that when it failed, it forced the entire system to fail. Now you have a brick that refuses to do anything other than sit there.
Fortunately, a fix was quickly circulated that allowed these bricked systems to be revived. The problem with the fix was that it required physical access to every affected computer so it could be placed into a special recovery mode and the offending files removed. The process was fairly quick and simple, even for novice users. It was just the scale of the problem that made full recovery daunting.
Why Did this Happen?
Now that the dust is settling, we are starting to understand how and why this happened. A bug residing in a complex piece of software was activated by a software update.
Developing software is hard.
Firstly, the software developed by CrowdStrike is very complex and requires a team of software developers to coordinate their efforts to create it. It is also developed in a very powerful programming language, C++, which allows you to control and manipulate a computer system very precisely. This is important for software that needs to operate at a very low level. You just need to be very careful, as this language doesn't include many built-in protections.
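To make that concrete, here is a deliberately simplified, entirely hypothetical C++ snippet (not anything from CrowdStrike's codebase) showing the kind of raw memory access the language will happily compile without any guard rails:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // A four-byte buffer, and an index that points well outside it.
    std::uint8_t buffer[4] = {1, 2, 3, 4};
    std::size_t index = 7;  // hypothetical out-of-range index

    // C++ compiles this without complaint. There is no built-in bounds check,
    // so the read is undefined behaviour: it might return garbage, crash, or
    // appear to work. The language leaves the safety checks entirely to you.
    std::cout << static_cast<int>(buffer[index]) << "\n";
    return 0;
}
```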
With great power comes great responsibility.
Secondly, cybersecurity software needs to operate at a low level within an operating system to be able to detect and block the nasty stuff hackers and threat actors try to do. You also don't want this software to be tampered with so that it stops working. There is some discussion in cybersecurity development circles as to whether such a system really needs to operate at such a privileged layer (the kernel) to achieve its operational goals. Yet, that is quite common for cybersecurity products.
Sorry Dave, I’m afraid I can’t do that.
Lastly, the bug that was triggered had been lying dormant within the CrowdStrike Falcon sensor for many months, but was never actually exercised until last week. The published update turned this problematic code on for the first time. The code had been created to detect a new type of malware threat. Unfortunately, one part of that code tried to access a piece of memory that did not exist, and a null pointer error was triggered. Because this software is marked as required by Windows, the crash killed the system and then stopped it from starting again.
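As an illustration only (this is not CrowdStrike's code), here is a minimal sketch of that class of bug: a lookup fails, hands back a null pointer, and the caller uses the result without checking it:

```cpp
#include <iostream>

// Hypothetical structure and lookup, invented purely for illustration.
struct ThreatRecord {
    int severity;
};

ThreatRecord* findRecord(bool found) {
    static ThreatRecord record{5};
    return found ? &record : nullptr;  // signals "not found" with a null pointer
}

int main() {
    ThreatRecord* record = findRecord(false);  // the lookup fails

    // The missing safety check: dereferencing a null pointer is undefined
    // behaviour and, in practice, an immediate crash. In an ordinary program
    // only the process dies; in kernel-level code the whole system goes down.
    std::cout << record->severity << "\n";
    return 0;
}
```

A one-line null check before the dereference would have caught it; the point is that C++ will not add that check for you.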
Didn’t They Test It?
If you are thinking that this seems like a silly bug that a little bit of testing should have caught, you would be right. The failure here does point to an issue with CrowdStrike's Quality Assurance (QA) processes. The thing is that modern software development is complex and relies on automated testing systems. Initial analysis by CrowdStrike indicates that this particular module was reporting a pass during testing when it should have failed.
The update deployed was a simple content update, not new code. Because of this, it would have been deemed low-risk and not undergone rigorous testing. After all, the software itself (the code that could, and did, fail) was already running on millions of Windows systems worldwide without issue, so it was deemed safe. What could possibly go wrong?
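One obvious mitigation is to validate content before it ships: check that an update file has the shape the already-deployed code expects, and refuse to release it otherwise. A hypothetical sketch of that kind of check (the field counts and names here are invented, not CrowdStrike's actual format) might look like this:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical: the number of fields the already-deployed code expects.
constexpr std::size_t kExpectedFieldCount = 20;

// Returns an error message if the content update looks malformed.
std::optional<std::string> validateContent(const std::vector<std::string>& fields) {
    if (fields.size() != kExpectedFieldCount) {
        return "field count mismatch: expected " + std::to_string(kExpectedFieldCount) +
               ", got " + std::to_string(fields.size());
    }
    return std::nullopt;  // structurally sound, safe to ship
}

int main() {
    // A malformed content update with one field too many.
    std::vector<std::string> update(21, "value");

    if (auto error = validateContent(update)) {
        std::cout << "Reject deployment: " << *error << "\n";
        return 1;  // fail the release pipeline instead of shipping the update
    }
    std::cout << "Content passed validation\n";
    return 0;
}
```

The check itself is trivial; the lesson is that "content only" updates deserve the same scrutiny as code, because the code that consumes them can still fail.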
What Can We Learn From This?
We can learn many things.
- Our digital world is incredibly complex and fragile. A single mistake can have wide-ranging and potentially catastrophic effects.
- Quality assurance is not optional. Test everything and make sure it works BEFORE you send it out into the world!
- Even if we don’t make a mistake, someone else might and that could cause our systems and business to fail. We need to ask our partners and suppliers how they are mitigating failure.