The CrowdStrike Incident: What Happened, How It Failed, and How to Avoid It in the Future

On Friday, July 19th, a faulty update to CrowdStrike’s Falcon platform triggered extensive disruptions in Windows systems, resulting in global computer outages lasting several hours. As recovery efforts continue and investigations into the root causes progress, I would like to take a moment to summarize the incident by focusing on three key areas: what happened, how it failed, and how to prevent a similar issue in the future.

What Happened

Let’s walk through the incident as a timeline.

  1. On July 19, 2024, at 04:09 UTC, CrowdStrike pushed out an update to its endpoint protection platform, Falcon, intended to detect newly observed malicious named pipes used by common C2 frameworks in cyberattacks.
  2. The update included a channel file named C-00000291*.sys, which specifies monitoring and response rules for the Falcon sensor. This file contained a logic error that triggered an invalid memory access when the sensor processed it.
  3. When the update was deployed to Windows systems, the channel file was processed by the Falcon sensor, which runs at the kernel level (security software needs to detect malicious activity as early as possible, even during the boot phase). The file’s logic error triggered the memory fault inside a kernel-mode driver. Windows treats an unhandled fault in kernel mode as a critical failure that could compromise system stability, so it halts the machine with a Blue Screen of Death (BSOD) to prevent further damage. (A conceptual sketch of this failure mode follows the timeline.)
  4. Because Falcon is deployed on a massive number of Windows machines, the failure spread quickly: an estimated 8.5 million computers experienced BSODs, resulting in widespread IT outages across business sectors globally.
  5. CrowdStrike released a fix on July 19, 2024, at 05:27 UTC. Affected systems then began to recover gradually, although some machines remained non-operational because recovery required physically accessing them to remove the faulty file and reboot.
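
The exact defect is still under investigation, but the failure mode can be illustrated conceptually with the toy sketch below: detection content is parsed by code that assumes a particular shape, and a logic error turns unexpected content into an unhandled fault. The file format, rule layout, and field names here are purely hypothetical, not CrowdStrike’s.

```python
# Toy illustration only: a "channel file" parser with a logic error.
# The pipe-delimited rule format is a made-up stand-in for real detection content.

def load_rules(channel_file_lines):
    """Parse rules of the hypothetical form 'id|pattern|action'."""
    rules = []
    for line in channel_file_lines:
        fields = line.strip().split("|")
        # Logic error: the code assumes every rule always has three fields,
        # so a malformed line makes fields[2] an out-of-range access.
        rules.append({"id": fields[0], "pattern": fields[1], "action": fields[2]})
    return rules

good_content = [r"291-A|\\.\pipe\evil_c2|block"]
bad_content = [r"291-B|\\.\pipe\other_c2"]   # malformed: missing 'action' field

print(load_rules(good_content))   # parses as intended
print(load_rules(bad_content))    # raises IndexError: an unhandled fault
# In user mode this only kills one process; in a kernel-mode driver an
# equivalent invalid memory access triggers a bugcheck, i.e. a BSOD.
```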

How Did It Happen?

Everyone, including myself, wants to understand how and why this happened. I have three main questions after seeing the widespread news coverage of this global IT outage:

  1. How was the logic error missed during quality assurance testing?
  2. What is CrowdStrike’s deployment rollout strategy, given the global impact?
  3. How could a third-party vendor cause a system crash, and are there any controls in place on Windows machines to prevent this?

We can’t be entirely certain about the factors that contributed to this failure. However, here are some of my speculations based on the numerous posts I’ve read about this outage.

Inadequate quality checks in testing: a bad tradeoff between speed and risk

I believe that, as a security company, CrowdStrike operates at a high velocity in software development and deployment to address the latest security threats; I can attest to this from my own experience working for a security company in the past. It means they have to push out fixes and patches very frequently.

Due to this high velocity of development and deployment, the workload for QA and developers can be immense, potentially leading to gaps in quality assurance testing as teams try to strike a balance between speedy delivery and the risk of inadequate quality checks. As a result of this tradeoff, an update containing a logic error eventually made its way into the production environment.

A good deployment pattern might be missing

Even if faulty code is pushed to the production environment, a good deployment strategy built on mature release patterns should minimize downtime and risk by surfacing problems in the early phases of the rollout.

This is the second aspect we need to consider. I am confident that CrowdStrike’s operations team is familiar with effective deployment patterns, such as canary and staged deployments. For instance, a canary deployment rolls changes out to a limited group of users first so they can be evaluated before going wider.

If canary deployment had been used, it should have confined the impact to a small number of users and minimized downtime. What is surprising about this incident is how quickly it spread and the apparent lack of any controlled rollout strategy. It is unclear whether the patch that caused the incident was intended to address an emerging zero-day security issue, but deploying regular changes on a Friday could also be a point of concern.

Potentially poor integration between the Windows kernel and security vendors

Another question I had was why a third-party security vendor like CrowdStrike could cause an operating system crash. Doesn’t this suggest a missing basic security control, such as least privilege management in Windows? Upon reviewing how CrowdStrike Falcon operates, it appears that Falcon drivers function at the kernel level, granting them high privileges and direct access to hardware and system resources. This means Falcon is likely loaded early during system boot to detect security issues at the kernel level.

Although I’m not entirely familiar with how the Windows kernel interacts with and manages security software that operates at the kernel level, it seems crucial for Microsoft to ensure that security tools can integrate deeply without compromising kernel stability or security. Granting high privileges to security software should be carefully managed to maintain overall system integrity. A failure in security software should NOT cause a BSOD if a proper extensibility control is in place.

How to Avoid Similar Issues in the Future

You might be surprised to learn that this isn’t the first global tech glitch caused by an update from a security company; a similar incident happened at McAfee back in 2010. No one can guarantee that this will never happen again, but how do we avoid similar issues, or at least lower the risks?

Balancing Speed and Quality in Security Operations

For security companies, achieving a balance between rapid threat detection and thorough quality control is crucial. While swift deployment of new security changes to protect customers is vital, it’s equally important to ensure the accuracy and reliability of these changes. Coming back to the CrowdStrike incident, thorough quality assurance testing could have caught the defect before the update was pushed out to production.

Here are some key practices for improving quality assurance while still meeting speedy delivery requirements (a minimal test sketch follows the list):

  1. Develop extensive test coverage, expand it when new changes are added
  2. Maximize test automation to minimize engineering overhead
  3. Implement tests early in CI/CD pipeline for “shift left” approach
  4. Establish core peer review guidelines for change management
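
As an illustration of points 2 and 3, here is a minimal sketch of what an automated, shift-left content check could look like: a validation step that runs in CI and fails the build if the detection content is malformed. The channel-file format, path, and rule fields below are my own assumptions for illustration, not CrowdStrike’s actual pipeline.

```python
# Hypothetical CI check that validates detection content before release.
# The pipe-delimited 'id|pattern|action' format is an illustrative assumption.
import re

REQUIRED_FIELDS = 3
VALID_ACTIONS = {"alert", "block"}

def validate_channel_file(lines):
    """Return a list of problems; an empty list means the content is shippable."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.strip().split("|")
        if len(fields) != REQUIRED_FIELDS:
            problems.append(f"line {lineno}: expected {REQUIRED_FIELDS} fields, got {len(fields)}")
            continue
        rule_id, pattern, action = fields
        if not rule_id or not pattern:
            problems.append(f"line {lineno}: empty rule id or pattern")
        if action not in VALID_ACTIONS:
            problems.append(f"line {lineno}: unknown action '{action}'")
        try:
            re.compile(pattern)          # reject patterns the engine cannot parse
        except re.error as exc:
            problems.append(f"line {lineno}: invalid pattern ({exc})")
    return problems

def test_channel_file_is_well_formed():
    # Runs in the CI pipeline (e.g. via pytest) before the content is published.
    with open("content/C-00000291.cfg") as f:    # hypothetical path
        assert validate_channel_file(f.readlines()) == []
```

The point is not the specific checks but where they run: malformed content is rejected inside the pipeline, long before it reaches a customer’s kernel.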

Selecting the right deployment and release patterns

Organizations like CrowdStrike, which require high-velocity software deployment while minimizing risk and ensuring a smooth user experience, should choose deployment/release patterns that surface issues early and limit the impact of a faulty update.

As mentioned earlier, a canary release is a good fit here: roll the update out to a small subset of systems first, monitor their health, and only then continue the wider deployment.
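
Below is a minimal sketch of what such a rollout loop could look like. The fleet, deploy, and telemetry functions are simulated stand-ins for illustration, not any vendor’s actual API.

```python
# Sketch of a wave-based canary rollout with a health gate between waves.
import random
import time

ROLLOUT_WAVES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet per wave
CRASH_RATE_THRESHOLD = 0.001               # halt if more than 0.1% of hosts crash
SOAK_SECONDS = 5                           # shortened observation window for the demo

def push_update(hosts, update):
    print(f"deploying {update} to {len(hosts)} hosts")

def measure_crash_rate(hosts):
    # Simulated telemetry; a real pipeline would query crash/health metrics.
    return random.choice([0.0, 0.0, 0.0, 0.02])

def rollback(hosts, update):
    print(f"rolling back {update} on {len(hosts)} hosts")

def canary_rollout(fleet, update):
    deployed = 0
    for wave in ROLLOUT_WAVES:
        target = int(len(fleet) * wave)
        push_update(fleet[deployed:target], update)
        time.sleep(SOAK_SECONDS)                              # let the wave soak
        crash_rate = measure_crash_rate(fleet[deployed:target])
        deployed = target
        if crash_rate > CRASH_RATE_THRESHOLD:
            rollback(fleet[:deployed], update)
            raise RuntimeError(f"canary failed at {wave:.0%}: {crash_rate:.1%} crash rate")
    print("rollout completed for the whole fleet")

canary_rollout([f"host-{i}" for i in range(10_000)], "channel-file-update")
```

The wave sizes and threshold here are arbitrary; the important property is that a faulty update is stopped at the first small wave instead of reaching the entire fleet at once.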

Enhanced Kernel APIs for Security Software

Regarding the CrowdStrike incident, one crucial contributing factor is that CrowdStrike Falcon runs at the Windows kernel level, so a crash in this third-party software brings the whole machine down with a BSOD. I believe it also highlights the compatibility and performance difficulties that can arise when integrating advanced security solutions with the Windows kernel.

I am not very familiar with the inner workings of the Windows kernel, but after some research, I think Microsoft could consider the following, where feasible (a conceptual sketch of points 3 and 4 follows the list):

  1. Provide more stable and well-documented kernel APIs for security software vendors. This helps ensure compatibility across different Windows versions and reduces the risk of conflicts. 
  2. Offer extensibility points specifically designed for security software, allowing them to integrate deeply without compromising kernel stability or security.
  3. Implement graceful degradation instead of system-wide failure
  4. Isolate third-party kernel modules in a restricted environment
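
To illustrate what points 3 and 4 could mean in practice, here is a conceptual, user-mode sketch: third-party detection content is evaluated in an isolated worker process, and a faulty rule is disabled rather than allowed to take the whole system down. The rule format is hypothetical, and a real kernel-level extensibility mechanism would of course look very different, but the principle of graceful degradation is the same.

```python
# Conceptual sketch: isolate untrusted detection content and degrade gracefully.
import multiprocessing as mp

def evaluate_rule(rule, event):
    # Hypothetical 'id|pattern|action' rule; unpacking fails if a field is missing.
    _rule_id, pattern, action = rule.split("|")
    return action if pattern in event else "allow"

def evaluate_in_sandbox(rule, event, timeout=2.0):
    """Run a rule in a separate process; treat crashes or hangs as a disabled rule."""
    with mp.Pool(1) as pool:
        try:
            return pool.apply_async(evaluate_rule, (rule, event)).get(timeout)
        except Exception:
            return "rule-disabled"   # graceful degradation, not a system-wide failure

if __name__ == "__main__":
    event = r"\\.\pipe\evil_c2 opened by pid 4242"
    print(evaluate_in_sandbox(r"291-A|\\.\pipe\evil_c2|block", event))   # -> block
    print(evaluate_in_sandbox(r"291-B|\\.\pipe\other_c2", event))        # faulty -> rule-disabled
```

A design like this sacrifices one detection rule rather than the whole machine, which is what an extensibility point for security vendors should aim for.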

Reflections on the incident

Incidents occur, and even the best policies and controls can occasionally overlook issues. Unfortunately, this has happened with CrowdStrike, and the impact has been significant. The key takeaway is to learn from these incidents so that we can improve our Software Development Life Cycle (SDLC) and work towards preventing similar issues in the future or at least reducing the associated risks.