Microsoft is Working with the Security Industry to Prevent Another CrowdStrike Outage

Microsoft published a technical deep dive into what went wrong with the CrowdStrike outage and how it will prevent issues like this in the future. That said, the details about these coming changes are light.

“CrowdStrike recently published a Preliminary Post Incident Review analyzing their outage,” Microsoft vice president David Weston writes in a new post to the Microsoft Security Blog. “CrowdStrike describes the root cause as a memory safety issue—specifically, a read out-of-bounds access violation in the CSagent driver … Our observations confirm CrowdStrike’s analysis that this was a read-out-of-bounds memory safety error in the CrowdStrike developed CSagent.sys driver.”

With the blame fully placed on CrowdStrike, Weston explains that this driver is what’s called a file system filter driver, a type of driver that’s commonly used by security products to scan any new file saved to disk, such as a file downloaded with a web browser. But file system filters can also be used as a “signal for security solutions attempting to monitor the behavior of the system,” he adds. And this is what CrowdStrike does at the kernel level: The CSagent driver is called when “a named pipe creation” operation—basically, when one process attempts to send data to another process—occurs so that it can engage its malicious behavior detection capabilities.

Weston says that this driver is one of four driver modules that CrowdStrike loads, and it receives dynamic control and content updates quite frequently. Thanks to a logic error in an update to this driver, the driver triggered an invalid memory access issue. And because the driver is invoked so frequently, it went from no crashes the day before the outage to over 4 million crashes across over 2 million Windows PCs and servers the day of the outage. (The data he references is a subset of all PCs and servers impacted because only a subset of the user base shares crash reports with Microsoft.)

“Any reliability problem like this invalid memory access issue can lead to widespread availability issues when not combined with safe deployment practices,” he says.

So why use kernel drivers?

According to Weston, security vendors such as CrowdStrike, and Microsoft itself, use kernel drivers for system-wide visibility: loading early during the boot process helps security services detect bootkits and rootkits before user-mode applications load. (CrowdStrike’s driver uses an Early Launch Antimalware (ELAM) capability that Microsoft created so that signed drivers could load as early as possible in the boot process.) These drivers also have special capabilities, such as the ability to block process and file creation. Kernel drivers generally perform better, which is always a concern, but Weston says that code changes made outside of kernel mode in recent years have closed the performance gap. And kernel drivers are tamper resistant, helping protect them from “malware attacks, targeted attacks, and malicious insiders.”

Kernel drivers also lower the potential resilience of the machine on which they’re installed. Because they run at the kernel level, there are far fewer containment and recovery capabilities available when something goes wrong. This was ably demonstrated by the CrowdStrike outage.

To address this problem, Microsoft has been moving complex core services from the kernel to user mode in recent years, most notably with the font file parsing changes it made in 2019. It dramatically raised the security defaults in Windows 11 to include TPM 2.0, Secure Boot, VBS, and other protections as the security baseline. And it announced more security advances this past spring.

But it’s not just on Microsoft: Weston notes that security solutions can minimize their use of kernel mode drivers right now by moving updating, content parsing, and other operations into user mode where there are more containment and recovery options. As he notes, Windows provides several user mode protections for anti-tampering, including Virtualization-based security (VBS) Enclaves and Protected Processes, ETW events, and user-mode interfaces like Antimalware Scan Interface for event visibility. “These robust mechanisms can be used to reduce the amount of kernel code needed to create a security solution, which balances security and robustness,” he says.
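To make the containment point concrete, here is a minimal Rust sketch of what parsing a content update in user mode could look like. Everything in it is hypothetical: the `Content` type, the one-byte-count file format, and the `parse_content` and `apply_update` functions are invented for illustration and do not describe how CrowdStrike or Windows actually handle updates. The idea is only that a malformed update becomes an ordinary error, and the process keeps running on the last known-good content.

```rust
// Hypothetical sketch: the format and all names here are made up for illustration.

#[derive(Debug, Clone, PartialEq)]
struct Content {
    rules: Vec<u32>,
}

/// Parse an invented format: the first byte is the rule count, followed by
/// exactly that many 4-byte little-endian values.
fn parse_content(bytes: &[u8]) -> Result<Content, String> {
    let (&count, rest) = bytes.split_first().ok_or("empty update")?;
    let expected = count as usize * 4;
    if rest.len() != expected {
        return Err(format!(
            "expected {} payload bytes, found {}",
            expected,
            rest.len()
        ));
    }
    let rules = rest
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(Content { rules })
}

/// Accept the update only if it parses; otherwise keep the current content.
fn apply_update(current: Content, bytes: &[u8]) -> Content {
    match parse_content(bytes) {
        Ok(new_content) => new_content,
        Err(reason) => {
            eprintln!("update rejected: {}", reason);
            current
        }
    }
}

fn main() {
    let current = Content { rules: vec![1] };
    // This update claims two rules but carries only one, so it is rejected.
    let bad_update = [2u8, 7, 0, 0, 0];
    let after = apply_update(current.clone(), &bad_update);
    println!("content unchanged after bad update: {}", after == current);
}
```

In a user-mode process, a failure like this can be logged and retried. The same mistake inside a kernel driver has far fewer places to land safely.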

Behind the scenes, Microsoft engages with security companies through its Microsoft Virus Initiative (MVI) industry forum.

“Microsoft works with members of MVI to ensure compatibility with Windows updates, improve performance, and address reliability issues,” he continues. “MVI partners actively participating in the program contribute to making the ecosystem more resilient and gain benefits including technical briefings, feedback loops with Microsoft product teams, and access to anti-malware platform features such as ELAM and Protected Processes. Microsoft also provides runtime protection such as Patch Guard to prevent disruptive behavior from kernel driver types like anti-malware.”

Of course, the CrowdStrike outage has raised awareness of the need to better protect our worldwide computing infrastructure and prevent future attacks based on information gleaned from CrowdStrike’s mistakes. And the first step is to make sure that security vendors take advantage of the many advances Microsoft has made to Windows in recent years.

“We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability,” he says. “This includes helping the ecosystem by reducing the need for kernel drivers to access important security data, providing enhanced isolation and anti-tampering capabilities with technologies like our recently announced VBS enclaves, and enabling zero trust approaches like high integrity attestation which provides a method to determine the security state of the machine based on the health of Windows native security features.”

More vaguely, Weston also noted Microsoft’s work to bring Rust to the Windows kernel as part of its Secure Future Initiative (SFI). And while he didn’t say that Microsoft would expand its use of Rust in the kernel, the language is memory-safe, and its use in kernel drivers would likely have prevented the CrowdStrike outage.
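For readers wondering what memory safety buys here, the following is a small sketch of a bounds-checked read in Rust. It is ordinary user-mode code, not kernel code, and the `read_field` function and sample data are hypothetical; nothing here reflects the actual CSagent.sys logic. It shows only that asking for an element past the end of a slice yields `None` rather than reading whatever memory happens to sit there.

```rust
// Hypothetical sketch: these names do not come from CrowdStrike or Microsoft.

/// Return the field at `index`, or `None` when the index is past the end.
fn read_field(fields: &[u32], index: usize) -> Option<u32> {
    fields.get(index).copied()
}

fn main() {
    let fields = [10, 20, 30];

    // An in-range read succeeds.
    match read_field(&fields, 1) {
        Some(value) => println!("field 1 = {}", value),
        None => println!("field 1 is missing"),
    }

    // An out-of-range read is reported instead of touching adjacent memory.
    match read_field(&fields, 7) {
        Some(value) => println!("field 7 = {}", value),
        None => println!("field 7 is missing, skipping this rule"),
    }
}
```

One caveat: plain indexing such as `fields[7]` is also bounds-checked in Rust, but it panics on failure, and how a panic is handled depends on the environment. The `get` form is the one that lets the caller decide how to recover.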

Thurrott