Microsoft is Working with the Security Industry to Prevent Another CrowdStrike Outage

Paul Thurrott
Jul 28, 2024

Microsoft published a technical deep dive into what went wrong with the CrowdStrike outage and how it will prevent issues like this in the future. That said, the details about these coming changes are light.

“CrowdStrike recently published a Preliminary Post Incident Review analyzing their outage,” Microsoft vice president David Weston writes in a new post to the Microsoft Security Blog. “CrowdStrike describes the root cause as a memory safety issue—specifically, a read out-of-bounds access violation in the CSagent driver … Our observations confirm CrowdStrike’s analysis that this was a read-out-of-bounds memory safety error in the CrowdStrike developed CSagent.sys driver.”

With the blame fully placed on CrowdStrike, Weston explains that this driver is what’s called a file system filter driver, a type of driver that’s commonly used by security products to scan any new file saved to disk, such as a file downloaded with a web browser. But file system filters can also be used as a “signal for security solutions attempting to monitor the behavior of the system,” he adds. And this is what CrowdStrike does at the kernel level: The CSagent driver is called whenever “a named pipe creation” operation occurs (basically, when one process sets up a channel to send data to another process) so that it can engage its malicious behavior detection capabilities.

Weston says that this driver is one of four driver modules that CrowdStrike loads, and it receives dynamic control and content updates quite frequently. A logic error in one such update triggered an invalid memory access. And because the driver is invoked so frequently, it went from no crashes the day before the outage to over 4 million crashes across over 2 million Windows PCs and servers on the day of the outage. (The data he references covers only a subset of the PCs and servers impacted, because only a subset of the user base shares crash reports with Microsoft.)

“Any reliability problem like this invalid memory access issue can lead to widespread availability issues when not combined with safe deployment practices,” he says.
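Weston doesn't spell out what he means by safe deployment practices, but one common reading is a staged rollout, where an update goes to a small group of machines first and stops if they start crashing. Here is a minimal sketch of that idea in Rust. The ring names, machine counts, crash counts, and the 1 percent threshold are all made up for illustration; none of this comes from Microsoft or CrowdStrike.

```rust
/// One deployment ring: a named group of machines (all values hypothetical).
struct Ring {
    name: &'static str,
    machines: u32,
    crashes: u32, // crashes observed after the update reached this ring
}

/// Rolls an update through `rings` in order and returns the names of the
/// rings that received it. Stops after the first ring whose crash rate
/// exceeds `max_crash_rate`, so later rings never get the update.
fn staged_rollout(rings: &[Ring], max_crash_rate: f64) -> Vec<&'static str> {
    let mut deployed = Vec::new();
    for ring in rings {
        deployed.push(ring.name);
        let rate = if ring.machines == 0 {
            0.0
        } else {
            f64::from(ring.crashes) / f64::from(ring.machines)
        };
        if rate > max_crash_rate {
            break; // halt: do not expose later rings to the bad update
        }
    }
    deployed
}

fn main() {
    // Made-up numbers: the canary ring crashes heavily, so the broad
    // ring should never receive the update.
    let rings = [
        Ring { name: "internal", machines: 100, crashes: 0 },
        Ring { name: "canary", machines: 1_000, crashes: 400 },
        Ring { name: "broad", machines: 1_000_000, crashes: 0 },
    ];
    let reached = staged_rollout(&rings, 0.01);
    println!("update reached: {reached:?}");
}
```

In this sketch the bug still ships, but it stops at the canary ring instead of reaching every machine at once, which is the difference between a reliability problem and a widespread availability problem.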

So why use kernel drivers?

According to Weston, security vendors like CrowdStrike, and Microsoft itself, use kernel drivers for system-wide visibility: Loading early during the boot process helps security services detect boot kits and root kits before user-mode applications load. (CrowdStrike’s driver uses an Early Launch Antimalware (ELAM) capability that Microsoft created so that signed drivers could load as early as possible in the boot process.) These drivers have special capabilities, such as the ability to block process and file creation. Kernel drivers generally perform better, which is always a concern, but Weston says that improvements to code running outside of kernel mode in recent years have closed the performance gap. And kernel drivers are tamper resistant, helping protect them from “malware attacks, targeted attacks, and malicious insiders.”

But kernel drivers also lower the potential resilience of the machine on which they’re installed. Because they run at the kernel level, there are far fewer containment and recovery capabilities available when something goes wrong. This was ably demonstrated by the CrowdStrike outage.

To address this problem, Microsoft has been moving complex core services from the kernel to user mode in recent years, most notably with the font file parsing changes it made in 2019. It dramatically raised the security defaults in Windows 11 to include TPM 2.0, Secure Boot, VBS, and other protections as the security baseline. And it announced more security advances this past spring.

But it’s not just on Microsoft: Weston notes that security solutions can minimize their use of kernel mode drivers right now by moving updating, content parsing, and other operations into user mode where there are more containment and recovery options. As he notes, Windows provides several user mode protections for anti-tampering, including Virtualization-based security (VBS) Enclaves and Protected Processes, ETW events, and user-mode interfaces like Antimalware Scan Interface for event visibility. “These robust mechanisms can be used to reduce the amount of kernel code needed to create a security solution, which balances security and robustness,” he says.

Behind the scenes, Microsoft engages with security companies through its Microsoft Virus Initiative (MVI) industry forum.

“Microsoft works with members of MVI to ensure compatibility with Windows updates, improve performance, and address reliability issues,” he continues. “MVI partners actively participating in the program contribute to making the ecosystem more resilient and gain benefits including technical briefings, feedback loops with Microsoft product teams, and access to anti-malware platform features such as ELAM and Protected Processes. Microsoft also provides runtime protection such as Patch Guard to prevent disruptive behavior from kernel driver types like anti-malware.”

Of course, the CrowdStrike outage has raised awareness of the need to better protect our worldwide computing infrastructure and prevent future attacks based on information gleaned from CrowdStrike’s mistakes. And the first step is to make sure that security vendors take advantage of the many advances Microsoft has made to Windows in recent years.

“We plan to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability,” he says. “This includes helping the ecosystem by reducing the need for kernel drivers to access important security data, providing enhanced isolation and anti-tampering capabilities with technologies like our recently announced VBS enclaves, and enabling zero trust approaches like high integrity attestation which provides a method to determine the security state of the machine based on the health of Windows native security features.”

More vaguely, Weston also noted Microsoft’s work to bring Rust to the Windows kernel as part of its Secure Future Initiative (SFI). And while he didn’t claim that Microsoft would expand its use of Rust in the kernel, this language is memory-safe and its use in kernel drivers would likely have prevented the CrowdStrike outage.
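To make the memory safety point concrete, here is a small Rust sketch of a bounds-checked read. It is not CrowdStrike's code or anything Microsoft has published: the `read_field` function and the three-field "update" are hypothetical stand-ins for a driver reading a value out of a content update.

```rust
/// Hypothetical stand-in for reading one field from a content update.
/// Returns the field at `index`, or `None` if the update has no such field.
fn read_field(fields: &[u32], index: usize) -> Option<u32> {
    // `get` checks the index against the slice length before reading,
    // so it never touches memory past the end of the buffer.
    fields.get(index).copied()
}

fn main() {
    let update = [10, 20, 30]; // made-up content with three fields

    // An in-bounds read succeeds.
    assert_eq!(read_field(&update, 1), Some(20));

    // An out-of-bounds read is reported as `None`, which the caller
    // can treat as a malformed update and reject.
    match read_field(&update, 7) {
        Some(value) => println!("field 7 = {value}"),
        None => println!("update rejected: field 7 is missing"),
    }
}
```

A caveat: plain indexing like `fields[7]` in safe Rust would not read invalid memory either, but it would panic, and a panic in a kernel driver is still a failure. The benefit shown here comes from the checked `get` call, which turns a bad index into an ordinary error that the code can handle.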
