CrowdStrike: cyber resilience without a bad actor

Four weeks after the CrowdStrike Falcon update that took 8.5 million Windows machines offline, the post-mortem is in. The interesting question is not what CrowdStrike did wrong. It is what the rest of us did wrong by assuming this kind of event could not happen.

On 19 July 2024, a faulty content update from CrowdStrike's Falcon endpoint protection platform took approximately 8.5 million Windows machines offline globally in the space of a few hours. Airports, hospitals, banks, broadcasters, supermarkets, retailers, government departments, and emergency services were all affected. The NHS reported significant impact across primary care, with GP practices and pathology services struggling for the rest of that day and into the following week. Sky News went off air. In many cases the fix required hands-on local intervention on each affected machine: booting into Safe Mode and removing a specific file. For organisations with tens of thousands of devices spread across multiple sites, the recovery took days.

CrowdStrike's Root Cause Analysis has now been published. The technical detail is worth reading: a faulty channel file triggered an out-of-bounds memory read in the Falcon kernel-mode driver, and it was delivered as a content update that did not go through the staged rollout applied to full sensor releases. The more useful question, for any board sitting in this morning's executive meeting, is not what CrowdStrike did wrong. It is what the rest of us did wrong by assuming this kind of event could not happen.

Why this matters beyond CrowdStrike

For most of the past decade, cyber resilience has been understood as resilience against attackers. Patch the vulnerabilities, train the staff, monitor for anomalies, plan the incident response. The CrowdStrike event involved no attacker. It was caused by a routine content update from a trusted vendor, running in kernel mode by design, against a customer base that had, entirely correctly, configured the product to update automatically.

The cyber industry sold automatic updates from trusted vendors as a security best practice. It is. It is also, in this case, the mechanism that took 8.5 million machines offline. The lesson is not that automatic updates were wrong. The lesson is that we had been treating "trusted vendor" as the same thing as "cannot make a critical mistake", and that assumption has now been falsified at the largest possible scale.

Three structural things that need looking at

Single-vendor concentration on the security agent. CrowdStrike Falcon is, alongside Microsoft Defender for Endpoint and SentinelOne, one of three dominant kernel-mode endpoint security agents in the enterprise market. For a large organisation that has standardised on one of them, the agent is running on essentially every machine. A faulty update to that agent has a blast radius equal to the entire estate. For Cyber Essentials purposes, having an EDR is part of the malware control. For resilience purposes, having only one may now be the wrong answer for organisations whose service availability matters at national-infrastructure scale.

This does not necessarily mean running two EDRs in parallel — that is operationally complex and not always achievable. It does mean modelling the agent failure scenario as a real risk, alongside the cyber attack scenario, when evaluating EDR procurement.
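As a rough illustration of what modelling the agent failure scenario could look like, the sketch below computes, from an asset inventory, what share of the estate each kernel-mode agent could take down and which roles would have no surviving machine. The inventory format, host names, and agent names are all hypothetical; a real estate would pull this data from its asset management or MDM tooling.

```python
from collections import defaultdict

# Hypothetical inventory: each device lists the kernel-mode agents installed on it.
INVENTORY = [
    {"host": "ward-pc-01", "role": "clinical", "agents": {"edr-a", "vpn-x"}},
    {"host": "ward-pc-02", "role": "clinical", "agents": {"edr-a"}},
    {"host": "soc-analyst-01", "role": "security", "agents": {"edr-a", "vpn-x"}},
    {"host": "soc-recovery-01", "role": "security", "agents": {"edr-b"}},
]


def blast_radius(inventory):
    """Return {agent: fraction of devices lost if that agent crashed every host it runs on}."""
    counts = defaultdict(int)
    for device in inventory:
        for agent in device["agents"]:
            counts[agent] += 1
    total = len(inventory)
    return {agent: n / total for agent, n in counts.items()}


def roles_without_survivor(inventory, agent):
    """Return the roles in which every device runs the given agent."""
    runs_agent_by_role = defaultdict(list)
    for device in inventory:
        runs_agent_by_role[device["role"]].append(agent in device["agents"])
    return sorted(role for role, flags in runs_agent_by_role.items() if all(flags))


if __name__ == "__main__":
    for agent, share in sorted(blast_radius(INVENTORY).items()):
        print(f"{agent}: {share:.0%} of estate")
    print("Roles with no surviving machine if edr-a fails:",
          roles_without_survivor(INVENTORY, "edr-a"))
```

In this made-up inventory the security team keeps one differently configured machine, so only the clinical role is left with nothing to work from if the dominant agent fails.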

Kernel-mode operation and the supplier's role. Endpoint security agents operate in kernel mode by necessity, so that they can inspect system calls and intervene against malware before it executes. That necessary privilege also means that a faulty agent can crash the kernel, which is exactly what happened. Microsoft has signalled it will reconsider the kernel-level access provided to third-party security products on Windows, partly in response to this event. That conversation has been going on for years, and Apple has already made the move, with macOS deprecating kernel extensions in favour of user-mode system extensions. After Falcon, expect the Windows direction of travel to accelerate.

Update cadence and canary discipline at the vendor. CrowdStrike's content updates were shipped with less rigorous canarying than the platform's full software updates. The RCA notes that this is being changed — content updates will now flow through a more conservative deployment ring system, similar to how operating system updates are deployed. The lesson for buyers is to ask vendors specifically about their update deployment ring discipline — not whether they have canaries, but how the canary process works for emergency content updates, threat intelligence updates, and signature definitions.
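For readers who want a concrete picture of ring discipline, the sketch below shows the general idea of a staged rollout gate: deploy to a small ring, check health, and halt before the next ring if too many hosts fail. It is a minimal illustration and not CrowdStrike's implementation; the `Ring` type, the thresholds, and the `deploy` and `healthy` callbacks are all assumptions, and a real pipeline would also wait for a soak period between rings.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    name: str
    hosts: list[str]
    max_failure_rate: float  # halt the rollout if this ring exceeds it


def staged_rollout(rings, deploy, healthy):
    """Deploy ring by ring and stop at the first ring that fails its health gate.

    Returns (names of rings that passed, name of the ring that halted or None).
    """
    completed = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        failures = sum(1 for host in ring.hosts if not healthy(host))
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > ring.max_failure_rate:
            return completed, ring.name  # later rings never receive the update
        completed.append(ring.name)
    return completed, None


if __name__ == "__main__":
    rings = [
        Ring("canary", ["c1", "c2"], 0.0),
        Ring("early", ["e1", "e2", "e3"], 0.05),
        Ring("broad", [f"b{i}" for i in range(10)], 0.05),
    ]
    touched = []
    # Simulate an update that crashes every host it reaches.
    passed, halted_at = staged_rollout(rings, touched.append, lambda host: False)
    print("passed:", passed, "halted at:", halted_at, "hosts touched:", len(touched))
```

The point for buyers is the last line of the simulation: with a gate in place, a fully broken update reaches only the canary hosts. The question to put to a vendor is whether content and signature updates pass through a gate of this kind, or only full software releases.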

What boards should ask this quarter

Five questions for the executive briefing.

If a routine update from one of our top five software vendors caused our entire estate to become unavailable, how long would we be down, and how would we recover? This is the Falcon question generalised. The answer should be a time, a process, and an honest acknowledgement of which parts of the estate would be slow to recover.

What is our blast radius from each of our top three security agents? CrowdStrike was the headline. The same shape of risk applies to Microsoft Defender, SentinelOne, Sophos, and to a number of less-visible kernel-mode agents (some MDM clients, some VPN clients, some backup agents).

Do we have at least one machine per critical role that does not run the same agents as the rest? This is not a security recommendation in the usual sense. It is a resilience recommendation. A blue team's emergency machine that is configured differently, can boot in Safe Mode quickly, and can be used to run recovery scripts is a small one-off investment.

How quickly can we initiate a remote remediation on every affected machine? If the answer involves physical access to every machine, the answer is days to weeks at scale. The firms that recovered fastest from CrowdStrike had remote management tooling that could push the fix without requiring local intervention. Most firms did not.

Do our contracts with major software vendors include clauses requiring them to participate in incident response and remediation in the event of a vendor-caused outage? Most do not. Many should.
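The first and fourth questions above come down to arithmetic that is worth doing before the briefing rather than during an outage. The sketch below is a back-of-envelope estimate only: the function name and every figure in it (device count, minutes per device, number of technicians) are hypothetical, and it ignores travel between sites, BitLocker key retrieval, and prioritisation of critical systems.

```python
import math


def recovery_days(devices, minutes_per_device, technicians, hours_per_day=8.0):
    """Working days needed to touch every device, rounded up to whole days."""
    total_minutes = devices * minutes_per_device
    minutes_available_per_day = technicians * hours_per_day * 60
    return math.ceil(total_minutes / minutes_available_per_day)


if __name__ == "__main__":
    # Hypothetical estate: 20,000 devices and 25 technicians.
    hands_on = recovery_days(devices=20_000, minutes_per_device=15, technicians=25)
    # Same estate, assuming remote tooling cuts the effort to 2 minutes per device.
    remote = recovery_days(devices=20_000, minutes_per_device=2, technicians=25)
    print(f"Hands-on fix: about {hands_on} working days")
    print(f"Remote fix:   about {remote} working days")
```

With these assumed figures the hands-on route takes 25 working days and the remote route takes 4. The exact numbers matter less than having an agreed estimate, with its assumptions written down, before the event.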

The wider question for the industry

The cyber industry will spend the next few years adjusting to the implications of Falcon. EDR vendors will tighten their update processes. Microsoft will reconsider kernel access. Customers will reconsider single-vendor concentration. Regulators will look at the systemic risk of a small number of vendors having simultaneous kernel access to most of the world's critical machines.

For UK boards, the most useful framing is to add vendor-caused unavailability to the resilience risk register, alongside attacker-caused compromise. They are different shapes of risk, they require different mitigations, and they have been weighted unequally for too long. CrowdStrike is the event that should rebalance the weighting.

This is not because CrowdStrike are villains; they are a respected security vendor that had a very bad day. It is because what they accidentally demonstrated is the shape of an attack that future attackers will study, and that future supply chain pressures will replicate. The next event of this shape may not be accidental.