Spectre and Meltdown · Peter Bassill

The Spectre and Meltdown disclosures broke yesterday, ahead of the agreed coordinated-disclosure date of the 9th, after The Register published a story on the 2nd that forced the coordinated rollout forward (meltdownattack.com paper, spectreattack.com paper). There are three CVEs: CVE-2017-5754 (Meltdown), CVE-2017-5715 (Spectre Variant 2), and CVE-2017-5753 (Spectre Variant 1). The vulnerabilities are in the speculative-execution and out-of-order execution behaviours of essentially every modern processor (Intel, AMD, ARM, and several others), and they exploit timing side-channels in the cache hierarchy to leak information across security boundaries that the architecture is supposed to maintain.

The technical content is interesting in a way that hardware vulnerabilities rarely are, because the affected behaviour is not a bug but a deliberate design choice. Modern processors execute instructions speculatively — predicting branch outcomes, executing the predicted path before the prediction is verified, and discarding the speculative results if the prediction was wrong. The discarding is architectural — the registers and memory state that an unprivileged program can observe revert to what they would have been without the speculation — but the cache state, microarchitecturally, retains traces of the speculatively-executed memory accesses. By measuring the timing of subsequent cache accesses, a sufficiently careful attacker can infer information about what the speculative execution touched. Meltdown specifically uses this to read memory across the user-kernel boundary on Intel processors that perform speculative loads from privileged memory regions before checking access permissions. Spectre uses similar techniques to read memory across application boundaries by manipulating branch predictors and the speculative execution they drive.
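To make the mechanism concrete, here is a toy simulation of the leak. It is a sketch under heavy assumptions, not exploit code: the `ToyCPU` class, the address `0x1000`, the secret value, and the latency numbers are all invented for illustration, and nothing here measures real hardware. It models only the one property that matters: the rollback restores the register, but not the cache.

```python
# Toy model only: every name and number here is hypothetical.
SECRET_MEMORY = {0x1000: 42}   # made-up privileged address -> secret byte
STRIDE = 4096                  # one probe slot per possible byte value


class ToyCPU:
    def __init__(self):
        self.register = 0      # architectural state the program can observe
        self.cache = set()     # microarchitectural state: cached probe offsets

    def speculative_read(self, address):
        """Model a speculated path that reads a secret and is then squashed."""
        saved = self.register
        self.register = SECRET_MEMORY[address]    # transient load of the secret
        self.cache.add(self.register * STRIDE)    # dependent access fills a cache line
        self.register = saved                     # rollback restores the register...
        # ...but the cache line added above is left in place. That is the leak.

    def access_time(self, offset):
        """Return an invented latency: fast if cached, slow otherwise."""
        return 10 if offset in self.cache else 200


def recover_byte(cpu):
    """Attacker side: time every probe slot and pick the fastest one."""
    timings = {value: cpu.access_time(value * STRIDE) for value in range(256)}
    return min(timings, key=timings.get)


cpu = ToyCPU()
cpu.speculative_read(0x1000)
leaked = recover_byte(cpu)     # the secret byte, recovered from timing alone
```

After the run, `cpu.register` is back to its original value, so architecturally nothing happened, yet `leaked` holds the secret byte. Real attacks have to deal with noise, prefetchers, and timer resolution, none of which this model attempts to represent.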

The deployment population is "essentially all current computing infrastructure": server-side Intel Xeon processors (the bulk of cloud and on-premises virtualisation), client-side Intel Core (most laptops and desktops), AMD EPYC and Ryzen with somewhat different exposure, ARM cores with exposure that again varies across the manufacturers using them, and IBM POWER. Specific processor families have specific exposures and specific patches; the patch landscape is, in early January 2018, a confusing matrix of microcode updates, OS kernel changes, hypervisor changes, and application-level workarounds depending on which combination of variant, processor, and stack a given system uses.

The performance impact of the mitigations is what has the customer estates concerned. The KPTI / KAISER kernel-page-table-isolation patches that mitigate Meltdown impose a performance cost on workloads that make many syscalls: database servers, web servers, and similar workloads see 5-30% degradation in early reports, depending on workload profile, with database servers at the high end. The Spectre Variant 2 mitigations (retpoline, IBRS) impose additional costs that vary by processor generation, with older Intel Xeon E5 v3 and v4 generations seeing larger impacts than newer Skylake/Kaby Lake parts. The aggregate operational cost of the mitigations is substantial and is going to drive infrastructure-capacity conversations across customer estates for the rest of the year.
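For the capacity conversations, the arithmetic is worth spelling out, because the extra capacity needed is larger than the degradation figure suggests. A rough sketch, assuming throughput scales linearly with host count and that every host degrades by the same percentage (both simplifications; `hosts_needed` is a name made up for this example):

```python
def hosts_needed(current_hosts, degradation_pct):
    """Hosts required to keep the same aggregate throughput after mitigation.

    Assumes each mitigated host delivers (100 - degradation_pct)% of its
    previous throughput, and that load spreads evenly across hosts.
    """
    if not 0 <= degradation_pct < 100:
        raise ValueError("degradation_pct must be in the range [0, 100)")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-current_hosts * 100 // (100 - degradation_pct))


def extra_hosts(current_hosts, degradation_pct):
    """Additional hosts to schedule on top of the current estate."""
    return hosts_needed(current_hosts, degradation_pct) - current_hosts
```

Under those assumptions, `extra_hosts(100, 30)` comes to 43: a 30% degradation needs roughly 43% more hosts, not 30%, because each new host is itself running at 70%. At 5% the gap is small (`extra_hosts(100, 5)` is 6), which is why the workload profile matters so much to the capacity plan.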

The immediate operational task, this week and next, is the patch-state inventory for the customer estates. Identify every host with an affected processor, Intel or otherwise, across every customer estate. Determine the current patch state for OS kernel, hypervisor (where present), microcode, and application stack. Apply patches in the priority order recommended by the affected vendors. Test for performance impact and capacity implications. Schedule additional capacity where the post-mitigation performance breaks SLA. The work is substantial; I am estimating four to six weeks of dedicated time across the team.
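For the Linux part of that inventory, one possible building block is sketched below. It assumes a kernel new enough to expose per-vulnerability status files under `/sys/devices/system/cpu/vulnerabilities`; older kernels do not have this directory, in which case the function returns an empty result and the patch state has to be established some other way. The function names are my own, and this covers only what the kernel reports about itself, not hypervisor or application-stack state.

```python
from pathlib import Path

# Present only on kernels that expose this sysfs interface.
DEFAULT_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory=DEFAULT_DIR):
    """Return {vulnerability_name: kernel_reported_status} for one host.

    Returns an empty dict if the directory does not exist, which should be
    treated as "unknown", not as "not affected".
    """
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def unmitigated(status):
    """Names whose status text begins with 'Vulnerable'."""
    return sorted(name for name, text in status.items()
                  if text.startswith("Vulnerable"))
```

Running `unmitigated(read_mitigation_status())` on each host and collecting the results centrally gives a first-pass view of the kernel column of the matrix; the hypervisor and application columns still need their own sources.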

For the architectural conversation, Spectre and Meltdown reframe a question that has been latent in the security community for some years. The hardware-software boundary has, since the early days of computing, been treated as a place where the abstractions are reliable: the hardware presents a well-defined architectural model to software, and the security properties of that model can be reasoned about formally. The microarchitectural reality has been understood by hardware designers for years (cache timing side-channels are not new; there is academic literature going back to the late 1990s, including Osvik, Shamir and Tromer on cache attacks against AES circa 2005), but the conversion of those microarchitectural realities into practically-exploitable cross-domain attacks against general-purpose code has been, until this disclosure, more theoretical than operational. The disclosures have made the gap concrete. The implications for multi-tenant cloud architectures, for browser-side sandboxing of untrusted JavaScript, and for any other context where untrusted code shares a processor with privileged data are substantial. The remediation strategy is not one fix; it is a long sequence of architectural, microcode, and software changes that will run for years and may require, at the limit, a generational shift in processor design.

For our customer estates, the pragmatic priority order this week is as follows. First, public-cloud workloads, because the cloud-provider mitigations have already been applied (AWS, Azure, GCP all rolled out hypervisor patches in late December/early January) and the customer-side action is patching the guest OSes. Second, internet-facing on-premises infrastructure, because the attack surface is exposed and the patching is straightforward. Third, internal infrastructure, where the threat model is more nuanced and the prioritisation can be made on a per-asset basis. Fourth, embedded systems and appliances where the patching path may not exist; those need compensating controls and a longer-term roadmap.
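The four-tier ordering above can be sketched as a simple sort over the asset inventory. The `Asset` type, the tier labels, and the `patch_available` flag are all hypothetical names for this illustration; a real inventory will have its own schema.

```python
from dataclasses import dataclass

# Tier labels are invented for this sketch; the order mirrors the four tiers.
TIER_ORDER = [
    "public_cloud",
    "internet_facing_on_prem",
    "internal",
    "embedded_or_appliance",
]


@dataclass
class Asset:
    name: str
    tier: str
    patch_available: bool = True


def patch_queue(assets):
    """Order assets by tier, then by name for a stable, readable queue."""
    return sorted(assets, key=lambda a: (TIER_ORDER.index(a.tier), a.name))


def needs_compensating_controls(assets):
    """Assets with no patching path, which need controls and a roadmap."""
    return [a for a in assets if not a.patch_available]
```

The per-asset judgement in the third tier is deliberately left out of the sort key; that is a conversation per customer, not something to hard-code.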

The Emily-team conversation this week is about whether speculative-execution side-channel exploitation is detectable from SOC-grade telemetry. The honest answer is: not directly. The exploitation happens microarchitecturally and produces no network-observable signal during the leakage phase. The post-exploitation use of leaked information may produce signals (credential reuse, lateral movement, data exfiltration) that are detectable through normal SOC controls, but the attack itself is invisible from outside the affected system. The customer briefings need to be honest about this — Spectre and Meltdown are not a detection-engineering problem, they are a patching-and-architecture problem.

The longer-form piece on what this means for the next decade of secure computing is something I want to write properly in February or March. The early thoughts are that the hardware-software boundary needs more careful attention than it has historically received, that the formal-verification programme for hardware security properties (which exists but is small) needs much more investment, and that the long-running conversation about confidential-computing architectures (Intel SGX, ARM TrustZone, AMD SEV) is going to be reshaped substantially by what Spectre and Meltdown have demonstrated about the limits of the current model.