Yahoo: half a billion · Peter Bassill

Yahoo issued a press release this afternoon stating that "a recent investigation has confirmed that a copy of certain user account information was stolen from the company's network in late 2014 by what it believes is a state-sponsored actor" (Yahoo 8-K filing and statement, September 22 2016). The affected population is "at least 500 million" accounts. Names, email addresses, telephone numbers, dates of birth, hashed passwords (predominantly bcrypt, with some unspecified earlier hashes), and in some cases security-question answers were taken. Payment-card data and bank-account information are stated to be unaffected; Yahoo says it does not hold them on the affected systems.

The disclosure timing is the part of the story that is going to dominate the next several weeks. Verizon announced the acquisition of Yahoo's core internet business for $4.83 billion on the 25th of July 2016. The acquisition is in the midst of regulatory approval and integration planning. The disclosure today, of a 2014 incident discovered "recently", introduces a substantial uncertainty into the deal. Verizon issued a brief statement noting that they had been informed of the disclosure within the last two days and were assessing the implications (Verizon statement carried by Reuters, September 23). The deal will likely close, with the acquisition price renegotiated downward. The exact mechanism — material adverse change provisions, holdback against indemnity, price adjustment — will be worked out through Q4 of this year and into early 2017.

The technical content of the disclosure raises several questions that the press release does not answer. The "late 2014" date for the compromise places it close to the eBay disclosure (May 2014) and the JP Morgan disclosure (October 2014). The state-actor attribution is unusual in a Yahoo statement: most consumer-internet breach disclosures stop short of attribution language and leave that work to law enforcement and to the security research community. The bcrypt password hashing is the right answer for current accounts. The unspecified earlier hashes are presumably legacy MD5, salted or unsalted; public reporting on similar consumer-internet platforms suggests that accounts created in older cohorts often retain weaker hashing long after the platform itself has moved on. The full 500-million account population almost certainly includes accounts dormant for many years and accounts whose credentials are reused on many other services.
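To make the hashing point concrete, here is a minimal sketch contrasting a single-pass MD5 hash with a deliberately slow, salted one. It is not Yahoo's scheme, which has not been published in detail. bcrypt is not in the Python standard library, so PBKDF2-HMAC-SHA256 stands in for it here as an assumption; the function names and the iteration count are illustrative only.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Illustrative work factor only; a real deployment would tune this to its hardware.
PBKDF2_ITERATIONS = 200_000


def legacy_md5_hash(password: str, salt: bytes = b"") -> str:
    """Legacy-style hash: one fast MD5 pass over salt + password."""
    return hashlib.md5(salt + password.encode("utf-8")).hexdigest()


def slow_hash(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Salted, deliberately slow hash (PBKDF2 as a stdlib stand-in for bcrypt)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per account
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify_slow(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the slow hash with the stored salt and compare in constant time."""
    _, digest = slow_hash(password, salt)
    return hmac.compare_digest(digest, expected)
```

With no salt, two accounts sharing a password share an MD5 hash, so one successful guess unlocks every account in that group, and each guess costs a single fast hash. The slow variant gives every account its own salt and makes each guess cost hundreds of thousands of iterations, which is the property that matters when the hashes are already in an attacker's hands.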

The credential-stuffing implication is substantial. Half a billion email addresses, with considerable overlap with the 167-million LinkedIn corpus surfaced in May, the MySpace 360 million, the Tumblr 65 million, and the various smaller corpora trafficked in the secondary breach market this year, produce an aggregate exposed-credential population that is, in some real sense, indistinguishable from "all internet users". The customer-protection work for organisations whose user populations overlap with that aggregate (which is to say all of them) is the same as it has been all year: enforce MFA, monitor authentication anomalies, and accept that password-only authentication for any sensitive service is no longer defensible.

For the SOC, the inbox will fill tomorrow with the standard post-disclosure customer queries. The shape of those queries is well-understood at this point — the customer wants to know whether their organisation's user accounts are exposed, whether their employees' Yahoo-linked credentials are the trigger for any specific risk, and whether the SOC has detection in place for credential-stuffing against the customer's own authentication endpoints. The answers, in order: yes, almost certainly, given the size of the corpus; the trigger is general not specific; and yes, with the standard caveats about the false-positive rate of credential-stuffing detection at the rate-limit and behavioural levels.
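For readers who want a picture of what rate-limit-level detection means in practice, the following is a minimal sketch and not a description of any particular SOC tooling. It assumes failed-login events arrive per source IP in time order, and it flags an IP once its failures span more distinct usernames than a threshold within a sliding window. The class name, window length, and threshold are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class StuffingDetector:
    """Flags source IPs whose failed logins span many distinct usernames."""

    def __init__(self, window_seconds: float = 300.0, max_distinct_users: int = 20) -> None:
        self.window_seconds = window_seconds
        self.max_distinct_users = max_distinct_users
        # Per-IP queue of (timestamp, username) for recent failed logins.
        self._failures: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def record_failure(self, ip: str, username: str, timestamp: float) -> bool:
        """Record one failed login; return True if the IP exceeds the threshold.

        Assumes timestamps for a given IP are non-decreasing.
        """
        events = self._failures[ip]
        events.append((timestamp, username))
        # Discard events that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while events and events[0][0] < cutoff:
            events.popleft()
        distinct_users = {user for _, user in events}
        return len(distinct_users) > self.max_distinct_users
```

Counting distinct usernames rather than raw failures separates one user mistyping a password ten times from one source trying ten different accounts. The false-positive caveat still applies: a corporate proxy or carrier NAT puts many legitimate users behind a single address, so a threshold like this is a signal for review and not a verdict.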

The wider thought is about disclosure cadence. The Yahoo data was taken in late 2014. Yahoo has, according to the press release, been investigating "for some weeks" leading up to this disclosure. The two-year delay between compromise and public disclosure is at the long end of what we have seen in 2015-2016, and the pattern (LinkedIn 2012-then-2016, MySpace 2008-or-thereabouts to 2016, this Yahoo case) is that consumer-internet platforms hold breach data internally for extended periods before public surfacing. The reasons are complex and not always disreputable (the determination of scope takes time, the legal posture has to be worked out, the operational remediation has to be in place before the disclosure prompts attempted abuse), but the cumulative effect is that the public disclosure substantially lags the actual exposure window. The trust implications of that pattern are not being adequately addressed in the regulatory environment. The GDPR enforcement period, which begins May 2018, will materially change the calculus: personal-data breaches affecting EU residents will have to be notified to the supervisory authority within 72 hours of the organisation becoming aware of them. Yahoo's disclosure, under GDPR, would have been a different document on a different timeline. That shift in disclosure timelines is going to be one of the larger structural consequences of GDPR.

The customer briefings tomorrow will cover the operational implications. The strategic conversation — about disclosure norms, about consumer-internet trust models, about the long-tail dwell time of breaches in the secondary market — is for a longer piece of writing. There is, again, a great deal to say.