Y2K wash-up: what actually broke

A week on from the rollover, enough operator reports have filtered through to form a picture. Time for the wash-up post I promised myself I would write, and then to close out the topic.

What broke loudly

The loudly broken things have been thoroughly reported. Worth listing for completeness:

  • A handful of nuclear-monitoring stations in the United States lost some non-critical telemetry briefly.
  • Some 7-Eleven point-of-sale systems in the US and a smaller number in the UK rejected credit cards on the 1st of January because the year was being processed as 1900.
  • Several Norwegian railway ticket machines refused to operate.
  • A small number of lottery terminals in Italy did the same.
  • A few hundred businesses worldwide reported localised problems with billing software.

None of this is catastrophic. Most of it was fixed within hours. The aggregate global impact, by any reasonable measure, is far smaller than the press coverage anticipated.

The discipline of two years of remediation work, driven by a hard deadline, did its job. This is, on its own terms, a remarkable engineering success — perhaps the largest single deliberate intervention in deployed computing infrastructure in history. The fact that it is now being read as anticlimax is an injustice to the work that prevented the climax.

What broke quietly

This is where the more interesting story is. The reports trickling in from operators I correspond with describe a different shape of failure from the one the public press is covering.

Subtle date-formatting errors. A friend at a UK insurance company tells me that several reports generated on the 4th of January displayed transaction dates as 04/01/100 rather than 04/01/2000. The application's internal handling was correct; the display layer was using a two-digit year format that, post-rollover, produced three-character years. Visible but harmless, and symptomatic of incomplete remediation.
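I have not seen the code in question, so this is a sketch of the canonical form of the bug rather than their actual fix, but in C the pattern is usually struct tm: its tm_year field counts years since 1900, which looks exactly like a two-digit year right up until it isn't one.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm *t = localtime(&now);

        /* Broken: tm_year counts years since 1900, so "%02d" printed
           a tidy two-digit year for a century and then, in January
           2000, printed 100 (three characters). */
        printf("%02d/%02d/%02d\n", t->tm_mday, t->tm_mon + 1, t->tm_year);

        /* Fixed: add the 1900 base back and print all four digits. */
        printf("%02d/%02d/%04d\n", t->tm_mday, t->tm_mon + 1, t->tm_year + 1900);
        return 0;
    }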

Calendar-dependent batch jobs. Another correspondent runs a small business with a custom application that calculates invoice ageing. The application worked correctly across the rollover except for one calculation: any invoice from December 1999 was aged as 99 years old rather than two weeks. The cause was a date-difference function that had been patched in two places but not in a third. The business is ageing invoices by hand while the third place is fixed.
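My correspondent did not share the source, so the following is a sketch in C of the general shape of the bug, not their code; the pivot year of 50 in the fix is my assumption, not theirs.

    #include <stdio.h>

    /* Naive two-digit-year ageing: the kind of function that gets
       patched in two call sites and missed in a third. A December '99
       invoice aged in January '00 comes out 99 years wrong (here -99;
       code that takes the absolute value reports 99). */
    int age_in_years_naive(int invoice_yy, int today_yy)
    {
        return today_yy - invoice_yy;               /* 0 - 99 = -99 */
    }

    /* Windowed fix: widen the year before subtracting. The pivot of
       50 is an assumption for illustration. */
    int expand_year(int yy)
    {
        return (yy < 50) ? 2000 + yy : 1900 + yy;
    }

    int age_in_years_fixed(int invoice_yy, int today_yy)
    {
        return expand_year(today_yy) - expand_year(invoice_yy);  /* 1 */
    }

    int main(void)
    {
        printf("naive: %d\n", age_in_years_naive(99, 0));
        printf("fixed: %d\n", age_in_years_fixed(99, 0));
        return 0;
    }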

Embedded systems that partly failed. Several reports of building-management systems that work normally but have lost their schedule. Lift control panels that work for travel but produce confusing error messages on the maintenance display. A heating controller that works but cannot remember the day-of-week schedule. None of these were on anyone's compliance list and none have safety implications, but they are a real operational nuisance for the buildings affected.

Backup systems with broken date logic. Several operators have reported that their backup systems are failing to find the correct files for restoration because the date logic compares strings rather than parsed dates, and "00" sorts before "99". The data is intact; the indices are confused. The fix is per-system and is taking real time.
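None of these operators shared their code either, but the mechanism is easy to demonstrate. A minimal sketch in C, assuming YYMMDD stamps, which are common but by no means universal; the 50 pivot is again my assumption.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Widen a YYMMDD stamp to a comparable YYYYMMDD integer. The
       YYMMDD format and the 50 pivot are assumptions for
       illustration; real backup catalogues vary. */
    long expand_stamp(const char *yymmdd)
    {
        int  yy   = (yymmdd[0] - '0') * 10 + (yymmdd[1] - '0');
        long mmdd = atol(yymmdd + 2);
        long yyyy = (yy < 50) ? 2000L + yy : 1900L + yy;
        return yyyy * 10000L + mmdd;
    }

    int main(void)
    {
        const char *dec99 = "991231";   /* 31 December 1999 */
        const char *jan00 = "000101";   /*  1 January  2000 */

        /* Broken: lexicographic comparison says the January file is
           older, so a "find the latest backup" lookup picks December. */
        printf("strcmp newest: %s\n",
               strcmp(jan00, dec99) > 0 ? jan00 : dec99);

        /* Fixed: parse and widen the year before comparing. */
        printf("parsed newest: %s\n",
               expand_stamp(jan00) > expand_stamp(dec99) ? jan00 : dec99);
        return 0;
    }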

What did not break, but worried me

A few things I had been quietly anxious about turned out to be fine.

The DNS root. I had no specific reason to think the DNS root servers would have trouble, but the consequences of even brief root unavailability are substantial. They were fine. Lookups across the rollover proceeded without measurable disruption.

The routing infrastructure. Major carrier routers and BGP infrastructure continued working without notable incident. The few minor blips that did show up have been traced to scheduled changes, not to date arithmetic.

Payment networks. Visa, Mastercard, the various national payment switches — all came through cleanly. Considering the volume of date-sensitive code in these systems, this is a significant achievement.

Power generation. The grid in the UK and most other developed countries had no Y2K-related issues. In some less-developed economies there were brief generator-control issues, mostly resolved within hours.

What this teaches me about my own infrastructure

Three things, in increasing order of generality.

Display-layer bugs survive remediation more often than calculation-layer bugs. Almost every quiet failure I have read about is in the presentation of dates rather than the calculation of them. The remediation effort focused on calculations; the display layer is harder to test exhaustively because its output is read by human eyes rather than asserted by machines. I should pay more attention to the display layer of my own services.

Pressure-driven remediation produces side effects. The third-place-not-fixed pattern is exactly the shape of the Y2K-driven regressions, security ones included, that I predicted in November. Code modified under deadline, with reduced testing and multiple parallel fixes, is going to produce at least some bugs that emerge in the next few months. The attentive operator should be ready for advisories from this category.

The visible failures undersell the value of the work. The successful remediation produced few headlines. The unsuccessful — which is to say, the unremediated bits and pieces — are getting coverage. This is a structural injustice; the engineering culture that delivered most of the work deserves better acknowledgement than it is getting. From the inside of any of those teams, the absence of catastrophe is precisely the proof that the work was effective.

What I expect over the next quarter

A short list of categories I am paying attention to:

  • More display-layer bugs surfacing as people generate quarterly reports.
  • Security advisories from rushed remediation patches; I expect the first of these by April.
  • Operational nuisances from embedded systems that partly survived. These are a long-tail problem and will keep trickling in for months.
  • Recovery work — rebuilding the date logic in systems that limped through but are now visibly wonky.

The story is going to recede from the news, but the operational consequences will continue. For my own work, I am keeping one eye on advisories and one eye on my own infrastructure, and otherwise treating the topic as closed.

More on the new year's actual events as they happen. The year is off to a quiet start, which is exactly the right way to start a year that I expect to be busy.

