Two months out from the Y2K rollover. The press coverage is in full apocalypse mode. The IT departments I know are, by contrast, quietly working through punch lists with the calm of people who have been at this for two years.
Neither side is, in my experience, telling the most interesting story about what is actually fragile. The fragility is real, but it is not in the places either the apocalypse narrative or the calm reassurance suggests.
This post is about what I have been seeing.
The apocalypse narrative
Aircraft will fall from the sky. Power grids will collapse. Banks will lose track of accounts. Civilisation will pause for a few days while everyone reboots.
Most of this is wrong. Aircraft autopilots and air-traffic systems have been audited extensively; the genuine remaining issues are bounded. Power grids have been working through their punch lists since 1997. Banks have, by some measure, the most robust Y2K preparation of any sector — the financial industry simply could not afford to fail.
The spectacular failures the press anticipates are mostly not going to happen. Some failures will happen, but they will be at the margins rather than at the centres.
The reassurance narrative
Everything is fixed. The IT departments have done their work. December 31st will be uneventful. Stand down.
Most of this is also wrong. The IT departments have done their work on the systems they know about. The systems they know about are an inventory; the inventory is incomplete; the systems not in the inventory are not getting fixed.
The critical word is "known". "All known affected systems have been remediated." This is true. It is also a much smaller set than "all systems".
What is actually fragile
Four categories, in increasing order of how badly I expect them to bite.
Embedded systems in long-life equipment. Building controls, elevator firmware, industrial sensors, environmental monitoring. The general pattern: a piece of equipment with a microcontroller that was programmed in 1985 and has not been touched since. The microcontroller has internal date logic. When the date rolls over, the logic fails. The equipment may continue to operate, may operate badly, or may refuse to operate.
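For concreteness, here is the commonest shape of that date logic, reduced to a minimal C sketch — the generic pattern, not any particular firmware. struct tm counts years since 1900, so tm_year is 99 this year and 100 next year, and code that hard-codes the "19" prints correctly right up until it doesn't:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm *t = localtime(&now);
        char stamp[16];

        /* The 1985-vintage assumption: tm_year is two digits.
           Prints "1999-..." all through 1999, which is why the bug
           survived.  In 2000, tm_year is 100: "19100-...". */
        sprintf(stamp, "19%d-%02d-%02d",
                t->tm_year, t->tm_mon + 1, t->tm_mday);
        printf("broken: %s\n", stamp);

        /* The fix: treat tm_year as what it is, years since 1900. */
        sprintf(stamp, "%04d-%02d-%02d",
                t->tm_year + 1900, t->tm_mon + 1, t->tm_mday);
        printf("fixed:  %s\n", stamp);
        return 0;
    }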
Most of these are not safety-critical. Most of those that are safety-critical have been audited. The long tail is in the equipment that is operationally critical — that is, the equipment that runs the business — but is not safety-critical and is not on anyone's compliance list. Lift control panels. HVAC schedulers. Card-access systems. Industrial scales.
Niche software with active maintenance contracts. A bespoke program running on a small number of machines, with one contractor who has known about it for years. The contractor will have made the fix. They will have tested it. They will have applied it. Then they will retire, get hit by a bus, or simply not be reachable on January 1st when something goes wrong. The fix that was made in 1998 may have side effects that only manifest under year-2000 dates.
The pattern that worries me most here is the interaction between fixed components. Each component was fixed in isolation. The combination of fixes may behave differently than either fix alone in the new date space.
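A sketch of what I mean, with invented pivot values: two components, each "fixed" with a sliding window, each correct in isolation, disagreeing about the same record:

    #include <stdio.h>

    /* Component A, windowed in 1997: two-digit years below 50 are 20xx. */
    static int expand_year_a(int yy) { return yy < 50 ? 2000 + yy : 1900 + yy; }

    /* Component B, windowed in 1998 by a different contractor: pivot 30. */
    static int expand_year_b(int yy) { return yy < 30 ? 2000 + yy : 1900 + yy; }

    int main(void)
    {
        int yy = 40;   /* a record dated '40 crossing the interface */
        printf("A reads '%02d as %d\n", yy, expand_year_a(yy));  /* 2040 */
        printf("B reads '%02d as %d\n", yy, expand_year_b(yy));  /* 1940 */
        return 0;
    }

Neither fix is wrong. The pair is.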
Database date storage. Most databases store dates well. Some store dates badly — two-digit years, or dates as integers, or dates as strings. Most of those have been migrated. Some were missed. The miss often comes from a column that was a date once and is now used for something else, with a name like YR that nobody remembers the semantics of.
The failures here will be subtle: queries returning slightly wrong results because a date comparison went sideways, reports showing data from "the wrong" period, occasional records mysteriously missing from result sets.
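The mechanism is easy to show. A sketch, with an invented YYMMDD column, of why year-2000 rows simply vanish from a date-ranged query:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Two-digit years stored as YYMMDD strings. */
        const char *dec_1999 = "991231";
        const char *jan_2000 = "000101";

        /* Lexically, "000101" sorts before "991231", so a predicate
           along the lines of  WHERE yr > '990000'  silently excludes
           every year-2000 row.  No error, just missing records. */
        if (strcmp(jan_2000, dec_1999) < 0)
            printf("'%s' sorts before '%s'\n", jan_2000, dec_1999);
        return 0;
    }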
Security implications of remediation. This is the one I have been writing about all year, and I think it is the most important one.
The remediation work has been done under time pressure. Source code that nobody had touched in years has been reopened, modified, recompiled, redeployed. The modifications have been peer-reviewed less than they should have been. The deployments have been tested less than they should have been.
This is exactly the environment in which security regressions creep in. A date field that used to be 12 bytes is now 16 to carry a four-digit year; one of the buffers it gets copied into was never widened to match; the four bytes that no longer fit are an overflow, and the overflow is a vulnerability.
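Reduced to a C sketch — the structure and names here are invented, but the shape is one I have seen:

    #include <string.h>

    struct ledger_entry {
        char posted[12];    /* never widened during the remediation */
        char flags[4];      /* whatever happens to sit after it */
    };

    /* Before remediation, callers passed a 12-byte date field.
       After it, the upstream producer sends 16 bytes; the four
       extra bytes land in flags.  Anyone who controls the date
       string now controls flags. */
    void set_posted(struct ledger_entry *e, const char *date, size_t len)
    {
        memcpy(e->posted, date, len);   /* len is now 16, not 12 */
    }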
A validation routine that used to check len(date) == 8 (for MM/DD/YY) now checks len(date) == 10 (for MM/DD/YYYY); somewhere there is still a code path that produces an 8-character date and now fails the validation; the expedient fix is to skip the validation entirely, which removes a check that was protecting against something else.
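In C-flavoured sketch form, with invented names:

    #include <string.h>

    static int looks_like_date(const char *s)
    {
        return strlen(s) == 10;     /* was == 8 before remediation */
    }

    void handle(const char *s)
    {
        /* The legacy path still emits 8-character dates, so this check
           started failing in testing.  The expedient "fix": comment the
           rejection out.  Now a string of any length, from any source,
           flows into the date-handling code the old check used to guard. */
        if (!looks_like_date(s)) {
            /* return;  -- commented out during Y2K testing */
        }
        /* ... parse and store s ... */
    }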
A module that used to receive dates only from a trusted internal source now receives them from a different source as part of the remediation; the new source is less trusted; the module's input validation is now insufficient.
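One more invented sketch:

    #include <stdio.h>

    /* Written when the only caller was a trusted in-house batch job
       that always supplied "YYYY-MM-DD".  The remediation rerouted a
       partner feed through the same path. */
    void note_year(const char *date_field)
    {
        char year[8];
        /* %s with no field width never mattered for the old source;
           fed from the new, less trusted source it is a plain stack
           overflow.  "%7s" -- or real validation -- would bound it. */
        sscanf(date_field, "%s", year);
        printf("year: %s\n", year);
    }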
None of these are theoretical. I have seen variants of all three already in 1999, in software released by people I work with. The Y2K remediation has produced, by my rough count, about twice the rate of new vulnerability advisories I would have expected from those projects in a normal year.
What I am doing about it
For my own infrastructure, the work is essentially done. My systems run modern operating systems with proper date handling. The applications I depend on have been audited or replaced. I will be at the keyboard at midnight, watching, but I do not expect anything serious to happen.
For the post-rollover period — say the first three months of 2000 — I am keeping a sharper eye than usual on advisories. The class of vulnerability I expect to bloom is exactly the kind that comes from rushed remediation: subtle bugs in old code that has been recently modified.
For the friends and small organisations I help, the message is: we have done the work; we should still expect something to surprise us; have a written response procedure for unexpected failures; have a rollback plan for any system that has been recently changed.
The deeper lesson
Y2K is, taken as a whole, the largest deliberate change to deployed computing infrastructure in history. It has been carried out under unusual pressure, with unusual scope, in unusual time. The aggregate effect has been positive — many old systems are now better understood and more robust than they were two years ago.
The specific Y2K story will dominate the next eight weeks of news. The longer story — that pressure-driven changes to legacy systems are a security event in their own right — is one I expect to be writing about for years afterwards.
This is also the story that most operators are not yet thinking about. The IT department is celebrating having delivered the punch list on time. The security implications arrive in the new year and will not be visible until they begin to manifest as exploited bugs. By the time anyone makes the connection, the remediation teams will have moved on to other priorities.
Keep one eye on the legacy systems through 2000 and 2001. The advisories will not stop on January 1st; they will accelerate.