Two weeks out from the Y2K rollover. My infrastructure has been audited, patched where needed, and tested. The plans for the night are written. This post is partly an exercise in writing those plans down where I can find them again, and partly a record of what I am actually still slightly worried about.
The systems I am responsible for
A short inventory:
- One Slackware 4.0 box at home running mail, DNS for a small zone, web hosting for friends, and a small honeypot.
- One OpenBSD 2.5 box acting as the home firewall.
- A small mail relay run on a friend's hardware in another city, accessed by SSH.
- The home network's DSL modem (which is itself a small embedded computer) and the Cisco router behind it.
Nothing critical depends on any of these. The mail relay is the only piece anyone other than me would notice if it failed.
What I have done
A short audit, by item:
The Slackware box. All daemons are recent versions. The kernel is 2.2.13. BIND is 8.2.2-P5. Sendmail is 8.9.3. Apache is 1.3.9. None of these have known Y2K issues; all have been tested by setting the system clock to 2000-01-01 in a chroot and confirming normal operation.
The one minor finding: a Perl script I wrote for log analysis used a two-digit year representation and would have produced sort-order issues in 2000. Trivially fixed by switching to ISO-8601 dates. The fix has been in place for two months without trouble.
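The failure mode is easy to reproduce with sort(1). The dates below are invented for illustration; the point is that two-digit years sort lexically into the wrong chronological order the moment 2000 arrives, while ISO-8601 dates sort correctly by construction:

```shell
#!/bin/sh
# Two-digit years: "00" (meaning 2000) sorts before "99" (meaning 1999).
printf '99-12-31 last entry of 1999\n00-01-01 first entry of 2000\n' | sort
# -> the 00-01-01 line comes first: chronologically wrong.

# ISO-8601 dates sort lexically and chronologically at the same time.
printf '1999-12-31 last entry of 1999\n2000-01-01 first entry of 2000\n' | sort
# -> the 1999-12-31 line comes first: correct.
```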
The OpenBSD box. OpenBSD 2.5 is unaffected by Y2K. The audit was a five-minute exercise.
The remote relay. Same software stack as my home box, so the same audit applied. Friend confirmed the date-rollover test from his end.
The DSL modem. Manufacturer's official statement says it is unaffected, with one specific exception around the line-quality reporting feature, which records dates in a two-digit format that will roll over confusingly. The reporting will look strange but the modem will continue to work. I have noted this and moved on.
The Cisco router. Cisco has published a detailed Y2K compatibility list. My specific model is on the "unaffected" list.
What I am doing on the night
I will be at the keyboard from about 22:00 GMT on the 31st through 02:00 GMT on the 1st. The plan is:
22:00: Take baseline measurements. Disk usage, CPU, network throughput, all daemon process listings, the sizes of the files in /var/log, and the running configuration. Save everything to a snapshot directory.
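A sketch of what that snapshot script might look like, using only portable commands. The directory name and the exact measurement set are my own choices, not a fixed recipe; network throughput and the running configuration need box-specific commands, so they are omitted here:

```shell
#!/bin/sh
# Capture a labelled baseline snapshot before the rollover.
SNAPDIR="${SNAPDIR:-/tmp/y2k-baseline}"   # hypothetical location
mkdir -p "$SNAPDIR"

date           > "$SNAPDIR/date.txt"      # wall clock at capture time
df -k          > "$SNAPDIR/disk.txt"      # disk usage per filesystem
ls -l /var/log > "$SNAPDIR/logsizes.txt"  # log file sizes, for later diffing
ps ax > "$SNAPDIR/procs.txt" 2>/dev/null \
  || ps -ef > "$SNAPDIR/procs.txt"        # daemon listing (BSD or SysV ps)

echo "baseline saved to $SNAPDIR"
```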
23:30: Confirm the baseline is unchanged from what I expect. Verify all daemons are running. Verify network connectivity to a few external sites I am using as canaries (a couple of friends' machines, a major news site, a DNS root server).
23:50: Stop logging into anything I do not strictly need to. Open a tcpdump on the firewall capturing everything. Open a tail on the system log. Have a fresh terminal ready for ad-hoc investigation.
00:00 GMT: Watch.
00:01: Run a quick health check. Same measurements as the baseline, compare. If everything is normal, relax slightly.
00:05: Send a test email to myself across the relay to confirm SMTP is working end to end. Resolve a few DNS names to confirm the resolver is working. Connect to my own web server to confirm Apache is responding. Run a quick ipchains rule check.
00:10 to 02:00: Continue watching. Follow Bugtraq for any incoming news. If anything looks wrong, investigate before the trail cools.
02:00: Close out. Save the night's logs to a separate directory. Go to bed.
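The 00:01 step — take the same measurements again and compare against the baseline — can be sketched as a small diff-driven script. The directory names are hypothetical, only one measurement (disk usage) is shown, and the live SMTP, DNS, and HTTP checks are omitted because they depend on real services:

```shell
#!/bin/sh
# Re-measure and compare against the saved baseline.
SNAPDIR="${SNAPDIR:-/tmp/y2k-baseline}"   # where the 22:00 baseline lives
NOWDIR="${NOWDIR:-/tmp/y2k-now}"          # fresh measurements
mkdir -p "$SNAPDIR" "$NOWDIR"

df -k > "$NOWDIR/disk.txt"                # same measurement as the baseline

# If no baseline exists yet, seed one so the first run is a no-op.
[ -f "$SNAPDIR/disk.txt" ] || cp "$NOWDIR/disk.txt" "$SNAPDIR/disk.txt"

if diff "$SNAPDIR/disk.txt" "$NOWDIR/disk.txt" > /dev/null; then
    echo "disk: unchanged from baseline"
else
    echo "disk: CHANGED since baseline - investigate"
fi
```

diff's exit status (zero when the files are identical) drives the branch, so the same pattern extends naturally to the process listing and the log sizes.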
This is more attention than the rollover probably warrants. It is also cheap, in the sense that I would have been at the keyboard anyway, and the discipline of having a written plan for an unusual operational event is useful in itself.
What I am still slightly worried about
Four items, in increasing order of how much they nag at me.
Embedded systems I do not control. The DSL modem is a known quantity. The lift in my building is not. The card-access system at work is not. The traffic lights on my street are not. None of these are my problem in any sense, and most are probably fine, but the failure modes are not in my mental model. If any of them fail badly, the consequences will be felt locally, and there is essentially nothing I can do about it.
Power. The grid in the UK has been preparing for this since at least 1997. The chance of a power failure that night is, on the available evidence, no higher than any other winter night. But "no higher than usual" is not zero, and a power failure during the rollover would mean the systems come back up in a different date state than they went down in, with all the interesting failure modes that implies. I have a UPS on the firewall and the main host. The mail relay does not have a UPS and would simply restart from disk after any power blip.
Network connectivity. My ISP, like everyone else's, has been preparing. The chance of something going wrong with the upstream is, again, low. But if something does go wrong, my local infrastructure will be intact and isolated, and I will have no way to communicate with anyone. Having a phone number for the ISP's NOC is the minimum, and I have that. Having a backup form of connectivity (a dial-up modem to a different provider) would be more, and I do not have that.
Software I have not anticipated. The mental model of my infrastructure is roughly right. There may be one or two small things — a script I wrote three years ago and forgot, a cron job that runs once a year and now will not, a piece of state somewhere that gets corrupted by a non-obvious date interaction. The scope of the surprise is, almost by definition, things I cannot enumerate. The compensating control is the recent backups, which would let me restore to a known-good state from before any rollover incident.
What I expect to see
In rough probability order:
- Nothing visible. The systems run through midnight and continue normally. About 80% likely.
- A small inconvenience. A logfile with a malformed date entry. A timestamp that displays oddly somewhere. A script that produces a confusing report. About 15%.
- A real problem with a known cause. Something I had not patched, behaving badly, that I can identify and fix on the night. About 4%.
- A real problem with an unknown cause. Something behaving badly that I cannot quickly diagnose, that requires hours or days to work out. About 1%.
The distribution is, in part, a calibrated guess based on what I have read about other operators' preparations. The 1% probability of the genuinely unknown is what motivates the entire night's vigil.
A small philosophical note
Y2K is, I think, the most thoroughly-prepared-for technical event in computing history. The discipline of going through every single application, every database column, every file format, every API, every configuration option — and asking "does this break in 2000?" — is unusual in computing. We do not normally have the luxury or the pressure to do this systematically.
The discipline, in itself, has been good for the field. Codebases that had not been audited in years have been audited. Software that had been running on autopilot has been examined. The infrastructure of the world is, in some respects, better understood now than it was three years ago.
This is the part of Y2K that I think is going to be the durable legacy. The midnight rollover will be uneventful. The two-year discipline of preparing for it will have shifted the field's relationship with its own legacy software. We will be slightly more careful about long-lived systems, slightly more rigorous about state assumptions, slightly more aware that the unglamorous code we wrote in 1985 is still running in 2000. The change is small. The direction is right.
See everyone in the year 2000.