Forensic readiness — what I learned from a friend's incident

A friend at a small UK consultancy had a serious compromise this month. They asked me to help with the response, partly because they did not have anyone in-house with serious experience and partly because they were quietly worried that the in-house response might not be impartial.

I have spent a long fortnight working through it with them. The investigation is mostly finished now. The findings are sanitised below — names removed, specific systems generalised — because the lessons are bigger than the specific incident.

The shape of what happened

A web-facing application, custom-developed, was compromised through a path-traversal vulnerability. The attacker — based on later analysis, a moderately skilled operator with some patience — used the path traversal to read configuration files, found database credentials, used them to extract a customer table, and exfiltrated it.

The consequence: about 12,000 customer records exposed, including names, email addresses, and partial billing information. No financial details (those were stored in a different system with proper segregation). The compromise was discovered when a customer complained about phishing emails that appeared to draw on the exposed data.

The ICO will have to be notified. The legal advice is that the consequences are manageable but not nothing — possible fines, certainly customer trust damage, possibly some commercial losses.

What I learned trying to investigate

The technical work was straightforward. The harder problem was the evidence. Three categories of issue, each of which I had not appreciated before this incident.

Lesson one: log retention was inadequate

The initial discovery was on the 8th of April. The evidence trail showed the attack had started no later than the 24th of March. By the time we tried to reconstruct the early phase, the relevant Apache logs had been rotated and pruned according to the default policy — kept for two weeks, then deleted.

We had only fragmentary evidence of how the compromise actually started. The exfiltration itself was reconstructable; the initial reconnaissance, the failed attempts that came before the successful one, the IP rotation that the attacker used — all of it was lost.

This is the retention discipline I have written about but had not internalised. Two weeks is not enough; a serious investigation needs months. I had been telling myself this for two years without actually applying it to my own infrastructure to the standard the discipline implies.

Action for me: I have audited my own retention. Most of my services are at six months on the production host with offsite copies retained for a year. Some were still at two weeks because I had not got around to extending them; those are now at six months.
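
For anyone wanting to run the same audit, a minimal sketch of the check is below. The directory list is a placeholder for my own layout, and the six-month threshold is the one I settled on rather than anything official.

    #!/usr/bin/env python3
    """Rough retention audit: report the oldest file under each log directory
    and flag anything kept for less than six months. The directory list is
    illustrative; substitute your own services."""

    import time
    from pathlib import Path

    # Placeholder log locations; adjust to match your own services.
    LOG_DIRS = [
        Path("/var/log/apache2"),
        Path("/var/log/nginx"),
        Path("/srv/myapp/logs"),
    ]

    SIX_MONTHS = 180 * 24 * 3600  # rough threshold, in seconds

    now = time.time()
    for d in LOG_DIRS:
        files = [p for p in d.rglob("*") if p.is_file()] if d.is_dir() else []
        if not files:
            print(f"{d}: missing or empty")
            continue
        oldest = min(p.stat().st_mtime for p in files)
        age_days = (now - oldest) / 86400
        status = "ok" if (now - oldest) >= SIX_MONTHS else "TOO SHORT"
        print(f"{d}: oldest file is {age_days:.0f} days old ({status})")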

Lesson two: integrity of the evidence was unclear

The compromised host had been in active use for two weeks after the compromise without anyone realising. By the time we started the investigation, the host had been rebooted, patched several times, and used for normal production work. The crime scene, in the forensic sense, was already substantially modified.

Worse, several pieces of evidence we found had inconsistent timestamps that could be interpreted multiple ways. Was a particular file modified during the attack, or during the post-attack normal use, or during the investigation itself? Without a clean record of who touched what when, the reconstruction had ambiguities we could not resolve.

Action for me: I have introduced a crime-scene discipline into my own incident-response procedure. The first thing on suspicion of compromise is to snapshot the host — take a memory image, take a disk image, lock down further changes — and then investigate from the snapshot, not from the live system. The snapshot tooling I am using is dd to a network destination for the disk and gcore plus saved process listings for memory. Crude but workable.
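
For what it is worth, a sketch of how that tooling hangs together is below. The device name, collection host, and process IDs are placeholders; in practice this lives as a written checklist rather than a script, and each step is run by hand.

    #!/usr/bin/env python3
    """Outline of the snapshot steps: volatile state first (process listings,
    network connections), then memory of the suspect processes with gcore,
    then a disk image streamed to a separate host with dd. The device,
    destination, and PIDs below are placeholders."""

    import subprocess
    from datetime import datetime, timezone

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = "evidence@collector.example"  # hypothetical collection host

    def capture(cmd: str, outfile: str) -> None:
        """Run a command and save its output to a file for the record."""
        with open(outfile, "w") as fh:
            subprocess.run(cmd, shell=True, stdout=fh, stderr=subprocess.STDOUT)

    # 1. Volatile state first; it disappears on reboot.
    capture("ps auxww", f"ps-{stamp}.txt")
    capture("ss -tanp", f"ss-{stamp}.txt")
    capture("who -a", f"who-{stamp}.txt")

    # 2. Memory of the processes under suspicion (PIDs identified by hand).
    for pid in [1234]:  # placeholder PID
        subprocess.run(["gcore", "-o", f"core-{stamp}", str(pid)])

    # 3. Disk image streamed off-host, never written to the compromised disk.
    subprocess.run(
        f"dd if=/dev/sda bs=4M status=progress | ssh {dest} 'cat > disk-{stamp}.img'",
        shell=True,
    )

The point is not the script itself; it is that the commands, the destination, and the order are decided in advance, so nobody has to work them out during the incident.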

This discipline carries a substantial operational cost. It means that any incident which might be a compromise gets a snapshot taken before normal operation resumes. The cost is downtime; the benefit is evidence that can actually be investigated.

I have also written down the procedure and committed to following it. The discipline is worthless if it is forgotten in the moment of crisis.

Lesson three: the right people were not in the room

My friend's organisation handled the response with their developer team. The developers were good engineers but they had no incident-response training; they did not know what to preserve, what to touch, what to do first.

More importantly, the legal person who would eventually need to make decisions about ICO notification and customer communication was not involved until day five. By the time they were brought in, they were trying to make decisions based on a partial picture, with critical evidence already lost or contaminated.

Action for me, indirectly: I have spent some time helping my friend's organisation write a first-30-minutes checklist for incident response. The checklist names roles, not people — "the technical lead", "the legal contact", "the communications lead" — and the first action item is to identify who fills each role in the current incident.

For my own modest infrastructure, the equivalent is much simpler: an incident is between me and whichever friend's box is involved. The principles are the same: the roles need to be filled before anyone starts taking technical action.

The bigger lesson, which is uncomfortable

The biggest thing this incident has taught me is that I had been writing about defence as if the only question was preventing compromise. It is not. Compromise happens; the question of what you can recover from a compromise is at least as important as the question of whether you are vulnerable.

I have been writing extensively about prevention: patching, hardening, monitoring. I have been writing less about forensic readiness, the discipline of being able to investigate, prove, and recover from an incident after it happens.

The two are different. Prevention is reducing the probability of compromise. Readiness is reducing the cost of compromise when it happens. A serious security programme needs both.

What forensic readiness actually means

A short list, drawn from this fortnight:

Long-retention logs, preferably off-host. Six months minimum. A year is better.

Structured logs that can be queried after the fact. The structured-logging discipline I started a year ago has turned out to matter far more than I expected. Unstructured logs are hard to investigate; structured logs are tractable. The sketch after this list shows the kind of query this makes possible.

Time synchronisation across all hosts, with NTP, monitored for drift. If your timestamps do not match across hosts, you cannot correlate events.

A snapshotting procedure for compromised hosts, with the tooling pre-staged. The middle of an incident is the wrong time to invent one.

A documented incident-response procedure, with named roles and named steps. The procedure does not have to be elaborate; it has to exist and be followed.

Practice. The first time you run the procedure should be on a drill, not on a real incident. I have not been doing drills. I am going to start.

Off-site backup of forensic-relevant data. Logs, configurations, system state. If the original host is compromised, the off-site copy is the trustworthy reference.
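
To make the structured-logging point concrete, here is a minimal sketch of the kind of after-the-fact query this fortnight kept demanding: every request from a suspect address within a time window, pulled from JSON-lines logs. The field names, path, address, and dates are illustrative; they follow my own logging format rather than any standard.

    #!/usr/bin/env python3
    """Pull every log entry from a suspect source address within a window.
    Assumes JSON-lines logs with an ISO 8601 'ts' field plus 'src_ip',
    'status', and 'path'; the field names and locations are placeholders
    from my own format, not a standard."""

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    SUSPECT = "203.0.113.50"                 # placeholder, documentation range
    START = datetime(2024, 3, 24, tzinfo=timezone.utc)   # placeholder window
    END = datetime(2024, 4, 8, tzinfo=timezone.utc)
    LOG_DIR = Path("/srv/logs/webapp")       # placeholder location

    for logfile in sorted(LOG_DIR.glob("*.jsonl")):
        with open(logfile) as fh:
            for line in fh:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or corrupted lines
                ts = datetime.fromisoformat(entry["ts"].replace("Z", "+00:00"))
                if entry.get("src_ip") == SUSPECT and START <= ts <= END:
                    print(ts.isoformat(), entry.get("status"), entry.get("path"))

Against unstructured access logs the same question means regular expressions and guesswork about formats; against structured logs it is a few lines.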

A note on the human dimension

My friend has had a difficult few weeks. The technical aspects of the response are tractable; the organisational and personal aspects are harder. The conversations with customers, with regulators, with their own management are draining. The self-blame about "I should have caught this earlier" is significant.

For anyone reading this who is in or has been in a similar position: the response is not about the speed of the technical work. It is about doing it methodically, with the right people involved, with the evidence preserved. The technical clean-up is the smallest part of the eventual answer.

For my own work, I am going to write about forensic readiness more, prevention slightly less. The balance has been wrong. Time to fix it.

