Running an incident: what actually matters when alarms turn into actions

I have led, sat in on, or cleaned up after a great many incidents over the last decade and a half. The cases are wildly different from one another, but the things that go wrong are weirdly consistent. This piece is about those patterns. It is not a methodology document — there are plenty of those, and most of them are fine — it is about the texture of running an incident in practice, the bits that do not appear on the diagram.

The first thirty minutes

The single most important decision in an incident is whether it is, in fact, an incident. Alerts trigger constantly; incidents are rarer. The first thirty minutes is almost entirely about that triage call. The temptation is to leap into containment because containment feels like progress. Resist it until you can describe, in one sentence, what you think is happening and why. If you cannot write that sentence, you are not yet ready to act, and acting too early will burn the evidence you need to refine the sentence.

It helps to have someone explicitly in the coordinator role from the moment a possible incident is declared. Not the most senior technical person — the most disciplined one. Their job is to keep a timeline, ensure that nothing is acted upon without being recorded, and prevent the room from collapsing into a parallel set of private DMs. Coordinators are not glamorous. The good ones are the difference between a clean response and a chaotic one.

Containment is not eradication

Containment buys you time. Eradication ends the problem. Conflating the two is one of the most common mistakes I see, particularly under executive pressure. "We have isolated the host" is not the same statement as "we have removed the foothold from the environment", and treating it as such is how organisations end up reinfected three days later, in front of the same dashboards, with less goodwill from the business.

The rule of thumb I follow: contain enough to stop the bleeding, then deliberately stop touching things until you have a confident picture. Confidence here is not certainty — you may never reach certainty — but it is enough understanding of the attacker's footprint that eradication will not leave loose ends. Most reinfections are loose ends.

The room

Incidents are run by people, and people get tired, frightened, and snippy. Anyone leading response should be paying attention to the room as much as to the laptops. The incident I remember most clearly was not the most technically interesting; it was the one where I noticed, three hours in, that the lead analyst had not eaten or slept and was making decisions I would not have agreed with two hours earlier. We rotated him out, took a fifteen-minute break, and the response improved markedly. There is no methodology document that tells you to do this. You learn it by doing it badly once.

The other room dynamic worth naming is the executive bridge. Executives are not the enemy; they are usually trying to do their job, which involves saying useful things to other executives, lawyers, and regulators. They need a steady stream of carefully worded updates from a single source. Without that, they will manufacture their own narrative from rumour, which is bad for everyone. Pick the person who will own the executive bridge, give them clear scripts, and protect the response team from being interrupted by it.

The post-incident review nobody wants to write

Every organisation produces a post-incident document. Most of them are useless. They are written for the audit committee, they are written defensively, and they are written to extract a list of products to buy. The useful version is the internal one: small, honest, and absolutely focused on what would have made the difference. It does not blame individuals — it never blames individuals — but it does name controls that did not work, runbooks that were missing, and assumptions that turned out to be wrong.

I keep a small private template for this. The headings are: "Timeline of what we knew"; "Timeline of what was actually happening"; "The gap, in plain English"; "What would have closed the gap by 30 minutes"; "What would have closed it by 24 hours"; and "Three things we will change in the next 90 days". That is it. Nothing about products, nothing about RACI charts. Three changes, ninety days, accountable owner. Run that loop after every incident, and your defensive posture compounds.

The unsexy long game

Good incident response capability is built outside incidents. Tabletop exercises, joint rehearsals with the platform and identity teams, friendly fights between blue and red, and the unfashionable practice of actually testing your backup restorations. None of this is photogenic. All of it is what makes the next real one go better than the last one.

If you are reading this in the calm bit between two incidents, that is the time. Build the muscle now. The next one is coming.

Related reading

If this piece was useful, the most directly adjacent material is elsewhere on the site: the skills page groups all ten companion articles by area of practice, and the experience page covers the engagements that shaped the practice.