Reasoning models: what o1 changes for SOC work

Post 7 of the AI series. OpenAI's o1 launched in September with a different model shape — *think longer, reason step by step*. What this means for the SOC, where the gains are real, and where EmilyAI's purpose-specific architecture continues to win.

OpenAI announced o1 in September and shipped o1-preview to API customers shortly afterwards. The model is qualitatively different from the GPT family that preceded it. Rather than begin answering straight from the prompt, o1 thinks before responding: it generates an internal chain of reasoning that is hidden from the user, takes seconds to minutes per query, and produces answers that are materially better at problems requiring multi-step reasoning. Google's reasoning models and Anthropic's extended-thinking variants followed in the months after.

This is the most significant model-shape change since GPT-3.5. The next sixteen months of AI conversation will be heavily about reasoning models. This post is about what they change for SOC work specifically, and how the comparison to EmilyAI's purpose-specific architecture looks now.

What reasoning models are, in one paragraph

A reasoning model is, in essence, a large language model that has been trained to produce a reasoning trace before its final answer. The trace can be long: thousands of tokens, taking tens of seconds to produce. It is not normally shown to the user; it is the model working things out. The final answer is conditioned on the trace. For tasks that benefit from step-by-step working (mathematics, coding, structured analysis) the improvement over conventional LLMs is large. For tasks that do not (chatty conversational responses, simple lookup) there is no meaningful benefit, and the latency penalty still applies.

Where this helps in security

Three categories of SOC work where reasoning models produce a real gain.

Complex alert correlation. Given a sprawl of related-but-not-clearly-connected events across multiple systems, the question "what story do these events tell?" now gets materially better reasoning than the GPT-4 generation produced. Multi-step correlation ("if this is true then look here, if that is true look there, the combination suggests this") was where GPT-4 was visibly thin. o1 is visibly less thin.

Threat-hunting hypothesis generation. A threat hunter starts with a hunch ("we may have lateral movement happening through service accounts in finance") and needs a hypothesis tree, an inventory of indicators to check, a prioritised list of queries to run, and an interpretation framework for the results. This is multi-step structured reasoning of a kind GPT-4 did badly and o1 does better.

Incident reconstruction. Here is a sequence of events. Here are the team's notes from the response. Here is what the attackers ended up doing. Reconstruct the most likely timeline. Multi-step inference from partial information is exactly the shape of work reasoning models help with.

Where this does not help

Three categories of work where the reasoning model offers little or no advantage.

High-volume routine triage. The 50,000 alerts per day case from the copilot wave post. Reasoning models are too slow and too expensive to apply to every alert. The latency is wrong; the cost is wrong. The pre-triage funnel that EmilyAI runs — deterministic suppression, then a quantised distilled model at single-digit-millisecond latency — is the right shape for this volume and reasoning models are not.
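
A rough back-of-envelope calculation makes the latency point concrete. The sketch below assumes 30 seconds per reasoning-model query (a hypothetical figure inside the "tens of seconds" range mentioned earlier) and 5 milliseconds per pre-triage inference (a hypothetical single-digit-millisecond figure). Neither is a measured number, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serial_compute_days(alerts_per_day: int, seconds_per_alert: float) -> float:
    """Days of serial compute needed to process one day's worth of alerts."""
    return alerts_per_day * seconds_per_alert / SECONDS_PER_DAY


ALERTS_PER_DAY = 50_000        # volume quoted in this post
REASONING_SECONDS = 30.0       # assumption: "tens of seconds" per query
PRE_TRIAGE_SECONDS = 0.005     # assumption: single-digit milliseconds

reasoning_days = serial_compute_days(ALERTS_PER_DAY, REASONING_SECONDS)
pre_triage_days = serial_compute_days(ALERTS_PER_DAY, PRE_TRIAGE_SECONDS)

print(f"reasoning model: {reasoning_days:.1f} days of serial compute per day of alerts")
print(f"pre-triage model: {pre_triage_days * SECONDS_PER_DAY:.0f} seconds of serial compute per day of alerts")
```

Under these assumptions the reasoning model would need roughly 17 to 18 queries in flight at all times just to keep pace with the alert stream, before cost enters the picture. The pre-triage model finishes the same day's volume in a few minutes of serial compute.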

Deterministic verdict generation. A reasoning model is still non-deterministic at the output layer. The model's reasoning trace varies. The final answer varies subtly. The reproducibility problem from post 2 of this series is unchanged by the reasoning capability.
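
One way to make the reproducibility concern measurable is to run the same alert through a classifier several times and record how often the most common verdict comes back. The sketch below is illustrative only: `deterministic_verdict` and `noisy_verdict` are hypothetical stand-ins, not EmilyAI components or real model calls, and the verdict labels are invented.

```python
import random
from collections import Counter
from typing import Callable


def agreement_rate(classify: Callable[[str], str], alert: str, runs: int = 20) -> float:
    """Fraction of runs that return the most common verdict for the same alert."""
    verdicts = Counter(classify(alert) for _ in range(runs))
    return verdicts.most_common(1)[0][1] / runs


def deterministic_verdict(alert: str) -> str:
    # Stand-in for a fixed classifier: the same input always gives the same output.
    return "escalate" if "service-account" in alert else "suppress"


_rng = random.Random(7)


def noisy_verdict(alert: str) -> str:
    # Hypothetical stand-in for a sampled model whose answer varies between runs.
    return _rng.choice(["escalate", "escalate", "escalate", "investigate"])


alert = "service-account login from unusual host"
print("deterministic:", agreement_rate(deterministic_verdict, alert))
print("sampled:", agreement_rate(noisy_verdict, alert))
```

A deterministic classifier scores 1.0 by construction. Anything below that for a verdict-making component means the same evidence can produce different case outcomes on different days.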

Continuous learning from feedback. The reasoning model is, in normal commercial use, a frozen artefact from the provider. The feedback loop where a closed case becomes labelled training data does not apply.

What changes about the funnel

A specific observation about EmilyAI's architecture in light of reasoning models.

The three-stage funnel — deterministic suppression, CPU pre-triage, GPU analyst inference — is not a reasoning-model design. It is a high-throughput classifier design. Reasoning models would be the wrong tool for the work the funnel does.

But the funnel can usefully call out to a reasoning model for the small subset of escalated cases where multi-step reasoning helps. When EmilyAI escalates a case to a human analyst, the analyst can — and increasingly does — use a reasoning model as part of their case-handling tooling. The reasoning model is not replacing EmilyAI; it is augmenting the human analysts who handle the cases EmilyAI escalates.
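
The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EmilyAI's implementation: the rule names, thresholds, and the idea of carrying precomputed scores on the alert are all hypothetical, chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUPPRESSED = "suppressed"
    CLOSED = "closed"
    ESCALATED = "escalated"


@dataclass
class Alert:
    rule_id: str
    pre_triage_score: float   # hypothetical CPU pre-triage score in [0, 1]
    analyst_score: float      # hypothetical GPU analyst-inference score in [0, 1]


SUPPRESSED_RULES = {"known-benign-scanner"}   # illustrative suppression list
PRE_TRIAGE_THRESHOLD = 0.2                    # assumption
ESCALATION_THRESHOLD = 0.7                    # assumption


def run_funnel(alert: Alert) -> Outcome:
    # Stage 1: deterministic suppression.
    if alert.rule_id in SUPPRESSED_RULES:
        return Outcome.SUPPRESSED
    # Stage 2: cheap pre-triage closes obviously low-risk alerts.
    if alert.pre_triage_score < PRE_TRIAGE_THRESHOLD:
        return Outcome.CLOSED
    # Stage 3: analyst inference decides whether a human sees the case.
    if alert.analyst_score < ESCALATION_THRESHOLD:
        return Outcome.CLOSED
    return Outcome.ESCALATED


def reasoning_model_eligible(outcome: Outcome) -> bool:
    # Only escalated cases reach a human, so only they justify the slow model.
    return outcome is Outcome.ESCALATED


alert = Alert(rule_id="lateral-movement", pre_triage_score=0.6, analyst_score=0.8)
outcome = run_funnel(alert)
print(outcome.value, reasoning_model_eligible(outcome))
```

The design point is that the reasoning model never sits on the hot path. It is only reachable from the small set of cases that have already cleared all three stages and landed with a human.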

This is the same architectural principle as the open-source LLM augmentation pattern from post 5. EmilyAI's verdict-making tier is the deterministic core; the LLM (reasoning or otherwise) is the augmentation tool the supervising human reaches for. The shape is durable across model generations.

A concrete example

To make this less abstract, consider an alert chain from a customer where EmilyAI produced a verdict of "medium-priority, possible lateral movement via compromised service account". The verdict is correct as far as it goes; the human analyst takes the case.

The analyst's question is: "what is the most likely full attack narrative here, and what should I be checking next?" This is a reasoning-model question. The analyst feeds the relevant context (the alert chain, EmilyAI's structured verdict, the customer's network topology, the service account's normal pattern of use) into a reasoning model and asks for a hypothesis tree.
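
As a sketch of what assembling that context might look like, the snippet below bundles the four inputs into a single prompt. The `CaseContext` type, its field names, and the example values are hypothetical and not taken from a real case or from EmilyAI's output format. No model is called; the resulting string would be handed to whichever reasoning model the analyst uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseContext:
    alert_chain: List[str]
    structured_verdict: str
    network_topology: str
    account_baseline: str


def build_hypothesis_prompt(ctx: CaseContext) -> str:
    """Assemble the case context into one prompt asking for a hypothesis tree."""
    events = "\n".join(
        f"{i}. {event}" for i, event in enumerate(ctx.alert_chain, start=1)
    )
    return (
        "Alert chain:\n"
        f"{events}\n\n"
        f"Structured verdict: {ctx.structured_verdict}\n"
        f"Network topology: {ctx.network_topology}\n"
        f"Service account baseline: {ctx.account_baseline}\n\n"
        "What is the most likely full attack narrative here, and what should "
        "I be checking next? Answer as a prioritised hypothesis tree."
    )


# Illustrative values only.
example = CaseContext(
    alert_chain=[
        "Service account authenticates from a workstation it has never used",
        "Same account enumerates file shares on three finance servers",
    ],
    structured_verdict="medium-priority, possible lateral movement via compromised service account",
    network_topology="finance servers sit in a separate VLAN behind an internal firewall",
    account_baseline="nightly batch jobs from a single application server",
)

prompt = build_hypothesis_prompt(example)
print(prompt)
```

Keeping the verdict as an explicit field in the prompt preserves the division of labour: the reasoning model is asked what to investigate next, not to re-decide the verdict.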

The reasoning model produces a structured answer that prioritises three lines of investigation. Two are obvious; one is non-obvious and turns out to be the right thread to pull. The analyst spends thirty minutes confirming the hypothesis instead of the two hours they would have spent without the reasoning model's help. The verdict stays with EmilyAI; the investigation efficiency comes from the reasoning model.

This is, in my view, the right shape of integration between purpose-specific cyber AI and general-purpose reasoning AI. The two layers do different jobs. Neither tries to do the other's.

What the vendor copilots are doing

The major copilot vendors have been integrating reasoning models into their products through the autumn. Microsoft Security Copilot has access to the OpenAI reasoning models. The model underlying CrowdStrike's Charlotte AI has improved, as has the one behind SentinelOne's Purple AI.

The improvement is real, in the augmentation layer. The structural points from post 3 remain — these tools augment analysts rather than do tier-two triage themselves. The reasoning capability makes the augmentation more useful, particularly for the threat-hunting and incident-reconstruction tasks. It does not move the structural framing.

What I think happens next

Three short-term predictions.

Reasoning models will become the default model class for cyber AI augmentation within a year. The latency and cost will drop. The reasoning quality will continue to improve. By then, the question will be "am I getting reasoning-quality output?" rather than "am I getting an LLM answer?"

A separate class of inference-optimised small reasoning models will emerge. Open-source community pressure, plus the obvious need for cheaper inference, will produce small reasoning models that do most of the useful work at a fraction of the cost. The 7B-parameter distilled reasoning model is the shape to expect.

The purpose-specific cyber AI tier will not be displaced. Reasoning models are augmentation tools. The verdict-making tier — what EmilyAI does — will continue to be a different category. The two will coexist, with the boundary between them clear in well-designed systems and blurred in poorly-designed ones.

What is next

In seven weeks: the agentic AI question. The agent demos through 2024 have been remarkable. The agent in production cyber operations is, so far, not the same thing. Why the gap, what closes it, and what shape the agent that actually ships will take.