Deterministic inference: the property the market is losing

Post 2 of the AI series. Same input, same output, every time: a property that used to be table stakes in production systems and that LLM-based security tooling has quietly let go of. Why it matters, and how EmilyAI is built to preserve it.

_Post 2 of the AI in cyber series._

A property that used to be table stakes in any production system has, over the past eighteen months, quietly become an optional extra. The property is determinism: the same input, fed through the same software, producing exactly the same output every time. For most of computing history this was so basic it was not worth naming. The LLM era has made it remarkable enough to need writing about.

This is the first comparative post in the series and it is about why deterministic inference matters in a SOC context, why most of the new AI security tooling does not have it, and how EmilyAI is built to preserve it.

What is happening in the wider market

The wave of "AI for security" products that arrived in 2023 (Microsoft Security Copilot, CrowdStrike Charlotte AI, SentinelOne's Purple AI, Google's Sec-PaLM, and a long list of smaller vendors) is built on or around large language models. The LLMs do useful work: summarising alerts, translating queries between natural language and SIEM syntax, writing first drafts of incident reports, exploring the surface area of a complex log dataset.

What they generally do not do, and in their current generation structurally cannot guarantee, is produce the same output twice for the same input. Even at temperature zero, where decoding is nominally greedy rather than sampled, the same prompt to a hosted LLM can produce subtly different responses across runs, because of numerical and batching effects in the serving stack and because the providers reserve the right to alter the underlying model without notice. The promise of "we ran this prompt last quarter and got verdict X" cannot be cashed out with confidence. The model that produced X may not exist any more.

For some categories of use, this is fine. "Help me explore this dataset" tolerates non-determinism. "Write a draft incident report from these notes" tolerates it. "Tell me what this PowerShell snippet does" tolerates it, though the tolerance is shallower.

For other categories, it is not fine. A verdict on whether a specific event was malicious, fed into a regulated firm's incident reporting workflow, needs to be reproducible. The audit committee may ask, six months later, "why was this event closed as benign?" The correct answer cannot be "because the model's probability distribution was sampled that way that day".

Why deterministic inference is hard in 2024

The default behaviour of most modern AI inference stacks is non-deterministic for two reasons.

The first is the decoding step in language models. Sampling at a non-zero temperature is random by design. Even with deterministic decoding strategies (greedy decoding, beam search with a fixed beam width), small numerical differences from floating-point non-associativity in GPU reduction trees can change which token wins a near-tie, and every later token then diverges. The reproducibility should be there in principle and frequently is not in practice.
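To make the non-associativity point concrete, here is a minimal sketch in plain Python. It is an illustration, not any vendor's inference code: the partial sums, the competing logit, and the function names are all hypothetical, chosen so that the effect is large enough to see. The same numbers added in two different orders give two different totals, and a greedy decoder picks a different token as a result.

```python
def sum_in_order(values, order):
    """Add values in the given index order, as one reduction schedule might."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Regrouping three additions is enough to change the last bit of the result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
print(left_grouped == right_grouped)  # False

# Hypothetical partial sums feeding one logit; only the order differs.
partials = [1e16, 1.0, -1e16, 1.0]
logit_a_run1 = sum_in_order(partials, [0, 1, 2, 3])  # 1.0: the first 1.0 is rounded away
logit_a_run2 = sum_in_order(partials, [0, 2, 1, 3])  # 2.0: both 1.0 terms survive

# Hypothetical competing token whose logit sits between the two results.
logit_b = 1.5


def greedy_pick(logit_a, logit_b):
    """Greedy decoding over two candidates: take the larger logit."""
    return "A" if logit_a > logit_b else "B"


print(greedy_pick(logit_a_run1, logit_b))  # B
print(greedy_pick(logit_a_run2, logit_b))  # A
```

Real logits differ by far less than this between runs, but the mechanism is the same: a near-tie resolved one way or the other by accumulation order.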

The second is the architecture of hosted AI services. The provider can change the underlying model at any time. A query to the gpt-4 alias in January 2024 is not guaranteed to be running against the same model snapshot as a query to gpt-4 in October 2023, even though the endpoint name is unchanged. The audit trail you collect about "which model produced this answer" is, in many cases, a model alias rather than a fixed identifier.

A small number of vendors are now offering version-pinned endpoints. Even there, the implementation details are opaque and the contractual reproducibility guarantees vary.

How EmilyAI handles it

The decision to make EmilyAI deterministic was made in 2018 for engineering reasons that turned out to be regulatory ones. Three properties combine to deliver it.

INT8 quantisation throughout. Both the CPU pre-triage classifier (a distilled model exported as INT8 ONNX, running against Intel's AMX backend) and the principal analyst model (compiled to TensorRT INT8 with calibration-driven quantisation and the engine builder's deterministic flag set) run at 8-bit integer precision. Integer arithmetic is associative; floating-point GPU reductions are not. The quantisation choice is therefore both a performance and a reproducibility choice.
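The associativity claim can be checked directly. The sketch below is a plain-Python illustration and not EmilyAI's code: it assumes INT8 operands accumulated into a 32-bit integer without overflow, and it covers only the accumulation step, not the rescaling a real quantised pipeline applies afterwards. The names `int8_dot`, `a`, `b`, and `orders` are hypothetical.

```python
import random

INT32_MAX = 2**31 - 1


def int8_dot(a, b, order):
    """Dot product of two INT8 vectors, accumulated in the given index order."""
    acc = 0
    for i in order:
        acc += a[i] * b[i]
    return acc


rng = random.Random(0)
n = 1024
a = [rng.randint(-128, 127) for _ in range(n)]
b = [rng.randint(-128, 127) for _ in range(n)]

# Assumption: the accumulator never overflows. The worst case here is
# n * 128 * 128, which fits comfortably in a signed 32-bit integer.
assert n * 128 * 128 <= INT32_MAX

# Five different accumulation orders, standing in for different reduction trees.
orders = [rng.sample(range(n), n) for _ in range(5)]
results = {int8_dot(a, b, order) for order in orders}
print(len(results))  # 1
```

Every ordering lands on the same integer, which is why the accumulation order chosen by a kernel cannot change the result in the integer case.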

Deterministic kernel selection. Reduction-tree algorithms that would otherwise admit floating-point non-associativity are pinned to deterministic kernels in the TensorRT build. The default is the fast path; the deterministic path is slower by a single-digit percentage and produces bit-identical output.

A model registry that pins specific binaries. A verdict produced by EmilyAI references a specific compiled model artefact with a specific cryptographic hash. "Which model produced this verdict" is auditable down to the binary. Models can be retrained, recompiled, and rolled out, and the rollout is itself a versioned event that anyone can later reconstruct.

The consequence: re-running any historical event through EmilyAI, against the model version that processed it originally, produces the exact same output as before. Bit for bit. The verdict from eighteen months ago can be reproduced today, on a different machine, and the audit trail will match.
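The registry-and-replay idea can be sketched in a few lines. This is a toy under stated assumptions, not EmilyAI's implementation: `ModelRegistry`, `toy_infer`, `replay`, the record fields, and the sample event are all hypothetical, and the "model" is a stand-in function that is deterministic by construction. What the sketch shows is the shape of the guarantee: a verdict names its artefact by SHA-256, and a replay loads exactly that artefact and compares outputs.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def toy_infer(artefact: bytes, event: dict) -> str:
    """Stand-in for real inference: a pure function of artefact bytes and event."""
    payload = artefact + json.dumps(event, sort_keys=True).encode()
    return "malicious" if hashlib.sha256(payload).digest()[0] % 2 else "benign"


class ModelRegistry:
    """Stores artefacts keyed by their own hash, so an identifier cannot drift."""

    def __init__(self):
        self._artefacts = {}

    def register(self, artefact: bytes) -> str:
        digest = sha256_hex(artefact)
        self._artefacts[digest] = artefact
        return digest

    def load(self, digest: str) -> bytes:
        artefact = self._artefacts[digest]
        # Re-verify on load so a corrupted or swapped binary is caught.
        if sha256_hex(artefact) != digest:
            raise ValueError("artefact does not match its recorded hash")
        return artefact


def replay(registry: ModelRegistry, record: dict) -> bool:
    """Re-run a historical event against the artefact that produced its verdict."""
    artefact = registry.load(record["model_sha256"])
    return toy_infer(artefact, record["event"]) == record["verdict"]


registry = ModelRegistry()
v1 = registry.register(b"model-weights-v1")
event = {"host": "ws-042", "process": "powershell.exe", "event_id": 4688}
record = {
    "model_sha256": v1,
    "event": event,
    "verdict": toy_infer(registry.load(v1), event),
}

# A later rollout adds a new artefact without disturbing the old one.
v2 = registry.register(b"model-weights-v2")

print(replay(registry, record))  # True
```

Keying the registry by content hash rather than by a version label is the design choice doing the work here: the label in the audit record and the bytes that ran cannot disagree without the mismatch being detectable.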

Why this property is worth defending

Three reasons it matters in security operations specifically.

Regulatory defensibility. When the ICO is calibrating its enforcement on the controls that were in place and the SEC has charged a CISO over the accuracy of public security statements, the ability to defend a specific past decision ("here is the model that produced that verdict, here is the input, here is the output, you can verify it for yourself") is materially different from "here is what the model thought at the time, we cannot reproduce it now".

Incident reconstruction. When an incident is being investigated, the question of whether the SOC's automated systems flagged the precursor activity is not theoretical. A deterministic system can be re-run against the historical event stream and produce a precise answer. A non-deterministic system can be re-run and produce a different answer, and the difference between "the model flagged it and the analyst dismissed it" and "the model did not flag it" matters.

Regression testing. Releasing a new model version is, in a deterministic system, a clean comparison: same inputs, two outputs, easy to characterise the differences. In a non-deterministic system, you cannot tell whether the new model has actually changed behaviour or whether the apparent change is the run-to-run variance.
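A minimal sketch of that comparison, assuming two hypothetical deterministic models that differ only in a score threshold. The models, the event fields, and `diff_verdicts` are illustrative and not taken from any product.

```python
def model_v1(event):
    """Hypothetical old model: flags events scoring 70 or more."""
    return "malicious" if event["score"] >= 70 else "benign"


def model_v2(event):
    """Hypothetical new model: threshold lowered to 60."""
    return "malicious" if event["score"] >= 60 else "benign"


def diff_verdicts(old_model, new_model, events):
    """Return (event id, old verdict, new verdict) for every changed verdict."""
    changed = []
    for event in events:
        old, new = old_model(event), new_model(event)
        if old != new:
            changed.append((event["id"], old, new))
    return changed


events = [
    {"id": "e1", "score": 10},
    {"id": "e2", "score": 65},
    {"id": "e3", "score": 90},
]

# Same model against itself: the noise floor of a deterministic system is empty.
print(diff_verdicts(model_v1, model_v1, events))  # []

# Old against new: every entry is a genuine behaviour change.
print(diff_verdicts(model_v1, model_v2, events))  # [('e2', 'benign', 'malicious')]
```

The first call is the important one. Because a model compared with itself yields no differences, anything in the second diff can be attributed to the new version rather than to run-to-run variance.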

What the market is doing about it

A scattering of responses from the major vendors over the past year.

Some are publishing version-pinned endpoints with stronger reproducibility guarantees. The reproducibility is best-effort rather than bit-identical; the audit trail is improved.

Some are offering replay features: "here is what the model said when this event was processed". The replay is a cached output rather than a re-execution; it preserves the historical answer but does not let you re-test against a new model version on the same data.

Some are not yet thinking about it. The vendors that ship their AI security product fastest are not, by and large, the vendors that have thought hardest about audit and reproducibility.

The question to ask your vendor

When you next evaluate an AI security product, there are three questions worth asking.

Is the inference deterministic? Same input, same output, every time. If the answer is no, the next question is whether you are content with that.

If the model is updated, can I still reproduce historical outputs against the model version that produced them? If the answer is no, your audit trail has a hole in it.

What does the audit trail look like for a single verdict? The right answer specifies the model artefact, the input record, the output record, and a timestamp. "We logged that the model ran" is not enough.
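As a rough picture of what such a record might contain, here is a sketch of one possible shape. The class name, field names, and sample values are assumptions made for illustration; a real product will have its own schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VerdictAuditRecord:
    """Hypothetical per-verdict audit entry with the four items named above."""

    model_sha256: str    # hash of the exact compiled artefact, not an alias
    input_record: dict   # the event as the model saw it
    output_record: dict  # the verdict as the model produced it
    timestamp_utc: str   # when the inference ran, in ISO 8601


def make_record(model_artefact, input_record, output_record, now=None):
    """Build an audit entry; `now` is injectable so tests are repeatable."""
    now = now or datetime.now(timezone.utc)
    return VerdictAuditRecord(
        model_sha256=hashlib.sha256(model_artefact).hexdigest(),
        input_record=input_record,
        output_record=output_record,
        timestamp_utc=now.isoformat(),
    )


record = make_record(
    b"model-weights-v1",
    {"host": "ws-042", "event_id": 4688},
    {"verdict": "benign"},
    now=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(asdict(record), sort_keys=True, indent=2))
```

A log line that only says the model ran carries none of these four fields, which is why it cannot support a later reproduction of the verdict.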

These are not difficult questions to ask. The answers reveal whether the vendor has thought about the operational reality of running their product in a regulated environment for years.

What is next

In six weeks: the "Copilot for security" wave specifically. The three dominant vendor framings, what they actually do, and where they sit on the spectrum from useful augmentation to expensive query interface.