Six years of EmilyAI: what we kept, what we changed, what we should have done sooner

Post 20 of the AI series. A longer reflective piece. Eight years on from the first sketch of the system that became EmilyAI, six years on from production deployment, the architectural retrospective the series has been building toward.

The first technical sketch of the system that became EmilyAI was a one-page document on a Hedgehog whiteboard in early 2018. The platform reached production deployment for our first MDR customer later that year. The six years since have been a sustained exercise in architectural discipline, and the system, in its essentials, still looks like the original sketch.

This is the longer reflective post the series has been building toward. What we kept, what we changed, what we should have done sooner. I am going to be honest about the decisions that turned out badly as well as the ones that turned out well, because the honest version is more useful than the marketing version.

What we kept

Four architectural decisions that have, against all expectations, held up unchanged through six years and seven generations of underlying AI capability.

The hexagonal pattern. The decision to make the analyst core SIEM-agnostic via inbound and outbound adapter rings was made because we did not want to be tied to any one SIEM vendor. The benefit has been much larger than that: the schema as the single point of coupling has been the most useful piece of internal documentation, the replay safety has made every model upgrade safer, and the cross-tenant intelligence model in v1.1 was a clean schema extension rather than an architectural rewrite. The pattern has been on the right side of every architectural call I have looked back at. Post 4 on the hexagonal lesson covers it in detail.
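For readers who have not met the pattern, here is a minimal sketch of the shape, assuming a much-simplified canonical event. Every name in it (CanonicalEvent, ExampleSiemAdapter, InMemoryCaseSink, the field names, the severity threshold) is hypothetical and chosen for illustration; it is not the EmilyAI schema or code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CanonicalEvent:
    """Hypothetical stand-in for a canonical schema; fields are illustrative."""
    schema_version: str
    source: str
    severity: int
    summary: str


class InboundAdapter(Protocol):
    def to_canonical(self, raw: dict) -> CanonicalEvent: ...


class OutboundAdapter(Protocol):
    def deliver(self, verdict: dict) -> None: ...


class ExampleSiemAdapter:
    """Inbound adapter for an imaginary SIEM payload shape."""

    def to_canonical(self, raw: dict) -> CanonicalEvent:
        return CanonicalEvent(
            schema_version="1.0",
            source="example-siem",
            severity=int(raw["sev"]),
            summary=str(raw["msg"]),
        )


class InMemoryCaseSink:
    """Outbound adapter that records verdicts, standing in for a case system."""

    def __init__(self) -> None:
        self.delivered: list = []

    def deliver(self, verdict: dict) -> None:
        self.delivered.append(verdict)


def analyst_core(event: CanonicalEvent) -> dict:
    """The core sees only CanonicalEvent, never a vendor payload."""
    disposition = "escalate" if event.severity >= 7 else "close"
    return {"summary": event.summary, "disposition": disposition}


def handle(raw: dict, inbound: InboundAdapter, outbound: OutboundAdapter) -> dict:
    """Translate in, decide in the core, translate out."""
    verdict = analyst_core(inbound.to_canonical(raw))
    outbound.deliver(verdict)
    return verdict


sink = InMemoryCaseSink()
result = handle({"sev": "8", "msg": "suspicious login"}, ExampleSiemAdapter(), sink)
```

The point of the shape is that supporting another SIEM means writing another inbound adapter; analyst_core and the schema do not change.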

Deterministic INT8 inference. The 2018 decision to run the analyst at INT8 precision was made for hardware-economics reasons — we could fit more model on cheaper GPUs. The downstream property turned out to be bit-identical reproducibility, which has paid back in regulatory defensibility several times in the past two years and which is now a procurement question the wider market is asking. The decision was sound for the wrong reasons; the right reasons emerged later.
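One reason integer inference lends itself to bit-identical results is that integer addition is associative, so the order in which products are accumulated does not change the sum, whereas floating-point addition is order-sensitive. The sketch below illustrates only that arithmetic property with made-up data; it is not the EmilyAI kernel, and reproducibility in a real system also depends on the kernels, drivers, and hardware underneath.

```python
import random


def dot_in_order(weights, activations, order):
    """Accumulate products in the given index order."""
    acc = 0
    for i in order:
        acc += weights[i] * activations[i]
    return acc


rng = random.Random(0)
n = 4096

# Values in the signed 8-bit range. Python ints never overflow; on real
# hardware a 32-bit accumulator is ample here (4096 * 128 * 128 < 2**31).
w_int = [rng.randint(-128, 127) for _ in range(n)]
a_int = [rng.randint(-128, 127) for _ in range(n)]

forward = list(range(n))
shuffled = forward[:]
rng.shuffle(shuffled)

# Integer accumulation: any order gives exactly the same result.
int_results_match = (
    dot_in_order(w_int, a_int, forward) == dot_in_order(w_int, a_int, shuffled)
)

# Floating point: regrouping the same three numbers changes the last bit.
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)
float_results_match = float_left == float_right
```

Here int_results_match is True and float_results_match is False, which is the whole argument in miniature: a parallel reduction can reorder integer sums freely without changing the answer.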

The continuous feedback loop. Capturing analyst dispositions back through the outbound connectors and turning them into labelled training data was something we did because we wanted the platform to improve over time. The volume of accumulated labelled data through six years is now substantial, and the model improvements from that data are the largest single contributor to the platform's improvement over time. Post 11 on continuous learning at scale covers the operational shape.

The single-tin deployment posture. The decision to ship the entire platform on a single Dell server, rather than as a distributed cloud-native deployment, was made because we wanted to be able to sell to customers who would not put data in a hyperscaler. The benefit has scaled in ways we did not anticipate — the operational simplicity, the audit-grade physical control, the cost predictability. The hyperscaler default has become so unquestioned in current enterprise architecture conversations that the single-tin posture sometimes needs explaining. The argument for it remains as I described in post 15.

What we changed

Three substantive changes to the architecture over six years.

The schema went from v1.0 to v1.2. v1.0 was the original canonical schema. v1.1 added cross-tenant intelligence with the seven privacy principles from post 12. v1.2 added the interaction-and-hunting ring covering Slack, Teams, voice, SMS, email, plus hunt findings as first-class cases. The schema versioning discipline meant each step was backwards-compatible; we have not had to break customer deployments to evolve. The discipline turned out to matter more than I initially thought.
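A minimal sketch of what additive, backwards-compatible versioning can look like, assuming new fields are always optional with defaults. The record type and every field name here (CaseRecord, shared_intel_consent, interaction_channel, is_hunt_finding) are hypothetical illustrations, not the actual schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CaseRecord:
    # Present since v1.0 (hypothetical field names).
    schema_version: str
    case_id: str
    summary: str
    # Added in v1.1 (hypothetical): defaulted, so v1.0 records still parse.
    shared_intel_consent: bool = False
    # Added in v1.2 (hypothetical): also defaulted.
    interaction_channel: Optional[str] = None
    is_hunt_finding: bool = False


def parse_case(raw: dict) -> CaseRecord:
    """Build a record from any schema version, ignoring unknown keys."""
    known = {f.name for f in fields(CaseRecord)}
    return CaseRecord(**{k: v for k, v in raw.items() if k in known})


old = parse_case({"schema_version": "1.0", "case_id": "c-1", "summary": "beaconing"})
new = parse_case(
    {
        "schema_version": "1.2",
        "case_id": "c-2",
        "summary": "hunt hit",
        "interaction_channel": "slack",
        "is_hunt_finding": True,
        "future_field": 1,  # a later version's field, silently skipped
    }
)
```

The rule this encodes is that a version step may add fields but never remove or redefine them, so a v1.0 record remains a valid input to a v1.2 reader.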

The inference tier changed hardware twice. First-generation EmilyAI ran on a single mid-range NVIDIA card. Second-generation moved to dual T4s. Current production is dual L40S with champion-and-challenger model layout. Each transition was a hardware refresh with the model artefact unchanged at the architectural level; only the compilation target shifted. The principle — deterministic INT8 inference on dedicated hardware — has been the same.

The pre-triage classifier went from a research artefact to a CPU-side production component. The Intel AMX backend on the host CPUs lets us run the quantised distilled pre-triage model at single-digit-millisecond latency on CPU resources we already had. This was added in 2022. It reduced the load on the GPU tier by an order of magnitude and made the funnel's economics work properly for higher-volume customers.

What we should have done sooner

Three honest "we got there eventually" observations.

The cross-tenant intelligence model. The principles that became schema v1.1 — anonymisation, threshold-and-cool-off promotion, consent as data — were ideas I had been thinking about since 2019. We shipped v1.1 in 2022. That gap was probably too long. The intelligence value of cross-tenant sharing was sitting on the table for three years before we operationalised it. Some of the delay was correct — we needed customers to be ready for the model and we needed the schema discipline to be mature. Some was caution that I think, in retrospect, was excessive.

Strong audit-chain shipping off-host. We kept an audit log from the start with reasonable discipline; we only made it hash-chained and continuously shipped off-host with tamper-evidence guarantees in 2021. The earlier version was a good audit log; it was not a defensible-against-a-determined-attacker audit log. The retrofit was straightforward; the right time to do it would have been 2018.
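For anyone unfamiliar with hash chaining, a minimal sketch of the general technique follows, assuming JSON-serialisable payloads and SHA-256. The function names and entry layout are hypothetical; this shows the idea, not our implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append an entry whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or dropped entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"event": "model_loaded"})
append_entry(log, {"event": "case_closed"})
chain_ok_before = verify_chain(log)

log[0]["payload"]["event"] = "edited"  # simulate tampering on the host
chain_ok_after = verify_chain(log)
```

The chain on its own only detects edits if the attacker cannot rewrite every later hash as well, which is why shipping the entries off-host continuously matters as much as the chaining.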

A formal model-registry discipline. We had model versioning from the start. We had a formal model registry with cryptographic hashes and complete artefact lineage from 2020. The two-year gap was less consequential than it could have been because we did not have any model regressions in that period; if we had, the gap would have been painful. A formal registry with hashes and complete lineage is now a regulatory expectation; we got there in time, but only just.
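To make "registry, hashes, lineage" concrete, here is a minimal in-memory sketch, assuming each artefact is identified by its SHA-256 digest and records the digest of the model it was derived from. ModelRecord, ModelRegistry, and the training_data_ref field are hypothetical names for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRecord:
    name: str
    version: str
    sha256: str
    parent_sha256: Optional[str]  # None for a model with no predecessor
    training_data_ref: str        # hypothetical pointer to the training set


class ModelRegistry:
    def __init__(self) -> None:
        self._by_hash = {}

    def register(self, name, version, artefact: bytes, training_data_ref,
                 parent_sha256=None) -> ModelRecord:
        """Record an artefact by content hash; the parent must already exist."""
        if parent_sha256 is not None and parent_sha256 not in self._by_hash:
            raise ValueError("unknown parent artefact")
        digest = hashlib.sha256(artefact).hexdigest()
        record = ModelRecord(name, version, digest, parent_sha256, training_data_ref)
        self._by_hash[digest] = record
        return record

    def is_registered(self, artefact: bytes) -> bool:
        """True only if these exact bytes were registered."""
        return hashlib.sha256(artefact).hexdigest() in self._by_hash

    def lineage(self, sha256: str) -> list:
        """Walk parent links from the given model back to its root."""
        chain = []
        current = sha256
        while current is not None:
            record = self._by_hash[current]
            chain.append(record)
            current = record.parent_sha256
        return chain


registry = ModelRegistry()
base = registry.register("analyst", "1.0", b"weights-v1", "dataset-2018")
tuned = registry.register("analyst", "1.1", b"weights-v2", "dataset-2019",
                          parent_sha256=base.sha256)
versions = [r.version for r in registry.lineage(tuned.sha256)]
```

With this in place, the question "which exact artefact produced this decision, and what was it trained from?" becomes a lookup rather than an investigation.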

What we got wrong

Two specific calls I think we got wrong over six years.

The interaction ring (Slack, Teams, voice, etc.) was originally an outbound concern. The first version of the platform treated the messaging integrations as notification surfaces — we will write to your case management system and we will also post a Slack message. We did not, at first, treat them as conversational surfaces — the analyst can converse with the platform through Slack, ask follow-up questions, request actions. The conversational dimension was added in schema v1.2 and required a clean architectural separation into its own ring. The right approach would have been to design for the conversational dimension from the start, even when we did not have customers asking for it.

We over-engineered some of the early privacy controls. The early audit-trail and consent-handling discipline was sound but was, in places, more conservative than the regulatory and contractual environment required. The result was that some platform features were slower to ship than they could have been because of self-imposed audit overhead. The right calibration would have been a thorough audit chain with a lighter feature-level consent ceremony. Some of the overhead has since been simplified.

What I think we got right that the wider market missed

A short list, since this is the comparative anchor for the series.

Building for regulated deployment from the start. Most of the AI-in-cyber category through 2023-24 was built for the demo-and-rapid-prototyping environment and is now retrofitting the audit, lineage, and continuity disciplines that regulated buyers need. We built for those buyers from the start. The product cost was a year or two of slower feature velocity; the strategic benefit was being in production at scale in 2025 with customers whose regulators are now asking the questions we already had answers for.

Treating AI as engineering rather than research. The disciplines I have described through the series — schema versioning, hash-chained audit, deterministic kernels, champion-challenger rollout, model registry — are software engineering disciplines applied to an AI workload. They are not novel. They were not novel in 2018. The novelty in the AI-in-cyber category has been building serious production AI without these disciplines, and then trying to retrofit them. The retrofit is more expensive than the original work.

Refusing to chase the hype. I have been declining to add a chatbot interface to EmilyAI since 2022. There has been customer demand for it. The architectural integrity argument is that the augmentation layer — where chatbot affordances actually live — is a different concern from the analyst layer, and conflating them produces a confused product. The decision has been unpopular at moments and is, I think, still the right one.

Where this leaves the platform in 2026

Six years and three months from production deployment. The platform has been continuously in production at customers, with the same canonical schema (in three versions), the same deterministic-INT8 discipline, the same hexagonal pattern, the same single-tin deployment posture, and the same continuous-learning loop. There have been hardware refreshes, connector matrix growth, and schema extensions, but no architectural break.

The customer-side conversations in 2026 are different from those in 2018. The regulatory environment is different. The wider AI conversation is different. The platform looks largely the same and the conversations it supports have got materially easier as the world has caught up.

What is next

In five weeks: the final post in the series. Where this is heading. The next eighteen months in AI and cyber, the agent age and the analyst still in the loop, and what I will be writing about when the series formally closes.