Continuous learning at scale

EmilyAI has been learning from analyst feedback for six years. The LLM-as-frozen-artefact shape gets the operational properties of *the model that improves over time* structurally wrong. Here is what that means in practice.

_Post 11 of the AI in cyber series._

A property of EmilyAI's architecture that I have referred to in passing through this series but have not yet treated directly: the continuous learning loop. Every closed case in the customer's case-management system flows back through the outbound connector and becomes a labelled training example. The label is recorded with a strength-of-signal weighting derived from the analyst's role, the elapsed time, and the depth of the notes. Labels accumulate as a by-product of normal operation. Retraining cycles use them.

The shape this gives the system, over six years, is qualitatively different from the shape of the LLM-as-frozen-artefact products that dominate the current cyber AI conversation. This post is about that difference, what it produces in practice, and why I think the continuous learning property is the second-most-important architectural decision in the system (after determinism, which post 2 covers).

What continuous learning actually requires

Three things, none of which are trivial.

A clean signal source. The label has to come from somewhere, and here it comes from the analyst's eventual disposition of the case. The signal source has to be unambiguous (the case was closed as malicious / benign / suspicious / informational), traceable to a specific human (who closed it, what role they were in, what their experience level was), and time-bound (how long elapsed between the AI's verdict and the human's closure). All three turn out to matter for the weighting.
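
To make the weighting idea concrete, here is a minimal sketch of how role, elapsed time, and note depth could be combined into a single label weight. The role names, the multipliers, the one-minute threshold, and the direction of the elapsed-time effect are all assumptions for illustration; the post does not give EmilyAI's actual formula.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical role multipliers; the real values are not given in this post.
ROLE_WEIGHT = {"tier1": 0.6, "tier2": 0.8, "lead": 1.0}


@dataclass(frozen=True)
class Closure:
    disposition: str    # "malicious" | "benign" | "suspicious" | "informational"
    analyst_role: str   # key into ROLE_WEIGHT
    elapsed: timedelta  # time between the AI's verdict and the human's closure
    note_chars: int     # crude proxy for the depth of the analyst's notes


def signal_weight(c: Closure) -> float:
    """Return a label weight in [0, 1]; higher means a more trusted label."""
    role = ROLE_WEIGHT.get(c.analyst_role, 0.5)
    # Assumption: a closure within a minute is treated as a likely rubber stamp.
    time_factor = 0.5 if c.elapsed < timedelta(minutes=1) else 1.0
    # Assumption: note depth saturates at 500 characters; empty notes still count.
    notes_factor = 0.5 + 0.5 * min(c.note_chars, 500) / 500
    return role * time_factor * notes_factor
```

A multiplicative form is only one choice; it means a weak value on any one factor pulls the whole label down, which suits a signal where every factor has to hold up.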

A schema that carries the feedback as data. The system has to capture the analyst's feedback as a structured field, not as an unstructured note. EmilyAI's canonical schema has had a feedback block since v1.0: the analyst's disposition is captured as structured data, the strength-of-signal weighting is computed from the metadata, and the result is queryable. "Show me the cases my analysts closed as benign in the last quarter where the model said malicious" is a one-line query.
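
As a rough illustration of why structured feedback matters, the sketch below models a case with a feedback block and expresses the benign-versus-malicious question as a single filter. The field names (`disposition`, `closed_at`, `weight`, `model_verdict`) are hypothetical and are not taken from the actual canonical schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Feedback:
    disposition: str    # the analyst's structured disposition
    closed_at: datetime
    weight: float       # strength-of-signal weighting


@dataclass(frozen=True)
class Case:
    case_id: str
    model_verdict: str
    feedback: Feedback


def benign_but_flagged(cases: list[Case], now: datetime, days: int = 90) -> list[Case]:
    """Cases closed as benign in the last `days` days where the model said malicious."""
    cutoff = now - timedelta(days=days)
    return [
        c for c in cases
        if c.feedback.disposition == "benign"
        and c.model_verdict == "malicious"
        and c.feedback.closed_at >= cutoff
    ]
```

With free-text notes, the same question would need text parsing and guesswork; with a structured block it is a filter over three fields.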

A retraining pipeline that does not destabilise production. This is the hard part. The model that is in production is the one customers rely on. Replacing it with a freshly trained model has to preserve the operational properties that customers have built workflows around. EmilyAI's champion-and-challenger architecture (described in post 6 of this series) handles this: a candidate model is trained, deployed to the challenger GPU, runs shadow inference on live traffic, and is compared against the champion's verdicts before any promotion decision. A bad model never reaches customer verdicts.
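
The shadow comparison can be sketched as a report over paired verdicts plus a promotion gate. This is a simplified illustration, not the pipeline described in post 6: the function names, the agreement threshold, and the minimum sample count are all assumed.

```python
from collections import Counter


def shadow_report(pairs: list[tuple[str, str]]) -> dict:
    """Summarise (champion_verdict, challenger_verdict) pairs from shadow inference."""
    total = len(pairs)
    agree = sum(1 for champ, chall in pairs if champ == chall)
    # Count each kind of verdict change, e.g. ("benign", "malicious").
    flips = Counter((champ, chall) for champ, chall in pairs if champ != chall)
    return {
        "total": total,
        "agreement": agree / total if total else 0.0,
        "flips": dict(flips),
    }


def may_promote(report: dict, min_agreement: float = 0.98, min_samples: int = 1000) -> bool:
    """Hypothetical gate: enough traffic observed and few enough verdict changes."""
    return report["total"] >= min_samples and report["agreement"] >= min_agreement
```

Agreement alone does not say which model is right. A better challenger has to disagree with the champion somewhere, so in practice the `flips` would be reviewed against analyst dispositions before anyone acts on the gate.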

The combination of these three — signal source, schema support, safe retraining — is what makes continuous learning work as a property of the system, not just as a research claim.

What the LLM products do instead

The shape of the dominant LLM-based cyber AI products is, in operational learning terms, structurally different.

The model is trained centrally by the vendor on whatever the vendor's training corpus is. The customer's feedback is not, in most current products, part of the training loop. The model the customer uses today is, broadly, the same model every other customer uses today. Improvements come from the vendor's central retraining cycle, on the vendor's schedule, with the vendor's idiosyncratic blend of data.

Some products offer fine-tuning on the customer's own data. This helps. It is not the same as a continuous learning loop. Fine-tuning is a discrete operation: it is scheduled, it is expensive, and it produces a new artefact that is itself frozen until the next fine-tune. The latency between "the analyst saw a pattern that should change the model's behaviour" and "the model's behaviour changed" is months, not minutes.

Some products offer retrieval-augmented generation, where the model can reference recent customer-specific examples at inference time. This is useful, but it is not the same as the model itself learning. The model reasons over the retrieved examples afresh on every query; it has not internalised the pattern.

What six years of continuous learning produces in practice

Three specific properties that have emerged over time.

Customer-specific drift. Different customers see different traffic patterns. Over time, EmilyAI's model has shifted in customer-specific ways — what counts as anomalous on customer A's network is not what counts as anomalous on customer B's. The cross-tenant intelligence layer (schema v1.1) shares anonymised indicators across the fleet so that everyone benefits from individual customer experience without exposing the customers to each other. The result is a model that is broadly tuned by the whole fleet and locally tuned by each tenant's own feedback.

Robustness to changes in the threat landscape. When a new attack pattern emerges — and they emerge constantly — the analyst-feedback signal catches it on the first few cases. The model retraining cycle picks it up. Customer protection improves before the pattern is widely known in the public threat intelligence space. This is the property the cross-tenant model is designed for, and it has produced visible early-warning effects across the customer base several times since 2020.

Improvement that compounds. The accuracy improvements from year four to year five are not as large as those from year one to year two, but the improvement has not stopped. Each year adds an accumulated body of labelled cases that the next year's retraining benefits from. The compounding is slow and visible.

What it does not solve

Three things continuous learning does not, on its own, fix.

Concept drift in genuinely novel attack types. Patterns that have no historical analogue do not appear in the training data until after the first cases happen. The model is reactive, not predictive, on novel patterns. The human analysts above EmilyAI exist precisely for this reason.

Label noise. Analysts close cases incorrectly, in good faith, with incomplete information. Some fraction of the training signal is noise. The weighting model controls for this imperfectly. Periodic re-validation against held-out gold-standard cases is part of the discipline.
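
A re-validation check of this kind might look like the sketch below: score a model against a small set of independently vetted cases and refuse any candidate that regresses. The `predict` callable, the shape of a gold case, and the tolerance value are assumptions for illustration.

```python
def gold_accuracy(predict, gold_cases: list[tuple[dict, str]]) -> float:
    """Fraction of gold-standard cases where predict(features) matches the vetted label."""
    if not gold_cases:
        raise ValueError("gold set is empty")
    hits = sum(1 for features, label in gold_cases if predict(features) == label)
    return hits / len(gold_cases)


def passes_revalidation(candidate_acc: float, baseline_acc: float,
                        tolerance: float = 0.005) -> bool:
    """Hypothetical rule: the candidate may not fall more than `tolerance` below baseline."""
    return candidate_acc >= baseline_acc - tolerance
```

Because the gold labels never pass through the day-to-day closure workflow, this check is independent of the noisy signal it is meant to audit.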

The cold-start problem for new customers. A customer joining the platform does not, initially, have feedback to contribute to their tenant's tuning. They benefit from the fleet model immediately; the customer-specific tuning takes months to develop. The trade-off is presented to customers at sale and is part of why the on-prem deployment loses the cross-tenant uplift (as covered in post 1) but still gets the fleet-tuned model on day one.
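
One way to picture the cold-start trade-off, purely as an illustration and not a description of EmilyAI's internals, is a blend that leans on the fleet score until the tenant has accumulated enough labelled cases. The function name and the constant `k` are hypothetical.

```python
def tenant_blend(fleet_score: float, tenant_score: float,
                 n_labels: int, k: int = 500) -> float:
    """Shrink towards the fleet score until the tenant has accumulated labelled cases."""
    if n_labels < 0:
        raise ValueError("n_labels must be non-negative")
    # alpha is 0 with no tenant labels and approaches 1 as labels accumulate.
    alpha = n_labels / (n_labels + k)
    return (1 - alpha) * fleet_score + alpha * tenant_score
```

With no labels the result is exactly the fleet score, which matches the day-one behaviour described above; the tenant's own signal only dominates once `n_labels` is large relative to `k`.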

What this means for buyers

Two questions worth asking any AI security vendor in 2025.

Does the model learn from my feedback? If yes, in what time-frame, and how is that learning visible to me? If no, what is your update cadence, and what is your channel for me to influence what the model does?

If the model is updated, what changes for the verdicts on my historical traffic? This is the determinism-and-replay question from post 2 restated. The vendor whose answer is "we cannot tell you" has not engineered for the operational properties that matter at scale.

What is next

In six weeks: cross-tenant intelligence. The architecture that turns one customer's experience into another customer's protection without exposing either to the other. The privacy and engineering problem nobody talks about, and what we have learned from the v1.1 schema work.