The assist phase of the alert-triage ML work — Emily, as the team continues to call it despite my repeated suggestions of more boring names — has been running in production against analyst-facing decisions since the 6th of February. Six months of operational data. The quantitative metrics have continued to track within range of what the shadow phase predicted; the qualitative observations are the more interesting half of what we have learned, and they are the part I want to record properly.
The quantitative numbers, briefly. Agreement rate between Emily's high-confidence (>90%) classifications and the analyst's eventual decision: 94.2% across the customer fleet, with per-customer variation between 89% and 97%. The 89% case is one customer with a particularly atypical alert mix (industrial-control-system gateway monitoring, where the rule mix is unlike the rest of the fleet) and we are working separately on that customer's profile. Time-to-triage for alerts where Emily produced a high-confidence classification: down to a median of 47 seconds from a baseline of around 4 minutes 30 seconds in the pre-Emily era. Daily alert-triage capacity per analyst: up by approximately 40%. Analyst-reported workload satisfaction (we have been running a fortnightly survey since February): up substantially, with the principal reason given in the open-text responses being the freeing of attention from low-value triage.
Incident-grade precision and recall, the metrics I have been most focused on, are the part I want to set down carefully. Emily's recall on incident-grade alerts has improved from 87% in January to 93% in May to 94% currently; precision has improved from 41% to 67%. Both are moving in the right direction; the absolute level of precision is now adequate for high-confidence auto-routing but not yet for auto-closure. The miss rate (6%) is the part that I am still uncomfortable with, and the analysis of the missed cases has been instructive: most of the misses are alerts that Emily classified as benign or false-positive but that the analyst classified as incident-grade on the basis of context that Emily does not have access to. The context is, typically, threat-intelligence material correlated with the alert by the analyst — a known-bad indicator, an active campaign tied to a specific industry vertical, a customer-specific concern that the analyst is aware of from a prior conversation. The model does not see those contextual signals.
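To keep the relationship between those figures straight, here is a minimal sketch with hypothetical counts (not the production numbers) showing how the recall, precision, and miss-rate values fit together:

```python
# Illustrative only: made-up counts that happen to reproduce the reported ratios.
true_positives = 94    # incident-grade alerts Emily classified as incident-grade
false_negatives = 6    # incident-grade alerts Emily missed (classified benign / FP)
false_positives = 46   # non-incident alerts Emily classified as incident-grade

recall = true_positives / (true_positives + false_negatives)      # 0.94
precision = true_positives / (true_positives + false_positives)   # ~0.67
miss_rate = 1 - recall                                            # 0.06

print(f"recall={recall:.0%}  precision={precision:.0%}  miss rate={miss_rate:.0%}")
```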
Remediating that context gap is the main work for this autumn. We are building a feature pipeline that pulls threat-intelligence indicators and customer-specific context into the model's input space. The technical work is, in Splunk-and-Python terms, not difficult; the harder work is the curation of the threat-intelligence content and the management of its currency. Indicators decay; campaigns conclude; vendor-level intelligence quality varies. The process side of this is going to drive a substantial part of the next year's work and possibly an additional headcount on the team.
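For concreteness, a minimal sketch of what the enrichment step might look like. The field names, the indicator store, and the 90-day decay window are all assumptions for illustration; the production pipeline will differ.

```python
from datetime import datetime, timedelta, timezone

INDICATOR_MAX_AGE = timedelta(days=90)  # assumed decay window for indicators

def enrich_alert_features(alert: dict, indicator_store: dict) -> dict:
    """Add threat-intelligence-derived features to an alert's feature vector."""
    features = dict(alert.get("features", {}))
    now = datetime.now(timezone.utc)

    # Match the alert's observables (IPs, domains, hashes) against current indicators.
    hits = []
    for observable in alert.get("observables", []):
        record = indicator_store.get(observable)
        if record and now - record["last_seen"] <= INDICATOR_MAX_AGE:
            hits.append(record)

    features["ti_indicator_hits"] = len(hits)
    features["ti_max_confidence"] = max((r["confidence"] for r in hits), default=0.0)
    features["ti_active_campaign"] = any(r.get("campaign_active") for r in hits)
    return features
```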
On to the qualitative observations. Three things have surfaced that I did not predict in January.
First, the model has surfaced inconsistencies in analyst classification across the team. When Emily and an analyst disagree, the analyst is asked to record the reason. The reasons, aggregated, have shown that different analysts apply different decision rules to similar alerts, and that some of those rules are formally inconsistent with the customer's playbook. The patterns are not, generally, evidence of poor analyst work — they are evidence of the playbook being incomplete or of the analysts having developed local heuristics that are reasonable but not formally documented. The model has, in a sense, become a continuous audit of the consistency of the team's classifications, and the result has been a substantial revision of the customer-facing playbook documents over the past six months. This was not the design intent but is, in retrospect, an important capability.
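The aggregation itself is unglamorous. Roughly the shape of the review job, with a hypothetical export file and column names standing in for the real case-management data:

```python
import pandas as pd

# Hypothetical export of Emily-vs-analyst disagreements and the recorded reasons.
disagreements = pd.read_csv("emily_analyst_disagreements.csv")

# Group by alert signature, analyst, and recorded reason to see where
# decision rules diverge across the team for similar alerts.
summary = (
    disagreements
    .groupby(["alert_signature", "analyst", "recorded_reason"])
    .size()
    .rename("count")
    .reset_index()
    .sort_values("count", ascending=False)
)
print(summary.head(20))
```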
Second, the model's confidence distribution has become a useful triage artefact in its own right. Alerts where Emily's confidence is in the middle range (40-70%) are, according to analyst feedback, the alerts where the analyst's own decision is hardest. The model is, structurally, identifying the alerts that benefit most from analyst attention, separately from its classification work. We are now using the confidence band as a routing signal: high-confidence false-positives go to a junior analyst for spot-check; high-confidence incident-grade goes to a senior analyst for direct response; mid-confidence alerts go to a tier-2 reviewer. The throughput improvement from this routing is partly distinct from the classification accuracy improvement.
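The routing rule itself is small. A minimal sketch of the logic described above, with thresholds and queue names that are illustrative rather than the production configuration:

```python
HIGH_CONFIDENCE = 0.90
MID_BAND = (0.40, 0.70)

def route_alert(classification: str, confidence: float) -> str:
    """Return the analyst queue an alert should be routed to."""
    if confidence >= HIGH_CONFIDENCE and classification == "false_positive":
        return "junior_spot_check"       # high-confidence FP: quick verification only
    if confidence >= HIGH_CONFIDENCE and classification == "incident_grade":
        return "senior_direct_response"  # high-confidence incident: straight to a senior
    if MID_BAND[0] <= confidence <= MID_BAND[1]:
        return "tier2_review"            # the genuinely hard calls
    return "standard_queue"              # everything else follows the normal flow
```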
Third, and most surprising, the analysts have started using Emily as a kind of working-memory aid. The model's classifications and the per-feature attribution data (we have integrated SHAP-style feature attribution into the analyst interface) make it easier for an analyst to articulate why they classify an alert the way they do. The senior analysts in particular have reported that the explainability layer has changed how they document their reasoning in case notes. Several of the senior analysts have started using the feature-attribution view in customer briefings. The capability has migrated, somewhat, from "decision aid" toward "communication aid" in ways that the original product specification did not contemplate.
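The attribution view behind that is standard SHAP machinery. A minimal sketch of the kind of call involved, assuming a tree-based classifier and a feature DataFrame (both assumptions on my part; the production interface renders this per alert rather than in a console):

```python
import shap  # SHAP library for per-feature attribution

def top_attributions(model, features_df, alert_index, k=5):
    """Return the k features contributing most to one alert's classification."""
    explainer = shap.TreeExplainer(model)
    values = explainer.shap_values(features_df)
    if isinstance(values, list):   # some SHAP versions return one array per class
        values = values[-1]        # take the incident-grade (positive) class
    row = values[alert_index]
    ranked = sorted(zip(features_df.columns, row), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]
```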
For the operational deployment, the next steps are clear. The threat-intelligence integration work lands in the autumn. The auto-closure path follows, subject to customer agreement, with confidence thresholds set conservatively. The model retraining cadence is currently weekly, and we are looking at moving to daily for the high-volume customers. The interface refinements continue based on analyst feedback. The long-running shadow-deployment data continues to be collected.
The strategic conversations are starting, separately. Two of the customer organisations have asked whether they could license the Emily capability for use in their own internal SOCs. The answer at the moment is "not yet, the productisation work has not been done and the deployment posture in your environment requires more careful handling than what runs on our infrastructure". The medium-term answer, however, is interesting. The build-or-buy posture for SOC tooling is shifting, and the question of whether what we have built is a Hedgehog operational capability or a Hedgehog product capability is one I want to think about more carefully through the second half of the year. There is a business decision in there that I have not yet made.
The team. Two ML engineers, the postgraduate intern (now full-time and effectively running the model-update pipeline), and the senior SOC analyst who works half-time on the Emily integration and the playbook revision work. The team is, quietly, becoming the most productive engineering function in the company. I want to make sure that productivity is sustainable rather than burnout-driven, and the autumn planning will include that explicit conversation.
I will write more on this as the autumn deployment work progresses. The model continues to improve. The analysts continue to teach it things. The capability continues to grow.