The EmilyAI commercial pilot has been running across three customer deployments since the first cutover on the 24th of April. Four months. The operational data is sufficient to write up.
The model-quality numbers across the three pilot customers first. Agreement rate between Emily's high-confidence (>90%) classifications and the customer analyst's eventual decision: 92.4%, 94.7%, and 88.1%. The three figures span a wider range than the internal deployment did, and the lowest is from the customer with the most atypical alert mix (a financial-services specialist with a heavy concentration of payment-fraud-detection rules that the original training data does not weight strongly). Per-customer adaptation is the active area of engineering attention. The middle figure (94.7%) comes from the customer most similar to the internal-deployment fleet and is consistent with what the internal data predicted.
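For concreteness, the agreement-rate metric is simple to state in code. A minimal sketch, assuming a per-alert record carrying the model's confidence and classification alongside the analyst's eventual decision; the field names are illustrative, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class TriagedAlert:
    model_confidence: float  # model's confidence in its own classification, 0.0-1.0
    model_class: str         # e.g. "benign", "suspicious", "incident"
    analyst_class: str       # the customer analyst's eventual decision

def agreement_rate(alerts, threshold=0.90):
    """Share of high-confidence model classifications the analyst agreed with."""
    high_conf = [a for a in alerts if a.model_confidence > threshold]
    if not high_conf:
        return float("nan")  # no high-confidence classifications in this window
    agreed = sum(1 for a in high_conf if a.model_class == a.analyst_class)
    return agreed / len(high_conf)
```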
Per-customer triage-time reduction: 33%, 41%, and 28% on median time-to-classification for high-confidence model classifications. Analyst-reported workload satisfaction has improved at all three customers. None of the three has reported an incident-grade alert that the model classified as benign and the analyst missed; the operational floor therefore holds on four months of data: the model is not creating new incident risk by silencing alerts, which was the primary safety concern at the start of the pilot.
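The triage-time figure is the corresponding before-and-after comparison. A sketch under the assumption that we have per-alert time-to-classification samples from a pre-pilot baseline window and from the pilot, restricted to alerts where the model produced a high-confidence classification:

```python
from statistics import median

def median_triage_reduction(baseline_secs, pilot_secs):
    """Percentage reduction in median time-to-classification during the pilot."""
    before = median(baseline_secs)
    after = median(pilot_secs)
    return (before - after) / before * 100.0

# A 41% figure means the pilot median is 59% of the baseline median.
```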
The customer-side findings are more interesting than the model-quality numbers. Three things have surfaced.
First, the model has, in each customer deployment, surfaced inconsistencies in the customer's own playbook, the same pattern we observed in our internal deployment: different analysts at the customer applying different decision rules to similar alerts. The model's flagging of these inconsistencies has driven a substantial revision of the internal playbook documents at each customer over the first two months of the deployment. The capability is, again, working as a continuous audit of classification consistency, and the customers have been finding the audit valuable independently of the workload-reduction value. We are now formalising this: the next product release will include a "playbook drift" report as a first-class feature, surfacing the model's identified inconsistency clusters to the customer's SOC manager.
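At its core the drift report is a grouping problem: bucket alerts by a coarse signature and flag buckets where analysts have reached divergent decisions on near-identical alerts. A simplified sketch; the real feature pipeline is much richer than the three-field signature used here, and the field names are illustrative.

```python
from collections import defaultdict

def drift_clusters(alerts):
    """alerts: iterable of dicts with 'rule_id', 'src_zone', 'dst_zone', 'analyst_class'."""
    buckets = defaultdict(list)
    for alert in alerts:
        signature = (alert["rule_id"], alert["src_zone"], alert["dst_zone"])
        buckets[signature].append(alert["analyst_class"])
    # A bucket with more than one distinct analyst decision is a candidate
    # inconsistency cluster for the SOC manager's report.
    return {sig: decisions for sig, decisions in buckets.items()
            if len(set(decisions)) > 1}
```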
Second, the threat-intelligence-integration feature we built in the autumn of 2017 is now performing materially better than the version that ran on internal data alone. The customer deployments add three things: the customer's own threat-intelligence subscriptions (which vary across pilots and include vendors we do not work with internally), the customer's own indicators of compromise from prior incidents, and industry-vertical threat intelligence (financial services, manufacturing, retail). The aggregate adds substantial signal that the model uses for classification decisions on incident-grade alerts, and the precision-recall curves for the incident-grade class are noticeably better at all three pilot customers than the internal-deployment baseline.
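The comparison behind that last claim is a standard one. A sketch, assuming two fitted model variants (with and without the customer-side threat-intelligence features) and a held-out customer evaluation set; the names clf_internal, clf_with_ti, X_eval, and y_is_incident are placeholders, and a binary incident-versus-everything-else head is assumed for simplicity:

```python
from sklearn.metrics import average_precision_score

def incident_class_ap(clf, X_eval, y_is_incident):
    """Average precision for the incident-grade class (area under the PR curve)."""
    # Assumes a binary classifier whose positive class is incident-grade.
    scores = clf.predict_proba(X_eval)[:, 1]
    return average_precision_score(y_is_incident, scores)

# The finding above is that, on customer data,
#   incident_class_ap(clf_with_ti, X_eval, y_is_incident)
# comes out noticeably higher than
#   incident_class_ap(clf_internal, X_eval, y_is_incident).
```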
Third, and this is the one that has me thinking about the product roadmap: the customers have started using the model's output for purposes other than the original triage-assistance use case. One customer is feeding the model's per-rule precision and recall metrics into their detection-engineering programme to identify which Splunk correlation searches are producing low-value alerts and need either tuning or retirement. Another is using the model's confidence distribution as an input to their analyst-staffing model: high-confidence days require fewer analyst hours, and the customer wants to use Emily's confidence forecast (a small additional capability we built for them in June) to plan staffing. The third is using the playbook-inconsistency reports as a continuous-improvement input to their analyst-training programme. None of these uses was anticipated in the original product specification; all three are the kind of customer-driven extension that suggests the product surface is broader than we initially thought.
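The first of those extensions is worth a sketch because it falls out of data we already store. Assuming a DataFrame of triaged alerts with the originating correlation search and the analyst's decision (column names illustrative), the per-rule value signal is a groupby; per-rule recall needs incident data from outside the alert stream, so only the precision side is shown here.

```python
import pandas as pd

def per_rule_value(df: pd.DataFrame) -> pd.DataFrame:
    """Per correlation search: alert volume and fraction confirmed incident-grade."""
    out = df.groupby("rule_id").agg(
        alerts=("alert_id", "count"),
        incidents=("analyst_class", lambda s: (s == "incident").sum()),
    )
    out["incident_precision"] = out["incidents"] / out["alerts"]
    # High-volume, near-zero-precision rules are the tuning-or-retirement
    # candidates the detection-engineering programme is after.
    return out.sort_values(["incident_precision", "alerts"],
                           ascending=[True, False])
```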
For the engineering team, the implication is that next year's roadmap is more ambitious than the previous year's. The product-management function is now formalised; the customer-success function is hiring; the engineering team is at seven full-time. The development cadence has settled into a roughly monthly release cycle for the platform, with continuous iteration on customer-specific configuration. The technical debt accumulated through the pilot deployments (multi-tenancy plumbing rushed for the first customer, a model-update pipeline that has grown organically, the integration layer for the various customer-side SIEMs) is being addressed in a planned refactoring through Q4. The product is, on the operational measure, succeeding.
The strategic conversation about the company's shape is now more concrete than it was in January. The product business is, on the four-month signal, going to scale. The services business continues to grow, but the rate is unsurprising and roughly tracks UK security-services market growth. On the customer-acquisition side, the product business has three additional customers in active commercial conversation for Q4 deployment. Pricing has been validated against customer-experienced value and is converting. The institutional-capital conversations from Q1 have firmed up and are, in some cases, close to concrete. The decision about whether to take outside capital will, I think, land in Q4 or early 2019. The team's preference is to remain bootstrapped if the growth supports it; the operational requirements of Q4 hiring and the platform build will force the question.
For the wider security-research community, the early-Emily research paper has been submitted to the USENIX Security Symposium for the 2019 conference cycle; the submission deadline was the 14th of June, and a response is expected in October. The paper covers the model architecture, the alert-feature pipeline, the structure of the analyst-decision-data corpus, the precision-recall analysis, and the operational findings from the internal-deployment phase. The customer-deployment data is excluded for confidentiality reasons; the customer findings will become a separate piece of writing in time. The paper was written by the lead engineer (with my co-authorship and editing) and is the team's first peer-reviewed academic submission. Whether it lands at USENIX or at a different venue, the discipline of writing it has been useful for the engineering team's articulation of what they have built.
I will write more as the pilot data continues to accumulate. The product is, on the four-month signal, the right thing.