The intern who started in February has been working through our archived alert-triage data for the past sixteen weeks, and we have something worth writing about, at least in early form. I want to record the work as it stands now, before the optimism I am feeling is either confirmed or, more likely, qualified.
The data. We have, in Splunk, approximately three years of analyst-classified alerts from our SOC. Each alert has a source-rule identifier, a set of metadata fields (timestamps, source/destination, asset class, signature payload), the analyst's classification (true positive, false positive, benign, or one of a small number of pending statuses), the disposition reason recorded by the analyst as free-text, and the time elapsed between alert generation and final classification. The corpus runs to approximately 2.4 million classified alerts across the customer fleet, of which around 78% are false-positive, around 14% are benign, around 7% are true-positive non-incident, and around 1% are true-positive incident-grade. The class imbalance is the first thing that matters and the first thing that constrains the modelling choices.
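For concreteness, one classified alert looks roughly like the following. This is a sketch with stand-in field names and values, not the actual Splunk schema:

```python
# Illustrative shape of one classified alert as pulled from Splunk.
# Every field name and value here is a stand-in for the real index schema.
alert = {
    "rule_id": "IDS-4231",
    "timestamp": "2016-05-17T09:42:13Z",
    "src": "10.20.1.15",
    "dst": "198.51.100.7",
    "asset_class": "workstation",
    "payload_len": 412,
    "classification": "false_positive",   # the analyst's four-way label
    "disposition_reason": "known scanner, allow-listed subnet",  # free text
    "triage_seconds": 184,                # generation-to-classification time
}
```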
The hypothesis. The analyst's classification of an alert is, in many cases, predictable from the alert's structured features alone. If a model can learn the analyst's decision function, it can pre-classify incoming alerts and let the analysts focus on the alerts the model is uncertain about. This is, in 2016 terms, a routine supervised learning problem. The literature for this specific application is thin — most ML-for-security work in the academic literature focuses on intrusion detection at the network layer rather than alert triage on top of existing IDS rules (e.g. the NSL-KDD work of Tavallaee et al.) — but the core method is well-trodden.
The approach. We have built a feature-extraction pipeline that takes a Splunk alert event, normalises the structured fields, and produces a feature vector of around 130 dimensions: source-rule one-hot, hour-of-day, day-of-week, geographic distance between source and destination, asset criticality from the CMDB join, source/destination port categories, payload-length buckets, and (for rules that include them) various signature-specific fields. The free-text disposition reason is processed through a simple bag-of-words encoding for the secondary outputs. We have trained a gradient-boosted decision-tree model (XGBoost, through its scikit-learn-compatible interface) and a scikit-learn logistic-regression baseline against an 80/20 train/test split, stratified on the disposition class. The model is retrained weekly against a rolling window of the most recent twelve months of analyst data.
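A minimal sketch of the training setup, with synthetic stand-in data in place of the real feature pipeline. The dimensionality and class mix mirror what is described above; everything else is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Stand-in for the real feature pipeline: ~130-dimensional vectors and
# four disposition classes at roughly the observed imbalance.
rng = np.random.RandomState(0)
X = rng.rand(20000, 130)
y = rng.choice(4, size=20000, p=[0.78, 0.14, 0.07, 0.01])

# 80/20 split, stratified on the disposition class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

baseline = LogisticRegression().fit(X_train, y_train)
model = XGBClassifier().fit(X_train, y_train)

# Accuracy is misleading at this imbalance; the per-class report
# is what matters.
print(classification_report(y_test, model.predict(X_test)))
```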
The results. The XGBoost model on the test split predicts the analyst's classification with around 91% accuracy on the four-way classification problem. Accuracy is not the most useful metric here because of the class imbalance — predicting "false-positive" for everything would yield 78% accuracy without learning anything — so the more useful metrics are the per-class precision and recall, particularly for the true-positive incident-grade class, which is the class we cannot afford to miss. The model sits at 87% recall on that class with 41% precision. The 41% precision is acceptable in context — false alarms on incident-grade alerts will be reviewed by analysts anyway — but the 13% miss rate is not, and it is the part I am still uncomfortable with. We are working on the feature engineering and the model architecture to push it down.
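Continuing the sketch above, the incident-grade numbers come from restricting the per-class metrics to that one label. Class index 3 here is a stand-in for incident-grade; the real pipeline uses named classes:

```python
# Per-class view of the one class we cannot afford to miss.
from sklearn.metrics import precision_score, recall_score

pred = model.predict(X_test)
print("incident-grade recall:    %.2f"
      % recall_score(y_test, pred, labels=[3], average="macro"))
print("incident-grade precision: %.2f"
      % precision_score(y_test, pred, labels=[3], average="macro"))
```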
The interesting part is what the model has surfaced about the analyst data. The model's confidence distribution is bimodal: a large peak near 100% confidence (alerts that are clear false-positives by structured features alone, which the model classifies correctly with high confidence) and a smaller peak near 50% confidence (alerts where the structured features do not strongly indicate a class). The analysts spend most of their time on the 50% peak; the model is most useful at the 100% peak. The operational implication is that a triage assistant could, conservatively, pre-close approximately half the daily alert volume with high confidence, freeing analyst attention for the alerts that benefit from human judgement.
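The confidence analysis is nothing more exotic than the model's top-class probability per alert. A sketch, continuing from the training code above, with an illustrative 0.99 bar in place of whatever threshold we eventually settle on:

```python
import numpy as np

# Top-class probability per alert; the text describes a bimodal
# distribution with peaks near 1.0 and near 0.5.
proba = model.predict_proba(X_test)   # shape (n_alerts, 4)
confidence = proba.max(axis=1)

hist, edges = np.histogram(confidence, bins=20, range=(0.0, 1.0))

# Share of alerts a triage assistant could pre-close at a high bar.
# The 0.99 threshold is illustrative, not a production value.
preclosable = (confidence >= 0.99).mean()
print("fraction pre-closable at 0.99: %.2f" % preclosable)
```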
I am not yet ready to put this into production against customer-affecting alerts. The first thing that needs to happen is the precision-recall analysis on a per-customer basis — the rule mix and the analyst-decision patterns vary substantially between customers, and a model trained on aggregate data may underperform on any specific customer's profile. We are starting that work this month. The other thing that needs to happen is the explainability piece. SHAP values (Lundberg and Lee, "A Unified Approach to Interpreting Model Predictions" — preprint not yet published, but the precursor work on game-theoretic feature attribution is in the gradient-boosted-tree literature) will let us tell an analyst why the model classified a given alert the way it did, which I think is operationally essential for analyst trust and adoption. An opaque assistant whose decisions analysts cannot audit is a worse product than no assistant at all.
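The per-customer analysis has a straightforward shape. A hypothetical sketch, continuing from the code above and assuming a customer identifier aligned with the test split (the identifier array and customer names are invented for illustration):

```python
# Per-customer breakdown of the incident-grade metrics. customer_id_test
# is an assumed array aligned with X_test/y_test; in the real pipeline it
# would come from the Splunk join, not a random draw.
from sklearn.metrics import precision_recall_fscore_support

customer_id_test = rng.choice(["cust_a", "cust_b", "cust_c"], size=len(y_test))

for cust in np.unique(customer_id_test):
    mask = customer_id_test == cust
    p, r, _, _ = precision_recall_fscore_support(
        y_test[mask], model.predict(X_test[mask]),
        labels=[3], average="macro")  # class 3 = incident-grade stand-in
    print("%s  incident-grade precision %.2f recall %.2f" % (cust, p, r))
```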
The wider thought. The model is, in a real sense, learning what our analysts know. The decisions encoded in three years of triage data represent the institutional expertise of a SOC team, and the model is making that expertise queryable and reusable in a way that the team's collective memory is not. The implications for SOC scaling, for new-analyst onboarding, and for the work of the senior analysts (whose time is currently consumed by junior-grade triage) are interesting and not yet fully thought through. I want to write more on this as the experiment matures.
For now, the operational status: the model runs in shadow mode against live alerts, the analysts make their decisions independently, and we compare the two. The model's predictions go into a separate Splunk index that the analysts cannot see during their decision. After triage, we record the agreement rate. This shadow phase will run through the autumn at minimum. If the agreement rate holds where it currently is, we will move toward a confidence-thresholded auto-close path, subject to customer agreement, with the threshold initially set very high. The path from "interesting research" to "production capability" is long and I am being deliberate about not skipping the steps.
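For the record, both the shadow-mode comparison and the eventual gate are simple. A sketch, with illustrative names and an illustrative threshold:

```python
# Shadow-mode comparison and the eventual gated auto-close path.
# The 0.99 threshold and the field names are illustrative, not decided.
AUTO_CLOSE_THRESHOLD = 0.99  # starts very high, moves only on evidence

def agreement_rate(model_preds, analyst_labels):
    """Fraction of alerts where the shadow model matched the analyst."""
    agree = sum(1 for m, a in zip(model_preds, analyst_labels) if m == a)
    return agree / float(len(analyst_labels))

def auto_close(predicted_class, confidence):
    """Only confident false-positive predictions would ever be pre-closed."""
    return predicted_class == "false_positive" and confidence >= AUTO_CLOSE_THRESHOLD
```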
There is a name we have started using internally. It is a working name, but it has stuck. Emily. As in EMILY — Empirical Mining of IDS Lookup Yields, or some such backronym we will probably adjust over time. The intern picked it; the team uses it. I will keep using it in these notes as the project develops.