The Hedgehog SOC has been running monitoring engagements for eight weeks. The first client started on 27 August, the second came in on 17 September, the third is being onboarded this week. We have two analysts plus me on operations; the third analyst joins next month. The technical stack is broadly what I described in May — Splunk for the SIEM, OSSEC on hosts where we have agent permission, Snort at the boundary, Bro on the larger client networks — with the practical adjustments that always emerge once a build hits production. I want to write down what eight weeks has taught us, because the lessons are sharper than I expected and several of them are not what the SOC-build literature emphasises.

The first thing is that the substantive tuning of detection content takes far longer than the vendor literature suggests. The Splunk searches I wrote during the build were tested against synthetic data and against my own home network; the production behaviour of those same searches against real client traffic has required significant tuning in approximately ninety per cent of cases. The signal-to-noise ratio on most of the original detections was uncomfortable in the first fortnight — false-positive rates of one alert per hour or worse on detections that need to be alert-quality if they are going to be actioned promptly — and the work of bringing the false-positive rate down to one alert per shift has consumed more of my time than the original detection-writing did. The lesson is that detection content is not finished when it works on test data; it is finished when it has been running against production for at least four weeks and the false-positive rate is low enough that the analyst attention has been earned.
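For anyone doing the same exercise, the thing that made the tuning tractable was measuring it rather than arguing about it. Below is a minimal sketch of the kind of per-detection noise report that can be run over analyst triage dispositions; the CSV format, the column names and the eight-hour shift length are illustrative assumptions rather than anything our SIEM actually exports.

```python
#!/usr/bin/env python
"""Rough per-detection noise report from analyst triage records.

Assumes a CSV export (hypothetical format) with one row per closed alert:
    detection,timestamp,disposition
    brute_force_auth,2012-10-24T09:14:00,false_positive
The shift length and the one-alert-per-shift target are the figures from the post.
"""
import csv
import sys
from collections import defaultdict
from datetime import datetime

SHIFT_HOURS = 8          # assumed shift length
TARGET_PER_SHIFT = 1.0   # the "one alert per shift" bar

def report(path):
    counts = defaultdict(lambda: {"alerts": 0, "false_positives": 0})
    first, last = None, None
    with open(path) as handle:
        for row in csv.DictReader(handle):
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%S")
            first = ts if first is None or ts < first else first
            last = ts if last is None or ts > last else last
            stats = counts[row["detection"]]
            stats["alerts"] += 1
            if row["disposition"] == "false_positive":
                stats["false_positives"] += 1
    # how many shifts the export covers, so the per-shift figure is comparable
    shifts = max(1.0, (last - first).total_seconds() / 3600.0 / SHIFT_HOURS)
    for name, stats in sorted(counts.items()):
        per_shift = stats["alerts"] / shifts
        fp_rate = stats["false_positives"] / float(stats["alerts"])
        flag = "TUNE" if per_shift > TARGET_PER_SHIFT else "ok"
        print("%-40s %6.2f alerts/shift  %5.1f%% FP  %s"
              % (name, per_shift, 100 * fp_rate, flag))

if __name__ == "__main__":
    report(sys.argv[1])
```

Anything sitting above one alert per shift, or whose false-positive fraction refuses to come down after a few weeks in production, goes back into the tuning queue.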

The second thing is OSSEC's deployment friction. The agent-based host-intrusion-detection model is, for the kinds of clients we have been onboarding, harder to deploy than I had assumed. Two of the three clients run mixed Windows and Linux estates with active-directory-driven endpoint-management that does not, out of the box, deploy OSSEC. We have had to write deployment automation for both estates, and the deployment has taken longer than the SIEM build at one of the clients. The OSSEC operational realities — agent-server keepalive issues on intermittent VPN connections, log volumes that exceed the original sizing assumptions, occasional log-format mismatches against the detection rules — have all required iteration. The OSSEC mailing list has been substantially more useful than the official documentation; Daniel Cid's responses on the list have been the source of several operational fixes. The conclusion I have reached is that OSSEC remains the right host-IDS choice for our SOC, but the deployment cost is considerably higher than the open-source positioning of the project suggests.
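One concrete example of the keepalive problem: agents behind intermittent VPN links drop off quietly, and nothing complains until someone notices a silent host. A sketch of the sort of cron check that would surface it is below; it leans on the behaviour (standard for a default server install, as far as I can tell) that the files under queue/agent-info/ are touched on each agent keepalive, and both the path and the threshold are assumptions to adjust rather than take as given.

```python
#!/usr/bin/env python
"""Flag OSSEC agents that have gone quiet, e.g. over intermittent VPN links.

Assumption: on an OSSEC server the files under queue/agent-info/ are updated
on each agent keepalive, so a stale modification time indicates an agent that
has dropped off. Path and threshold are assumptions for a default install.
"""
import os
import sys
import time

AGENT_INFO_DIR = "/var/ossec/queue/agent-info"   # assumed server install path
STALE_SECONDS = 30 * 60                          # flag anything quiet > 30 min

def stale_agents(directory=AGENT_INFO_DIR, threshold=STALE_SECONDS):
    now = time.time()
    for name in sorted(os.listdir(directory)):
        age = now - os.path.getmtime(os.path.join(directory, name))
        if age > threshold:
            yield name, int(age // 60)

if __name__ == "__main__":
    quiet = list(stale_agents())
    for name, minutes in quiet:
        print("no keepalive from %s for %d minutes" % (name, minutes))
    sys.exit(1 if quiet else 0)
```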

The third thing is Bro. Bro 2.0, on the larger client network where we have it deployed, is doing things that no other tool in the stack is doing — the protocol-analysis depth and the ability to write detection logic against application-layer behaviour rather than against signature-style packet patterns are, in production, qualitatively different from Snort. Bro has caught two pieces of behaviour in the first six weeks that Snort and the SIEM correlation rules missed: a credential-stuffing pattern against a client's authentication endpoint that did not match any signature but produced an unusual rate of failed-then-successful login attempts from a small set of IP ranges, and an outbound C2 pattern from what turned out to be a compromised internal host where the C2 traffic was disguised as ordinary HTTPS but had a distinctive ratio of inbound-to-outbound bytes that Bro's connection-summary scripts flagged. Both are the kind of behavioural-pattern detection that the TTP-deep-dive piece I wrote in July was talking about, and Bro is, on present evidence, the right tool for that layer.
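The byte-ratio observation is simple enough to illustrate offline. The sketch below ranks destinations in a stock Bro 2.x conn.log by outbound-to-inbound byte ratio in Python; the real detection belongs in Bro script, and the port-443 filter and the thresholds here are illustrative assumptions rather than tuned values.

```python
#!/usr/bin/env python
"""Rank destinations in a Bro conn.log by outbound-to-inbound byte ratio.

Ordinary HTTPS browsing pulls far more bytes down than it pushes up, so an
internal host that consistently sends more than it receives to one external
endpoint is worth a look. Field names follow the stock Bro 2.x conn.log;
the 443/ssl filter and the volume threshold are illustrative assumptions.
"""
import sys
from collections import defaultdict

def parse_conn_log(path):
    fields = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("#fields"):
                fields = line.split("\t")[1:]      # column names from the header
                continue
            if line.startswith("#") or not line:
                continue
            yield dict(zip(fields, line.split("\t")))

def byte_ratios(path, min_bytes=50000):
    totals = defaultdict(lambda: [0, 0])   # (orig_h, resp_h) -> [sent, received]
    for rec in parse_conn_log(path):
        if rec.get("id.resp_p") != "443" and rec.get("service") != "ssl":
            continue
        sent = int(rec["orig_bytes"]) if rec["orig_bytes"] != "-" else 0
        received = int(rec["resp_bytes"]) if rec["resp_bytes"] != "-" else 0
        key = (rec["id.orig_h"], rec["id.resp_h"])
        totals[key][0] += sent
        totals[key][1] += received
    for (src, dst), (sent, received) in totals.items():
        if sent + received < min_bytes:
            continue                        # ignore trivial volumes
        ratio = sent / float(received) if received else float("inf")
        yield ratio, src, dst, sent, received

if __name__ == "__main__":
    for ratio, src, dst, sent, received in sorted(byte_ratios(sys.argv[1]), reverse=True)[:20]:
        print("%8.2f  %s -> %s  (%d out / %d in)" % (ratio, src, dst, sent, received))
```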

The fourth thing is the calibration of analyst attention. I have been writing the playbooks for the analysts, who have come from infrastructure-side work and not from a SOC background, and the calibration has been more difficult than I expected. The analysts are technically capable; what they are learning is the discipline of triage — how much time to spend on each alert, when to escalate, when to dismiss as false positive, when to dig deeper. The training content I drafted during the build has been useful but is being substantially revised based on what the production work has shown about where the analyst attention actually goes wrong. The two failure modes I have been working to correct are spending too long on a single alert at the expense of others (which produces a backlog), and dismissing alerts too quickly because they look like prior false positives (which produces missed detections). The right behaviour is in between, and the calibration is a skill that takes weeks to develop. The structural answer is regular debriefs after each shift, with me reviewing each analyst's decisions and discussing why a particular call was right or wrong; this is more time-intensive than I had budgeted for but is producing measurable week-on-week improvement.
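The debriefs produce data as well as coaching, and it is worth keeping it. A small sketch of the kind of week-on-week figures worth tracking is below; the CSV format and the review categories are hypothetical and simply mirror the two failure modes above.

```python
#!/usr/bin/env python
"""Week-on-week triage calibration figures from shift debrief records.

Assumes a CSV (hypothetical format) with one row per reviewed alert:
    analyst,week,minutes_spent,call,review
where call is the analyst's disposition and review is the debrief verdict
(agree / too_slow / missed), mapping onto the two failure modes in the post.
"""
import csv
import sys
from collections import defaultdict

def weekly_figures(path):
    buckets = defaultdict(lambda: {"alerts": 0, "minutes": 0, "too_slow": 0, "missed": 0})
    with open(path) as handle:
        for row in csv.DictReader(handle):
            b = buckets[(row["analyst"], int(row["week"]))]
            b["alerts"] += 1
            b["minutes"] += int(row["minutes_spent"])
            if row["review"] == "too_slow":
                b["too_slow"] += 1
            elif row["review"] == "missed":
                b["missed"] += 1
    for (analyst, week), b in sorted(buckets.items()):
        print("%-10s week %2d  %5.1f min/alert  %2d over-dwelt  %2d missed"
              % (analyst, week, b["minutes"] / float(b["alerts"]), b["too_slow"], b["missed"]))

if __name__ == "__main__":
    weekly_figures(sys.argv[1])
```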

The fifth thing is what we have actually seen in eight weeks of monitoring. Each of the client estates under monitoring has had at least one incident that required response. Two of the incidents were straightforward: a phishing email that produced a credential compromise (caught at the OSSEC layer when the compromised account began running unusual commands), and an outbound DNS anomaly that traced back to an unmaintained workstation infected with a commodity remote-access trojan (caught at the Bro layer when the workstation began making an unusual volume of DNS queries to a low-reputation domain). The third was less clean: a sustained pattern of authentication attempts against a public-facing service that we eventually concluded was a slow credential-stuffing attack using credentials drawn from the LinkedIn dump in June. The credential-stuffing case is the operationally important one because it took longer to detect, required more correlation across data sources to characterise, and produced the most useful internal lessons about how the detection content needs to evolve.
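The slow case is worth sketching because the giveaway is breadth rather than rate. The sketch below groups consolidated authentication events by source /24 and flags ranges that touch many accounts a handful of times each; the export format, the /24 grouping and the thresholds are assumptions for illustration, not the detection content we actually run.

```python
#!/usr/bin/env python
"""Characterise slow credential stuffing from consolidated auth events.

Assumes auth events have been exported from the SIEM (hypothetical format) as:
    timestamp,src_ip,username,outcome       (outcome: success / failure)
A stuffing source range tries many distinct usernames a handful of times each,
with the odd success, over days rather than minutes.
"""
import csv
import sys
from collections import defaultdict

MIN_USERNAMES = 50      # distinct accounts touched by one /24 (assumed)
MAX_PER_USER = 3.0      # average attempts per account: stuffing, not brute force

def by_range(path):
    ranges = defaultdict(lambda: {"users": set(), "attempts": 0, "successes": 0})
    with open(path) as handle:
        for row in csv.DictReader(handle):
            net = ".".join(row["src_ip"].split(".")[:3]) + ".0/24"
            r = ranges[net]
            r["users"].add(row["username"])
            r["attempts"] += 1
            if row["outcome"] == "success":
                r["successes"] += 1
    return ranges

if __name__ == "__main__":
    for net, r in sorted(by_range(sys.argv[1]).items()):
        users = len(r["users"])
        per_user = r["attempts"] / float(users)
        if users >= MIN_USERNAMES and per_user <= MAX_PER_USER:
            print("%-18s %5d users  %4.1f attempts/user  %d successes"
                  % (net, users, per_user, r["successes"]))
```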

For the practice as a whole, the SOC is now contributing to the client conversations in ways that the advisory work alone could not. The clients running monitoring have all moved their CISO-level conversations from "do we know if we are being attacked" to "we know what we are seeing, here is what we are doing about it, here is what we are not yet seeing", which is a structurally different conversation. This was the goal of the build; the goal is, on present evidence, being met.

The next post is probably the privacy-and-encryption methodology piece I have been drafting through the autumn for the engagement team, or whatever surfaces from the South Carolina Department of Revenue disclosure that hit the press yesterday and which is, from what I can see, going to be one of the larger US public-sector breaches of 2012.

