I have been helping a small consultancy with a network-wide compromise this month. The proximate cause was a single Windows vulnerability — a patch existed, but had not yet been deployed across their estate. The structural cause was that essentially every host on their network ran the same operating system and the same configuration, with similar defaults. Once one host was compromised, the lateral spread was effectively unconstrained.
This is the monoculture problem I wrote about last year. Watching it play out at small scale has clarified my thinking. Worth writing about.
What happened
The consultancy is about thirty staff, all running Windows desktops, all on a single LAN with shared file servers. The infrastructure includes a couple of Windows servers (file, print, internal Exchange), an Internet-facing Linux gateway (which I had set up two years ago), and a handful of less-managed Windows machines for specific applications.
The sequence:
Day 1: An employee opened an attachment on a mail from a contact. The attachment was a trojan dropper. The contact's account had been compromised some time earlier, so the mail looked legitimate.
Day 1, hours later: The trojan installed itself, disabled the Windows update mechanism, and opened a connection to its operator. The operator triaged the host, found it was on a corporate network, and sent commands to scan the local network for other vulnerable hosts.
Day 2: The trojan exploited a known SMB vulnerability against three other Windows desktops on the network. None had the patch applied. All three were compromised within minutes.
Day 2-4: Through the compromised hosts, the operator harvested credentials cached on them. Several admin credentials were among them. With admin credentials, more hosts were compromised.
Day 5: The internal file servers were compromised. Documents were quietly accessed and copied; a small selection were exfiltrated through the gateway, which I had configured to permit outbound HTTP.
Day 7: A user noticed unusual activity on their machine and reported it. Investigation began. By this point about 20 of the 30 hosts had been compromised in some form.
What stopped the spread
A few things, fortunately:
The Linux gateway, which I had set up to log all outbound traffic and to refuse outbound connections to certain destinations, had blocked some of the trojan's command-and-control attempts. The trojan adapted by relaying through other compromised hosts, but the gateway still cut the cleartext exfiltration substantially.
The structured logs forwarded to my friend's network had captured the spread in real time, even though nobody was watching them as it happened. When the investigation began, those logs were the source of truth for the timeline; a sketch of that kind of reconstruction follows this list.
The non-Windows component of the network — the Linux gateway and a small Sun box running a legacy application — had not been compromised. They had different vulnerabilities, but the attacker's tools did not target them.
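As an aside on the logs: the reconstruction work they enabled is simple. A minimal sketch, assuming a hypothetical tab-separated event format with ISO 8601 timestamps (which sort lexicographically); the consultancy's real log schema was different, and the field names here are illustrative only.

    # Sketch: earliest suspicious activity per host, from forwarded
    # gateway logs. Assumed line format (illustrative):
    #   timestamp <TAB> src_host <TAB> dst_host <TAB> action
    def first_seen(log_path):
        earliest = {}
        with open(log_path) as f:
            for line in f:
                ts, src, dst, action = line.rstrip("\n").split("\t")
                # Only events that indicate compromise-like behaviour.
                if action in ("blocked-outbound", "internal-scan"):
                    if src not in earliest or ts < earliest[src]:
                        earliest[src] = ts
        # Sorted by first appearance: this is the spread timeline.
        return sorted(earliest.items(), key=lambda item: item[1])

    for host, ts in first_seen("gateway.log"):
        print(ts, host)

Nothing here is clever. The value was that the logs lived off-host, where the attacker's local clean-up could not touch them.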
What did not stop the spread
Most of the structural defences a small consultancy might have were nominally in place. None of them held:
Antivirus had been deployed on every host, but with default settings and no central management. The signatures were weeks out of date; the scanner did not catch the trojan variant.
Patching was nominally automatic but inconsistent in practice. The vulnerability that allowed the initial compromise had been patched two months earlier; only about two-thirds of hosts had the patch applied.
Network segmentation was minimal. Every host could reach every other host on any TCP or UDP port. Lateral movement was unconstrained.
Monitoring was absent on the Windows side. The trojan operated for a week before anyone noticed.
Authentication relied extensively on the Windows network's shared trust. Once one set of credentials was compromised, many resources were accessible with it.
What the response looked like
The response took three weeks. The work I helped with:
Day 7-8: Identifying the scope. Which hosts were compromised? Which credentials were exposed? What was exfiltrated? The structured logs were the answer; reconstructing from the Windows hosts themselves was unreliable because the attacker had cleaned up locally.
Day 9-12: Containment. Compromised hosts were taken offline one at a time, rebuilt from clean media (no attempt to clean in place; full reinstall), and reintroduced with new credentials.
Day 13-19: Credential rotation. Every password used by every employee was rotated. Every shared service credential was changed. The shared trust between hosts was rebuilt from scratch.
Day 20-21: Notifications. The clients whose data had been exfiltrated had to be told. The legal advice was that the consultancy had a duty to inform; the conversations were uncomfortable but necessary.
The structural problem
The response would have been much the same whatever single Windows vulnerability had served as the proximate cause. The specific bug that allowed the initial compromise is fixed by a patch; the structural conditions that allowed the spread are not fixed by any patch.
Had the network been diverse — say, half Windows and half Linux desktops — the spread would have stopped at the Linux machines, because the trojan's tools would not have worked on them. The compromised population would have been bounded by the size of the homogeneous segment.
Had the network been segmented — even crude per-department subnets with firewalls between them — the spread would have stopped at the segment boundaries. Lateral movement would have required exploiting different vulnerabilities at each boundary.
Had the monitoring been off-host and active — alerts firing on unusual scan traffic from internal hosts — the spread might have been caught on day 2 instead of day 7. A sketch of that kind of alerting follows below.
None of these structural improvements are dramatic. None of them require advanced techniques. All of them would have substantially limited the damage. The reason they were not in place is the standard reason: cost, friction, and the absence of a previous incident to motivate the investment.
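On the monitoring point specifically: the detection that would have caught day 2 does not need to be sophisticated. A sketch of the idea, with illustrative thresholds and an assumed event shape of (epoch seconds, source host, destination host):

    # Sketch: alert when an internal host touches many distinct internal
    # peers in a short window -- the signature of the day-2 SMB scanning.
    # The thresholds and event format are illustrative assumptions.
    from collections import defaultdict

    WINDOW_SECONDS = 60
    PEER_THRESHOLD = 20   # distinct internal peers per window

    def scan_alerts(events):
        """events: iterable of (epoch_seconds, src_host, dst_host)."""
        windows = defaultdict(set)   # (src, window index) -> peers seen
        for ts, src, dst in events:
            key = (src, int(ts) // WINDOW_SECONDS)
            windows[key].add(dst)
            if len(windows[key]) == PEER_THRESHOLD:
                yield (ts, src)      # alert once per window

    # A synthetic burst: one desktop sweeping a hundred internal addresses.
    sample = [(t, "desktop-07", "10.0.0.%d" % (t % 250)) for t in range(100)]
    for ts, host in scan_alerts(sample):
        print("ALERT: %s contacted %d+ peers around t=%s" % (host, PEER_THRESHOLD, ts))

A threshold this crude produces false positives, but a false positive on day 2 is far cheaper than a true positive discovered on day 7.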
The trade-off, examined
Platform monoculture has real benefits that I should acknowledge honestly:
Easier to manage. A homogeneous Windows network has a single set of administrative tools, a single training requirement, a single support contract. Diversity multiplies all of these.
Lower software licensing costs in some scenarios. Volume licensing for a single platform produces price discounts that diversity does not.
Easier to hire for. Windows administrators are a larger pool than "administrators who can manage a mixed Windows-Linux-OpenBSD environment".
Better user experience consistency. Users on identical workstations can help each other and share configurations.
These are not trivial benefits. The choice of monoculture over diversity is not irrational; it has trade-offs in both directions.
The argument for diversity, in light of the consultancy's incident, is that the cost of homogeneity is paid in incidents, and the consultancy paid theirs this month. The structural advantage of diversity is that a single incident is bounded by the size of the homogeneous segment it hits. A network that is 100% homogeneous can lose 100% of its hosts to one exploit; a network that is at most 50% homogeneous can lose at most 50%.
This is a probabilistic argument, not a deterministic one. Most networks will not have major incidents in any given year. Diversity is insurance against an event whose probability is low but whose consequence, when it happens, is large.
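The bounding argument reduces to a one-line calculation. A toy model, not data: assume one platform-specific exploit takes every host of the matching platform and nothing else.

    # Toy model: worst-case hosts lost to a single platform-specific
    # exploit is the size of the largest homogeneous segment.
    def worst_case_loss(segment_sizes):
        return max(segment_sizes)

    print(worst_case_loss([30]))        # monoculture: 30 of 30 hosts
    print(worst_case_loss([15, 15]))    # half-and-half: at most 15 of 30
    print(worst_case_loss([22, 6, 2]))  # the dominant platform sets the bound

The model says nothing about how often incidents happen; it only caps what a single incident can take, which is exactly the insurance framing.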
What the consultancy is doing now
The consultancy's response, in addition to the immediate clean-up:
Adding Linux desktops for some staff. Probably a quarter of the desktops over the next year. Not because Linux is intrinsically more secure, but because diversity is intrinsically protective.
Segmenting the network. Three subnets — administration, development, general office — with a firewall between them and explicit allow rules for what crosses.
Centralised monitoring, collecting the gateway logs and host-level events at a central server. The investment is modest; the visibility improvement is substantial.
Patching as a managed process, not a host-by-host responsibility. A central server tracks which patches are deployed where and produces reports on coverage gaps, along the lines of the sketch after this list.
Off-host backup of every workstation, not just servers. The compromise demonstrated that workstation data was as important as server data.
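As an illustration of the patch-coverage reporting (the inventory structure and the patch names are hypothetical, not the consultancy's actual tooling):

    # Sketch: report which hosts are missing which required patches,
    # from a central inventory. The data structures are illustrative.
    REQUIRED = {"MS-patch-A", "MS-patch-B", "MS-patch-C"}

    inventory = {
        "desktop-01": {"MS-patch-A", "MS-patch-B", "MS-patch-C"},
        "desktop-02": {"MS-patch-A"},
        "fileserver": {"MS-patch-A", "MS-patch-B"},
    }

    def coverage_gaps(required, inventory):
        """Map each host to the required patches it is missing."""
        return {host: sorted(required - applied)
                for host, applied in inventory.items()
                if required - applied}

    for host, missing in sorted(coverage_gaps(REQUIRED, inventory).items()):
        print("%s missing: %s" % (host, ", ".join(missing)))

The useful output is the gap list, not the percentage: two-thirds coverage sounded acceptable until the remaining third carried the network.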
None of this is dramatic. All of it is the unglamorous structural improvement that follows incidents. The cost is real but is amortised over years; the cost of not doing it would be paid the next time something similar happens.
What I am taking from this
For my own work and for the small organisations I help:
Recommend diversity as default architecture. Not 100% diversity (which has its own costs), but a conscious choice to have at least one piece of every layer be different from the dominant choice. A Linux gateway in front of a Windows network. A non-Windows host in the file-server tier. A few non-default workstations.
Recommend segmentation as default architecture. Even crude segmentation — per-department subnets with allow-listed traffic between them — is a substantial structural improvement over a flat network. A sketch of how small the allow rules can be follows this list.
Recommend off-host monitoring as default. Visibility into what is happening, by people who are looking at it, is the difference between a one-host incident and a network-wide incident. The cost of monitoring is modest; the cost of not monitoring is potentially the entire network.
Treat the non-incident period as the time to make these investments. During an incident, time is the scarcest resource. The investments only pay off if they are made before the incident.
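To show how small the explicit allow rules in a crude segmentation can be, here is the three-subnet plan from above expressed as a table and a default-deny check. The services and rules are illustrative; in practice the table compiles down to a handful of rules on the firewall between subnets.

    # Sketch: allow-listed traffic between subnets, default deny.
    # Subnet names follow the consultancy's plan; rules are illustrative.
    ALLOW = {
        ("office", "administration"):      {"file-sharing", "print"},
        ("development", "administration"): {"file-sharing"},
        ("administration", "office"):      {"backup"},
    }

    def permitted(src_subnet, dst_subnet, service):
        if src_subnet == dst_subnet:
            return True   # intra-subnet traffic is unchanged
        # Cross-subnet traffic needs an explicit allow rule.
        return service in ALLOW.get((src_subnet, dst_subnet), set())

    print(permitted("office", "administration", "print"))      # True
    print(permitted("office", "development", "file-sharing"))  # False: no rule

Under a policy like this, the day-2 SMB sweep would have died at the first subnet boundary instead of reaching every desktop.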
A hard lesson, paid for by the consultancy. Worth writing down so it does not have to be paid for again.