Distributed scanning, observed at length

I have been keeping detailed records of scan traffic against my honeypot for three months. The dataset is now large enough — about 3,200 distinct scanning sources — to draw some preliminary conclusions about how distributed scanning actually operates in 2000.

This is a different question from distributed denial of service. DDoS uses many sources to attack one target. Distributed scanning uses many sources to survey many targets. The patterns are different and the defensive implications are different.

The pattern in the data

The most striking finding: scanning sources do not distribute uniformly across the address space. They cluster.

A scanner from IP X often scans a specific subset of my honeypot IP space. A scanner from IP Y scans a different subset. If you plot which IPs were scanned by which sources, the structure is not random — sources are scanning coordinated portions of the target space.
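The grouping step can be sketched in a few lines of Python. The log format here is hypothetical: I assume each scan event reduces to a (source IP, target IP) pair.

```python
# Sketch: group scan events by source and compare the subsets of
# honeypot IPs each source touched. The event tuples are hypothetical.
from collections import defaultdict

def targets_by_source(events):
    """events: iterable of (source_ip, target_ip) pairs."""
    seen = defaultdict(set)
    for source, target in events:
        seen[source].add(target)
    return seen

events = [
    ("10.0.0.1", "192.0.2.1"), ("10.0.0.1", "192.0.2.2"),
    ("10.0.0.2", "192.0.2.3"), ("10.0.0.2", "192.0.2.4"),
]
subsets = targets_by_source(events)
# Independent scanners tend to overlap; coordinated ones partition.
print(subsets["10.0.0.1"] & subsets["10.0.0.2"])  # empty intersection here
```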

This has been particularly visible in two scan campaigns I have tracked:

Campaign A (April-June): about 200 distinct sources, hitting my honeypot's IPs in carefully non-overlapping ranges. Each source scanned a small fraction of the address space; collectively they covered the whole space; no single source scanned more than ~5% of my honeypot space. Each source's footprint was small enough not to look anomalous in isolation.

Campaign B (May-July): about 400 distinct sources, a similar non-overlapping pattern, but targeting only port 80 (HTTP) rather than the broader port sweep Campaign A used. The sources were disjoint from Campaign A's.

Both campaigns show coordination: independent scanners choosing targets on their own would overlap, so a cleanly partitioned target space is essentially impossible to produce by accident.
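The "essentially impossible by accident" claim can be made quantitative. A minimal sketch, assuming each source's target set is available: near-zero mean pairwise overlap combined with near-total collective coverage is the coordination signature.

```python
# Sketch: quantify coordination as mean pairwise Jaccard overlap
# between the target sets of each source. Toy data, not real logs.
from itertools import combinations

def mean_pairwise_jaccard(sets):
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)

# Three sources covering a toy address space with no overlap at all:
coordinated = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
print(mean_pairwise_jaccard(coordinated))  # 0.0 -- the coordinated signature
```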

What is producing this

The most likely explanation is botnet-coordinated scanning. A central controller maintains a list of address ranges to scan and assigns ranges to a population of compromised hosts. Each compromised host scans its assigned range and reports back. The controller aggregates the results.
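The non-overlap falls out of this architecture naturally. A purely illustrative sketch of the controller's assignment step: handing each agent a contiguous slice of the target range produces exactly the partitioned footprint observed.

```python
# Illustrative sketch of controller-side range assignment. Nothing here
# is taken from a real tool; it only shows why agents' footprints
# partition the target space rather than overlapping.
def assign_ranges(start, end, n_agents):
    """Split [start, end) into n_agents contiguous, non-overlapping slices."""
    size = end - start
    chunk = size // n_agents
    slices = []
    for i in range(n_agents):
        lo = start + i * chunk
        hi = end if i == n_agents - 1 else lo + chunk
        slices.append((lo, hi))
    return slices

slices = assign_ranges(0, 256, 4)  # a /24-sized space split among 4 agents
print(slices)  # [(0, 64), (64, 128), (128, 192), (192, 256)]
```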

This is structurally similar to Stacheldraht-style DDoS architecture — central master, distributed agents, coordinated activity — but used for surveillance rather than attack. The same compromised-host populations that produce DDoS attacks are also producing scan campaigns.

The operational advantage to the attacker is significant. A single scanner-IP that surveys the whole internet is loud and easily blocked. A coordinated swarm of scanners, each touching only a small fraction of the target space, is quiet enough to mostly evade per-source detection.

What the campaigns are looking for

Campaign A's port distribution suggests broad-spectrum reconnaissance: TCP 21 (FTP), 22 (SSH), 23 (Telnet), 25 (SMTP), 53 (DNS), 80 (HTTP), 110 (POP3), 111 (RPC portmap), 139 (NetBIOS), 443 (HTTPS), 445 (SMB), 1433 (MSSQL), and a handful of others. This is the standard "every interesting service" sweep.

Campaign B's targeted port-80 scanning is more interesting. The sources are hitting only HTTP, with a mix of HEAD and GET requests. Several requests appear to be looking for specific known-vulnerable web applications — particular CGI scripts, particular CMS endpoints, signs of known-exploitable framework versions.

This is targeted reconnaissance: scanning for specific vulnerabilities the attacker can exploit. The attacker is presumably building a list of vulnerable hosts they can later attack from a different (and more privileged) source.

What this implies about defenders

A few things, in increasing order of consequence.

Per-source rate limits are insufficient. A scanner doing 100 requests over an hour from a single source is unlikely to trip any rate limit. A coordinated swarm whose members each do 100 requests per hour, covering the target space collectively, is essentially invisible to per-source heuristics. Detection requires aggregate analysis across many sources.
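The aggregate analysis in question can be very simple. A sketch, again assuming a hypothetical event format: count distinct sources per target port over a time window, which surfaces the swarm even though every individual source stays under any sane rate limit.

```python
# Sketch: per-source heuristics miss the swarm, but counting distinct
# sources per target port over a window exposes it. Event fields are
# hypothetical (timestamp, source IP, target port).
from collections import defaultdict

def distinct_sources_per_port(events, window_start, window_end):
    sources = defaultdict(set)
    for ts, src, port in events:
        if window_start <= ts < window_end:
            sources[port].add(src)
    return {port: len(srcs) for port, srcs in sources.items()}

events = [(10, "a", 80), (20, "b", 80), (30, "c", 80), (40, "a", 22)]
print(distinct_sources_per_port(events, 0, 60))  # {80: 3, 22: 1}
```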

The compromised-host pool is the structural problem. Each scanning source in a coordinated campaign is a host that someone else owns and that has been compromised. Reducing the pool of compromised hosts is the only way to reduce the supply of coordinated-scan capacity. It is the same structural observation I have been making about DDoS and worm propagation.

Your IP shows up in someone's database within hours of being assigned. New hosts on the public internet are scanned, on the available evidence, within hours. Whoever is operating these scan campaigns has comprehensive databases of what is at every IP. New deployments do not get a grace period.

The defensive shift is from prevention to detection. Preventing your IP from being scanned is essentially impossible. Detecting that you have been scanned, and what specifically the scanner was looking for, is feasible. This is exactly what the honeypot is for. Operators who are not looking at scan logs are missing the leading indicator of attacker interest.

What I am doing about it

For my own infrastructure: I have updated my structured-log analysis to do per-target-port aggregation across sources. The query "how many distinct sources hit port 80 with HEAD requests in the past 24 hours" is now a one-liner. Trends in this metric are leading indicators of campaigns I should be aware of.
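The query described above looks roughly like this. The log record layout is an assumption on my part (a list of dicts with `ts`, `src`, `port`, `method` fields), not the actual schema.

```python
# Sketch of the aggregation described above, against a hypothetical
# structured scan log: distinct sources hitting port 80 with HEAD
# requests in a 24-hour window.
def distinct_head_sources(log, now, port=80, window=24 * 3600):
    return len({
        e["src"] for e in log
        if e["port"] == port
        and e.get("method") == "HEAD"
        and now - e["ts"] <= window
    })

log = [
    {"ts": 100, "src": "10.0.0.1", "port": 80, "method": "HEAD"},
    {"ts": 200, "src": "10.0.0.2", "port": 80, "method": "GET"},
    {"ts": 300, "src": "10.0.0.3", "port": 80, "method": "HEAD"},
]
print(distinct_head_sources(log, now=1000))  # 2
```

Tracking this number day over day is the trend metric; a step change in distinct HEAD sources is what a new campaign looks like from one observation point.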

For my own writing: I am going to feed a sanitised version of this data into my honeypot writeups for the Honeynet Project. The aggregate patterns are the kind of intelligence that benefits from being shared.

For anyone reading this who runs internet-facing infrastructure: pay attention to the aggregate of scan attempts, not individual events. The shape of the aggregate is informative even when each individual scan is mundane.

A small architectural note

The coordinated-scanning observation has implications for my honeypot v2 architecture. The current design uses a single public IP. With coordinated scanning, the single-IP observation is an artefact: I see the slice of the campaign that touches one IP, not the whole campaign.

A more useful observation point would be a range of IPs — even a small /28 — observed simultaneously, with cross-IP correlation visible. This would show me the coordination pattern directly rather than inferring it from temporal patterns.
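With a block of IPs observed simultaneously, the coordination check becomes direct rather than inferential. A sketch with hypothetical data: test whether the sources' footprints partition the honeypot block, i.e. are mutually disjoint and jointly cover it.

```python
# Sketch: with a /28 observed at once, coordination shows up directly
# as a partition of the honeypot IPs among sources. Data is invented
# (RFC 5737 documentation addresses), purely to illustrate the check.
def partitions_cleanly(source_to_ips, all_ips):
    """True if footprints are pairwise disjoint and jointly cover all_ips."""
    seen = set()
    for ips in source_to_ips.values():
        if ips & seen:
            return False  # overlap: not the partitioned pattern
        seen |= ips
    return seen == all_ips

honeypot = {f"198.51.100.{i}" for i in range(16)}  # a /28-sized block
footprints = {
    "src-a": {f"198.51.100.{i}" for i in range(0, 8)},
    "src-b": {f"198.51.100.{i}" for i in range(8, 16)},
}
print(partitions_cleanly(footprints, honeypot))  # True
```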

The Honeynet Project's emerging tooling supports exactly this. I am thinking about whether I can get a small block of IPs added to my honeypot deployment. The cost is a slightly more complex network configuration; the benefit is the ability to see scan campaigns directly. It is a project for the second half of the year.

More on this as I gather more data.
