Honeypot annual recap: 2001

A full-year recap of honeypot data from 2001. The patterns across the year tell a clearer story than any single quarter did.

The data: about 10,000 distinct compromise attempts, ~80,000 distinct source IPs, and about 50 human-attacker sessions captured by Sebek.

The annual shape

Four distinct phases are visible in the data.

Q1 (January-March): baseline activity. Steady scan rate, occasional Linux-targeted activity around Ramen and Lion. About 15,000 distinct sources for the quarter.

Q2 (April-June): gradual increase. The build-up to Code Red. About 20,000 distinct sources.

Q3 (July-September): explosion. Code Red I and II saturate the internet, and Nimda multiplies the traffic further. About 50,000 distinct sources for the quarter.

Q4 (October-December): elevated baseline. The Code Red and Nimda residuals continue to scan. About 35,000 distinct sources.

The Q3 peak is the dominant feature. The post-peak baseline is roughly twice the pre-peak baseline: about 35,000 distinct sources in Q4 against 15,000 to 20,000 in Q1 and Q2. The internet's attack-noise floor stepped up during 2001 and has not stepped back down.
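The "roughly twice" figure can be checked directly from the quarterly counts. This is a minimal sketch of that arithmetic; treating the Q1/Q2 average as the pre-peak baseline is an assumption of the sketch, and the variable names are illustrative only.

```python
# Distinct source counts per quarter, as quoted in the recap above.
quarterly_sources = {"Q1": 15_000, "Q2": 20_000, "Q3": 50_000, "Q4": 35_000}
annual_distinct = 80_000  # distinct source IPs for the whole year

# ASSUMPTION: the pre-peak baseline is taken as the mean of Q1 and Q2.
pre_peak = (quarterly_sources["Q1"] + quarterly_sources["Q2"]) / 2
step_up = quarterly_sources["Q4"] / pre_peak

# The quarterly counts overlap: a source active in two quarters is counted twice.
quarter_sum = sum(quarterly_sources.values())
repeat_appearances = quarter_sum - annual_distinct

print(f"pre-peak baseline:  {pre_peak:,.0f} sources per quarter")
print(f"Q4 / pre-peak:      {step_up:.2f}x")
print(f"sum of quarters:    {quarter_sum:,}")
print(f"repeat appearances: {repeat_appearances:,}")
```

The quarterly counts sum to more than the annual distinct total, which is expected: a source seen in more than one quarter appears once in each quarterly count but only once in the annual figure.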

The attack mix shift

Comparing Q1 with Q4:

HTTP-targeted attacks: 20% → 60%
NetBIOS/SMB scans: 25% → 15%
SSH brute-force: 10% → 8%
Outlook/mail-borne: 1% → 5%
Other (DNS, FTP, RPC, etc): 44% → 12%

HTTP-targeted attacks dominate the post-Code-Red landscape. The shift is structural: the worm-kit infrastructure is now optimised for HTTP exploitation, and future worms are likely to reuse the same approach.

The "other" category has shrunk dramatically in proportional terms. Most of the absolute volume is unchanged, but the dominance of HTTP attacks has reduced everything else's share.
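The difference between a shrinking share and a shrinking volume is easy to see with a rough calculation. The sketch below converts the Q1 and Q4 shares into implied counts. It assumes, purely for illustration, that the distinct-source count for each quarter can stand in for total attack volume; the recap does not give per-category volumes, so the outputs are indicative only.

```python
# Attack-mix shares in percent (Q1, Q4), taken from the comparison above.
mix = {
    "HTTP-targeted": (20, 60),
    "NetBIOS/SMB": (25, 15),
    "SSH brute-force": (10, 8),
    "Outlook/mail-borne": (1, 5),
    "Other": (44, 12),
}

# ASSUMPTION: distinct sources per quarter are used as a proxy for attack volume.
Q1_TOTAL, Q4_TOTAL = 15_000, 35_000


def implied_volumes(shares, q1_total, q4_total):
    """Convert percentage shares into implied absolute counts per category."""
    return {
        name: (q1_total * q1_pct // 100, q4_total * q4_pct // 100)
        for name, (q1_pct, q4_pct) in shares.items()
    }


volumes = implied_volumes(mix, Q1_TOTAL, Q4_TOTAL)
for name, (q1, q4) in volumes.items():
    print(f"{name:20s} Q1 {q1:>6,}  Q4 {q4:>6,}")
```

Under this proxy the "other" category keeps most of its absolute volume even though its share falls from 44% to 12%, because the total it is a share of has more than doubled.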

The Sebek captures

For the year, the high-interaction host caught approximately 50 sessions of human-attacker activity. The breakdown:

30 enumerate-and-leave (60%)
10 attempted persistent backdoor (20%)
5 attempted use as stepping stone (10%)
5 unusual or specialised activity (10%)

The distribution is consistent with my Q3 summary. The careful-attacker proportion remains small but steady. Outbound filtering continues to disrupt most attempts.

What I learned about specific patterns

Three generalisations from the year:

Saturation curves match the model. The worm-propagation arithmetic I sketched fits the data at the order-of-magnitude level: the actual saturation times were within a factor of 2 of the model's predictions.
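For readers who want to try a comparison of this kind, here is a minimal sketch using the standard logistic (simple epidemic) model, da/dt = k * a * (1 - a), where a is the infected fraction of the vulnerable population. This is a stand-in, not necessarily the arithmetic referred to above, and every parameter value here is hypothetical rather than fitted to the honeypot data.

```python
import math


def infected_fraction(t_hours, k, a0):
    """Closed-form logistic solution of da/dt = k * a * (1 - a), with a(0) = a0."""
    growth = a0 * math.exp(k * t_hours)
    return growth / (1.0 - a0 + growth)


def time_to_fraction(target, k, a0):
    """Hours until the infected fraction reaches target (0 < a0 < target < 1)."""
    return math.log(target * (1.0 - a0) / (a0 * (1.0 - target))) / k


def within_factor(observed, predicted, factor=2.0):
    """True if observed lies between predicted / factor and predicted * factor."""
    return predicted / factor <= observed <= predicted * factor


# HYPOTHETICAL parameters, chosen only to make the example concrete.
VULNERABLE_HOSTS = 100_000   # size of the vulnerable population
K_PER_HOUR = 1.0             # new compromises per infected host per hour, early on
a0 = 1 / VULNERABLE_HOSTS    # a single initially infected host

t90 = time_to_fraction(0.9, K_PER_HOUR, a0)
print(f"predicted time to 90% saturation: {t90:.1f} hours")

# HYPOTHETICAL observed saturation time, for the factor-of-2 check.
observed_hours = 20.0
print("within a factor of 2:", within_factor(observed_hours, t90))
```

The closed form makes the sensitivity visible: the saturation time scales as 1/k and only logarithmically with the population size, which is why an order-of-magnitude estimate of the scan rate is enough to get within a small factor of the observed time.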

The compromised-host population is increasingly persistent. Worms with persistence (Code Red II, Nimda) produce compromise populations that survive for months. The substrate for future attacks is sticky.

Honeypot data is genuinely useful. The captures, the patterns, the distributional shifts all give me information I cannot get any other way. The honeypot continues to be the single most valuable component of my defensive stack for learning, even if it is not the most defensive component.

What I will do differently in 2002

Three things.

Add a second high-interaction host behind another Honeyd persona. The single high-interaction host limits what I can capture; two would let me run different OS personas and compare the results.

Improve the Sebek deployment. The current Sebek setup is rough, and the captures are sometimes incomplete. Updating to the newer Sebek release should improve reliability.

Contribute more to the Honeynet Project. I have been a passive participant; the data I have is worth contributing actively. The cumulative-analysis paper I have been promising is overdue.

What 2002 should look like

Three predictions:

Continued elevated baseline. Code Red and Nimda residuals will continue scanning into and through 2002. The activity floor will stay high.

More multivector worms. The Nimda template will be reused. Future worms will combine multiple propagation vectors by default.

Continued HTTP dominance. Until the structural defences (better IIS defaults, URLScan ubiquity, better web-application security) catch up, HTTP-targeted attacks will continue to dominate.

More as the year develops.