Honeypot annual recap: 2004

Annual recap of honeypot data from 2004, following the pattern from 2002 and earlier years. The data set is now substantial enough to support meaningful year-on-year comparison; the patterns across multiple years tell a clearer story than any individual year.

The annual shape

In rough numbers, comparing 2004 with 2003:

Total compromise attempts: roughly double the 2003 total.
Distinct source IPs per month: 30-40% increase.
Sebek captures of human-attacker activity: 14 sessions for 2004, comparable to 2003.
Mail-borne propagation attempts: roughly four times the 2003 level, dominated by MyDoom and successors.

The overall volume continues to grow. The mix is shifting; the structural elements are stabilising.

Quarterly breakdown

Q1 2004: Dominated by Bagle and MyDoom. The mass-mailing volume was extraordinary — roughly 4x the previous baseline. Other attack categories were roughly flat.

Q2 2004: Sasser reached saturation mid-quarter. The RPC-based scan traffic spiked, and the post-saturation tail continued through the rest of the year. Cabir appeared but had no honeypot impact (no Symbian devices in my range).

Q3 2004: Relatively quiet. Activity from various smaller worms. The cumulative residual scan rate stayed at an elevated baseline. The JPEG vulnerability advisory in September produced some scan-pattern shifts but no specific worm.

Q4 2004: Continued elevated baseline. The Bagle/MyDoom/Netsky war variants continued to produce mail volume. Phishing-related traffic grew measurably across the quarter.

Attack mix evolution

The mix continues shifting:

HTTP-targeted: 50% (down from 55% in 2003).
SMB/NetBIOS: 22% (up from 18%).
Mail-borne probes: 12% (up from 8%).
SSH brute-force: 8% (up from 7%).
Other: 8% (down from 12%).
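A share can fall while the absolute volume behind it still grows, so it helps to look at the shift in three ways: percentage points, relative change, and implied absolute growth. The sketch below uses only the shares listed above. The `total_growth=2.0` argument is an assumption taken from the "roughly 2x" total in the annual summary, applied uniformly, which is a simplification.

```python
# Attack-mix shares in percent, copied from the list above.
MIX_2003 = {"HTTP-targeted": 55, "SMB/NetBIOS": 18, "Mail-borne probes": 8,
            "SSH brute-force": 7, "Other": 12}
MIX_2004 = {"HTTP-targeted": 50, "SMB/NetBIOS": 22, "Mail-borne probes": 12,
            "SSH brute-force": 8, "Other": 8}


def mix_shift(old, new):
    """Return {category: (point_change, relative_change)} between two mixes."""
    return {
        category: (new[category] - old_share,
                   (new[category] - old_share) / old_share)
        for category, old_share in old.items()
    }


def absolute_growth(old_share, new_share, total_growth):
    """Growth factor of one category's absolute volume, given total growth."""
    return (new_share * total_growth) / old_share


if __name__ == "__main__":
    for category, (points, relative) in mix_shift(MIX_2003, MIX_2004).items():
        print(f"{category:18s} {points:+3d} points ({relative:+.0%} relative)")
    # Assumption: total attempts roughly doubled, per the annual summary.
    http = absolute_growth(MIX_2003["HTTP-targeted"], MIX_2004["HTTP-targeted"],
                           total_growth=2.0)
    print(f"HTTP absolute volume: about {http:.1f}x")
```

Read this way, the HTTP share dropped five points, but if the total really did double, HTTP-targeted attempts still grew in absolute terms.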

The HTTP decrease reflects, I believe, progress under Microsoft's Trustworthy Computing initiative: IIS 6's restrictive defaults reduce the deployed vulnerable population. The SMB increase reflects the Sasser-style and Blaster-style worm trajectory; SMB is now a primary target.

The mail-borne increase is the MyDoom effect. Mass-mailing volume is dramatically higher than in previous years.

The SSH brute-force increase is concerning but not yet dramatic. Specific patterns I observe include: distributed brute-force from many sources targeting the same destination, sustained low-rate attempts from individual sources, and credential-list-based attacks rather than dictionary attacks. The professionalisation visible in phishing is starting to be visible in SSH brute-force.
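The first two patterns (many sources against one destination, and sustained low-rate attempts from one source) can be separated mechanically from failed-login records. The following is a minimal sketch, not the analysis pipeline used for this recap: the `(unix_ts, src_ip, dst_ip)` record shape and all thresholds are hypothetical and would need tuning against real data.

```python
from collections import defaultdict


def classify_ssh_attempts(attempts, min_sources=5,
                          max_rate_per_hour=6.0, min_span_hours=12.0):
    """Split failed SSH logins into two rough patterns.

    attempts: iterable of (unix_ts, src_ip, dst_ip) tuples (hypothetical shape).
    Returns (distributed, low_and_slow):
      distributed  -> {dst_ip: number of distinct sources}, many-to-one targets
      low_and_slow -> {src_ip: attempts per hour}, sustained low-rate sources
    """
    sources_by_dst = defaultdict(set)
    times_by_src = defaultdict(list)
    for ts, src, dst in attempts:
        sources_by_dst[dst].add(src)
        times_by_src[src].append(ts)

    # Destinations hit by at least min_sources distinct sources.
    distributed = {dst: len(srcs) for dst, srcs in sources_by_dst.items()
                   if len(srcs) >= min_sources}

    # Sources active over a long span but at a low hourly rate.
    low_and_slow = {}
    for src, times in times_by_src.items():
        span_hours = (max(times) - min(times)) / 3600.0
        if span_hours < min_span_hours:
            continue
        rate = len(times) / span_hours
        if rate <= max_rate_per_hour:
            low_and_slow[src] = rate
    return distributed, low_and_slow
```

Telling credential-list attacks from dictionary attacks needs the attempted usernames and passwords as well, which this sketch does not look at.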

Sebek captures

The high-interaction host caught 14 sessions of human-attacker activity in 2004. The breakdown:

8 enumerate-and-leave (57%, similar to historical baseline).
4 attempted persistent backdoor installation (29%, up from 20% in 2003).
1 attempted credential exfiltration.
1 unusual or specialised activity I am still analysing.

The persistent-backdoor proportion has grown. This is consistent with the broader observation that compromised hosts are being treated as strategic assets — attackers who get a foothold are more likely to invest in keeping it.

The specific persistent-backdoor techniques observed include: kernel rootkit installation (one capture, careful and methodical, similar to my 2002 careful-attacker capture), userland backdoor with cron persistence (two captures), and a clever variant where the attacker modified an existing system binary to add backdoor functionality (one capture, the unusual activity I am still analysing).

Specific notable captures

Three captures from 2004 that I will write up more fully in subsequent posts.

The methodical kernel-rootkit deployer. A skilled attacker who spent two hours over three sessions carefully deploying a kernel rootkit. The Sebek capture has the full source code of their tools; the techniques are sophisticated. This may be the same attacker pattern I have seen before; the timing patterns suggest so, but I cannot be certain.

The credential-harvesting session. A different attacker who, on gaining shell access, immediately ran a script that harvested every email address, every cached credential, every plausibly interesting file from the host. The harvesting was automated; the data was packaged into a tar archive; the exfiltration attempt failed (my outbound filtering blocked it). The shape of the harvesting suggests a connection to phishing operations rather than to malware authorship.

The unusual binary-modification capture. The attacker who modified an existing system binary rather than installing a new one. Specifically, they replaced /bin/login with a modified version that recorded passwords as users authenticated. The modification was detectable from outside the host (the binary's hash differed from the package version) but invisible to on-host inspection. The technique is generic; the specific implementation was clean.
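The off-host check amounts to hashing files from a trusted machine and comparing them with known-good digests. A minimal sketch, with assumptions stated: the manifest of expected digests (`known_good`) comes from the original packages, the host's filesystem is examined as a mounted image or copy under `root`, and SHA-256 is my choice of digest, not necessarily what was used for this capture.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_modified(root, known_good):
    """Compare files under root against a known-good manifest.

    known_good maps a relative path (e.g. "bin/login") to its expected
    SHA-256 hex digest. Returns a list of (path, expected, actual) tuples
    for every mismatch; actual is None when the file is missing.
    """
    mismatches = []
    for rel_path, expected in sorted(known_good.items()):
        target = Path(root) / rel_path
        actual = sha256_of(target) if target.is_file() else None
        if actual != expected:
            mismatches.append((rel_path, expected, actual))
    return mismatches
```

The comparison has to run on a system the attacker has not touched. Run on the compromised host itself, the same check would depend on tools and a kernel that can no longer be trusted.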

What this teaches

Three observations.

The threat-actor population is professionalising. The persistent-backdoor proportion is growing. The credential-harvesting techniques are mature. The binary-modification attack shows skilled tradecraft. The amateur-curious attacker population is being displaced by the commercial-and-skilled population.

The detection requirements are growing. On-host detection cannot find skilled attackers any more; off-host detection can. The Sebek-style and structured-log disciplines I have been advocating are no longer optional for serious operators.

The compromise is increasingly persistent. Hosts that get compromised tend to stay compromised. The cumulative population of compromised hosts grows year over year; the substrate for further attacks grows with it.

What I am doing differently in 2005

Three changes:

Expanding the honeypot range. From a /28 to a /27, which doubles the range from 16 to 32 addresses. The additional addresses provide more breadth for observation; the cost is modest.
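The size of the change is easy to check with Python's standard ipaddress module. The prefixes below come from the 192.0.2.0/24 documentation block and are placeholders, not my actual range.

```python
import ipaddress


def range_growth(old_cidr, new_cidr):
    """Return (old_size, new_size, added) for two CIDR blocks."""
    old = ipaddress.ip_network(old_cidr)
    new = ipaddress.ip_network(new_cidr)
    return old.num_addresses, new.num_addresses, new.num_addresses - old.num_addresses


if __name__ == "__main__":
    # Placeholder prefixes; each bit removed from the prefix length doubles the block.
    old_size, new_size, added = range_growth("192.0.2.0/28", "192.0.2.0/27")
    print(f"/28 holds {old_size} addresses, /27 holds {new_size} (+{added})")
```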

Adding two more Honeyd personas. Specifically: a Mac OS X server persona (the platform is increasingly on the radar) and an embedded-router persona (the home-router category is starting to be targeted).

Improving the analysis pipeline. The structured-log database is reaching the scale where simple queries are slow. I am investing in better indexing and pre-aggregation. The migration is bounded but not trivial.
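To make "indexing and pre-aggregation" concrete, here is a minimal sketch using SQLite from Python. The `events` table and its columns (`day`, `category`, `src_ip`) are hypothetical stand-ins; the real structured-log schema is not described here, and the actual database engine may differ.

```python
import sqlite3


def build_daily_rollup(conn):
    """Index the raw events and rebuild a small per-day summary table.

    Assumes a hypothetical table events(day TEXT, category TEXT, src_ip TEXT).
    Trend queries can then read daily_counts instead of scanning every event.
    """
    conn.executescript("""
        CREATE INDEX IF NOT EXISTS idx_events_day_cat
            ON events(day, category);
        DROP TABLE IF EXISTS daily_counts;
        CREATE TABLE daily_counts AS
            SELECT day,
                   category,
                   COUNT(*) AS attempts,
                   COUNT(DISTINCT src_ip) AS distinct_sources
            FROM events
            GROUP BY day, category;
        CREATE INDEX idx_daily_counts_day ON daily_counts(day);
    """)
    conn.commit()
```

The rollup trades freshness for speed. It has to be rebuilt, or incrementally updated, as new events arrive, which is part of why the migration is not trivial.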

The broader context

Four years of honeypot data now. The cumulative archive is substantial; the patterns visible across years are more meaningful than any single year's data.

The most striking single observation from the cumulative data: the baseline attack volume keeps stepping up. Each major worm event contributes a residual that does not fully disappear. The baseline of 2005 is roughly 4x the baseline of 2001; the baseline of 2006 will probably be 5-6x.

This matters operationally. Defenders who tuned their filters and alerts to a 2002 baseline are now seeing many more events than they tuned for. The alerting volume grows; the false-positive cost grows; the operator attention required grows. Defensive infrastructure that has not been re-tuned for the current baseline produces more noise than signal.
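One way to re-tune is to derive the alert threshold from a trailing window of recent daily counts instead of a fixed historical number. The sketch below uses median plus a multiple of the median absolute deviation; that statistic, the 30-day window and the multiplier `k` are all illustrative assumptions, not a recommendation drawn from the honeypot data.

```python
import statistics


def retuned_threshold(daily_counts, window=30, k=3.0):
    """Alert threshold from the most recent `window` days of event counts.

    Uses median + k * MAD (median absolute deviation), which is less
    sensitive to one-day worm spikes than mean + k * standard deviation.
    """
    recent = list(daily_counts[-window:])
    median = statistics.median(recent)
    mad = statistics.median([abs(count - median) for count in recent])
    return median + k * mad


if __name__ == "__main__":
    # Illustrative numbers only: an old baseline of ~100/day, a current one of ~400/day.
    history = [100] * 60 + [380, 400, 420] * 10
    print("threshold tuned on old data:", retuned_threshold(history[:60]))
    print("threshold tuned on recent data:", retuned_threshold(history))
```

With these made-up numbers, a threshold tuned on the old baseline fires on every single recent day, which is the noise-over-signal situation described above.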

What I expect for 2005 honeypot data

Three predictions:

Continued elevated baseline. Worm residuals from previous years will continue contributing. I expect a volume increase of perhaps 20-30% over 2004.

A new persistent worm. I expect at least one major worm in 2005 that adds to the persistent compromised-host population. The category is established; the next instance is overdue.

Increased brute-force traffic. SSH and other authentication brute-force will continue growing. The professionalisation visible in phishing and DDoS-for-hire is extending into credential-targeting.

More as the year develops.

A reflection on sustained observation

The honeypot has been running, with varying configurations, since late 1999. Roughly six years of continuous observation. The cumulative data is, by some distance, the longest-running observation set I have access to.

The specific value of long-running observation: trends become visible that are not visible in short windows. The shifting attack mix, the elevated baseline, the persistent compromise — all of these are visible only because the observation window is long.

For anyone considering starting a similar observation discipline: the value is in the duration. A honeypot run for six months produces snapshot data; a honeypot run for six years produces trend data. The trend data is meaningfully more valuable.

A more comprehensive view of the cumulative honeypot data

Let me extend this annual recap with a longer view across the cumulative honeypot data. The single-year summaries are useful; the multi-year patterns are more informative.

Six years of trends

The cumulative dataset now spans six years of continuous observation. The patterns visible across the full duration:

Volume continues to grow. Each year's baseline is higher than the previous year's. The growth is roughly exponential, with major-incident years adding persistent volume that does not fully decay.

The attack mix continues shifting. Different categories dominate different years. The early years were dominated by RPC and NetBIOS scans; the middle years by HTTP-targeted attacks; the recent years by mass-mailing and SMB.

The threat-actor population is professionalising. The proportion of automated attacks has grown; the proportion of curious-amateur attackers has shrunk; the commercial-cybercrime population is the dominant actor type.

The persistence problem is structural. Each major worm leaves residuals that contribute to subsequent years' baseline. The cumulative persistent population grows year over year.
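The "residual that does not fully decay" idea can be written down as a toy model: organic growth compounds each year, and every major incident adds a peak that decays geometrically but never below a floor. Every parameter here (growth rate, decay, floor, peak sizes) is hypothetical and chosen only to show the shape. None of it is fitted to the honeypot data.

```python
def simulate_baseline(years, start_level, organic_growth, incidents,
                      decay=0.5, floor=0.2):
    """Toy baseline model: compounding growth plus incident residuals.

    years          -- iterable of consecutive years to simulate
    start_level    -- organic volume in the first year
    organic_growth -- yearly multiplier on the organic volume
    incidents      -- {year: peak added volume} for major worm events
    decay, floor   -- a residual shrinks by `decay` per year, but never
                      below `floor` times its original peak
    """
    levels = {}
    organic = start_level
    for year in years:
        residual = 0.0
        for event_year, peak in incidents.items():
            age = year - event_year
            if age >= 0:
                residual += peak * max(floor, decay ** age)
        levels[year] = organic + residual
        organic *= organic_growth
    return levels


if __name__ == "__main__":
    # One hypothetical incident in 2001 with flat organic volume:
    # the level steps up and never returns to 100.
    for year, level in simulate_baseline(range(2001, 2006), 100.0, 1.0,
                                         {2001: 100.0}).items():
        print(year, round(level, 1))
```

Even with zero organic growth, the floor term keeps the level above where it started, and each further incident stacks another floor on top.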

What the data shows about specific incidents

The major-incident periods are clearly visible in the data:

Q3 2001: Code Red and Nimda — a substantial spike that took months to subside.
Q1 2003: SQL Slammer — a brief but extreme spike.
Q3 2003: Blaster, Welchia, Sobig.F — sustained elevated activity for the quarter.
Q1 2004: MyDoom — substantial mass-mailing volume.
Q2 2004: Sasser — RPC-targeted spike.
Q3 2005: Zotob — modest spike, well below historical major-incident levels.

The magnitude of major incidents is, intriguingly, decreasing. The years 2001-2004 had multiple very-large-scale events; 2005 has had none of comparable magnitude.

Why the major-incident magnitude is decreasing

Three hypotheses, each partially supported:

The defensive baseline has improved. Mature filtering, faster patching, better-default operating systems. Microsoft Trustworthy Computing has produced measurable improvements in the deployed-base security posture. Worms that worked in 2001 do not work as well in 2005 against the upgraded population.

The threat infrastructure has shifted toward more specific targeting. Mass-propagating worms with broad impact are less profitable than targeted attacks against specific high-value populations. The threat actors have followed the economics; broad worms are less common.

The compromised-host substrate is large enough that new compromises are less needed. Cumulative compromise from previous years has produced a substantial pool of compromised hosts. New compromises are still happening but the marginal value of each is lower; the economic incentive to launch large-scale broad worms is reduced.

All three are probably contributing. The relative weights are uncertain.

The Sebek capture archive

The high-interaction Sebek captures have accumulated to roughly 60 sessions across the duration. The cumulative archive is, by my assessment, the most valuable single output of the honeypot — these are detailed records of human-attacker activity that are essentially impossible to obtain through other means.

The Sebek captures group into recognisable categories:

The script-kiddie pattern (~50% of captures). Automated tooling, predictable command sequences, fail-and-leave behaviour. Operationally low-value but informative for baseline.

The moderate-skill enumerator (~25%). Manual investigation, careful enumeration, attempt to deploy tools, fail at outbound filtering, leave. The bulk of the hands-on, manual activity.

The careful-attacker pattern (~15%). Multiple sessions, surgical deployment, specific tradecraft. The most operationally valuable captures; relatively rare.

The commercial-cybercrime pattern (~10%). Specific intent (spam relay, credential harvesting, phishing infrastructure). Growing fraction of recent captures.

The pattern distribution has shifted over time. Early captures were dominated by amateur-curious attackers; recent captures are dominated by commercial-cybercrime patterns.

What I will do with the cumulative data

Three specific things over the next year:

A retrospective paper for the Honeynet Project. The cumulative six-year dataset is unusual; the paper would benefit the broader research community.

Continued contribution to the project's cross-operator analyses. My data is one input; the cumulative cross-operator data is more valuable than any individual operator's.

Continued public writing about specific captures. Sanitised case studies of interesting individual captures continue to be valuable for readers.
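One small piece of sanitising is replacing IP addresses with stable pseudonyms before a transcript is published, so that readers can still see when the same source reappears. The sketch below shows one possible approach with a keyed hash. It is an assumption about how this could be done, not a description of my actual sanitising process, and it handles only dotted-quad IPv4 addresses.

```python
import hashlib
import hmac
import re

# Matches dotted-quad IPv4 addresses; it does not validate octet ranges.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymise_ips(text, secret_key):
    """Replace each IPv4 address with a stable keyed pseudonym.

    The same address always maps to the same token under the same key,
    so patterns survive, but the mapping cannot be rebuilt without the key.
    secret_key must be bytes and should be kept private.
    """
    def replace(match):
        digest = hmac.new(secret_key, match.group(0).encode("ascii"),
                          hashlib.sha256).hexdigest()
        return f"ip-{digest[:10]}"

    return IPV4_PATTERN.sub(replace, text)
```

Hostnames, usernames, email addresses and captured credentials need the same care and are not covered here.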

What I would tell another operator considering similar work

The value of long-running honeypot observation compounds with duration. A short-duration deployment produces snapshot data; a long-duration deployment produces trend data; trend data is meaningfully more valuable.

The operational cost is bounded. After initial setup, maintenance is roughly an hour per week. The cumulative time investment over years is modest.

The community contribution is real. Honeypot operators who share their data (in sanitised form) contribute to a research substrate that benefits everyone.

For anyone who has been considering setting up a honeypot but has not: the Honeyd-based deployment is approachable. The barrier is lower than it was a few years ago. The benefits compound over years.

More as 2005 wraps up and 2006 begins.

A reflection on data ownership and contribution

The cumulative honeypot data raises questions about ownership and contribution that are worth thinking through.

The data is, in the strict sense, mine. I own the infrastructure; I designed the deployment; I do the analysis; I publish the results. The legal ownership is unambiguous.

The data is also, in some real sense, not mine. The activity captured is generated by external actors; the value of the data depends on it being shared with the broader research community; the analysis benefits from cross-operator comparison.

The right framing, I have come to think, is stewardship. I steward the data on behalf of the broader research community; the public outputs (sanitised case studies, contributions to Honeynet Project papers) are the realisation of that stewardship.

For anyone considering similar work: the stewardship framing is probably the right one. The legal ownership is not the operationally important property; the contribution to the broader field is.

The specific things I do with the data:

Public writing. Sanitised case studies through this notebook. The public benefit is real; the operational cost is bounded.

Contributions to research. The Honeynet Project's cumulative analyses include contributions from my range. The collective value of multi-operator data is much greater than any single operator's data.

Personal calibration. The data informs my own predictions and operational choices. The personal value is real but secondary to the public-and-research value.

Archive for future analysis. The cumulative data may be valuable for analyses I have not yet conceived. Maintaining the archive in usable form is a long-term commitment.

The stewardship is sustainable. The discipline continues.

More as the cumulative archive grows.

A note on cumulative observation as research

Let me close this annual recap with a brief reflection on cumulative observation as a research methodology.

The security research community does too much short-window analysis and too little long-window analysis. Most published research is based on weeks or months of data; comparatively little is based on years.

This is structurally unfortunate. Many of the most interesting security patterns are visible only over multi-year windows. Threat-actor population shifts, defensive maturity progression, the persistence of compromised hosts across worm generations — none of these are visible in weeks-of-data studies.

The reasons for the bias toward short-window research are organisational. Academic researchers face publication pressures that favour faster results. Commercial researchers face product cycles that favour shorter experiments. Independent researchers (like me) face attention spans that favour current events.

The long-window research that does exist is therefore disproportionately valuable. The Honeynet Project's cumulative-analysis paper is one example; my own honeypot's six-year archive is a smaller one; operational studies at large enterprises that accumulate data over years are another.

For my own contribution: I will continue to write about the cumulative patterns over time. The specific incident write-ups continue; the cumulative reflections accumulate; the cross-year comparisons are increasingly the most interesting writing.

For anyone considering similar work: the long-window discipline is undervalued. The investment is bounded; the cumulative output is uniquely valuable.

More as the years accumulate.