Building a small honeypot from scratch

After a weekend with the Deception Toolkit last summer, I decided to write my own. Not because DTK is in any way inadequate — it is not — but because the only way I really learn a tool is to build a primitive version of it myself.

This post is about that build. The design decisions, the things I changed my mind about, and the surprisingly small amount of code that turned out to be needed.

The design goal

I wanted something simpler than DTK in scope, but more aggressive in what it captured.

Specifically: emulate a small handful of services well enough to fool an automated scanner, but capture every byte of every interaction in a form I could later replay or analyse. DTK logs interactions; my version was going to log full packet captures plus the application-layer view in a single correlated record.

I wrote it in Perl because that is the language I was most fluent in at the time, and because the standard library includes IO::Socket, which makes the network glue trivial.

The core loop

The core of the honeypot is a forking listener. The parent waits on a socket. When a connection arrives, it forks; the child handles the connection and exits.

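use strict;
use warnings;
use IO::Socket;
use POSIX ':sys_wait_h';

my $PORT = 21;   # FTP; binding a port below 1024 needs root

# Reap every finished child. local() keeps the handler from clobbering
# $! and $? in whatever code it happened to interrupt.
$SIG{CHLD} = sub { local ($!, $?); while (waitpid(-1, WNOHANG) > 0) {} };

my $listener = IO::Socket::INET->new(
    LocalPort => $PORT,
    Listen    => 5,
    Reuse     => 1,
) or die "Could not bind: $!";

while (1) {
    my $client = $listener->accept;
    unless ($client) {
        next if $!{EINTR};          # accept was interrupted by SIGCHLD; retry
        die "accept failed: $!";
    }
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        $listener->close;           # child: serve this one connection, then exit
        handle($client);
        exit 0;
    }
    $client->close;                 # parent: the child owns the connection now
}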

The $SIG{CHLD} handler is the bit that bites you the first time. Without it, every finished child becomes a zombie and you eventually run out of process slots. The waitpid(-1, WNOHANG) loop reaps every finished child without blocking; it has to be a loop because several children can exit before the handler runs once.

The handler

Each service is a Perl module with a single handle($client) function. It reads from the socket, writes plausible responses, and logs everything to a structured logfile.

The FTP module looks roughly like this:

sub handle {
    my ($s) = @_;
    log_event($s, 'connect');
    print $s "220 ProFTPD Server (some bored sysadmin) [vfx]\r\n";
    while (my $line = <$s>) {
        $line =~ s/[\r\n]+$//;
        log_event($s, "in:$line");
        my $reply;
        if    ($line =~ /^USER (.+)/i) { $reply = "331 Password required for $1."; }
        elsif ($line =~ /^PASS\b/i)    { $reply = "530 Login incorrect."; }    # also matches a blank password
        elsif ($line =~ /^QUIT\b/i)    { $reply = "221 Goodbye."; }
        else                           { $reply = "500 Command not understood."; }
        print $s "$reply\r\n";
        log_event($s, 'out:' . substr($reply, 0, 3));   # log the reply code, e.g. out:331
        last if $reply =~ /^221/;                       # hang up after QUIT
    }
}

This is a bad FTP server. Deliberately. It accepts USER, refuses every PASS, honours QUIT, and rejects every other command. The point is to keep the attacker engaged for long enough to produce a useful log entry, not to fool a sophisticated client into thinking it is talking to a real ProFTPD.

More importantly: every line in and every line out is logged with a timestamp, the connecting IP, and the connection identifier.

The logging discipline

This is where I diverged most sharply from DTK.

DTK's log format is human-readable, designed to be skimmed. Mine is structured: one line per event, with a fixed set of fields. Something like:

1999-01-23T15:42:11Z 203.0.113.14:54312 ftp connect
1999-01-23T15:42:13Z 203.0.113.14:54312 ftp in:USER admin
1999-01-23T15:42:13Z 203.0.113.14:54312 ftp out:331
1999-01-23T15:42:14Z 203.0.113.14:54312 ftp in:PASS password
1999-01-23T15:42:14Z 203.0.113.14:54312 ftp out:530

This is grep-able, awk-able, and trivially parseable into any other format. The cost is verbosity. The benefit is that I can run the same log analysis idioms over the honeypot logs as I do over Apache's, with no special tooling.
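The handler calls log_event without showing it. A minimal sketch of what such a function could look like, assuming the four-field format above; $SERVICE and $LOG are hypothetical names for illustration, not taken from the real honeypot.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical globals: the real names are not shown in this post.
our $SERVICE = 'ftp';
our $LOG     = \*STDERR;

# Build one log line: UTC timestamp, peer ip:port, service, event.
sub format_event {
    my ($epoch, $peer_ip, $peer_port, $service, $event) = @_;
    # Replace control and non-ASCII bytes so attacker input cannot
    # break the one-line-per-event format.
    $event =~ s/[^\x20-\x7e]/?/g;
    my $ts = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
    return "$ts $peer_ip:$peer_port $service $event";
}

# Same calling convention as in the handler: log_event($socket, $event).
sub log_event {
    my ($s, $event) = @_;
    my $line = format_event(time, $s->peerhost, $s->peerport, $SERVICE, $event);
    print {$LOG} "$line\n";
}
```

Keeping the formatting separate from the socket lookup means the format can be checked without a live connection. The raw bytes that the sanitising step throws away are still in the packet capture.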

In parallel, the honeypot writes a tcpdump packet capture of every connection, in pcap format. The application log gives me the human view of what happened; the pcap gives me the bytes if I want to look closer.
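How the capture is started is not shown here. One possible approach, sketched under assumptions, is to launch a tcpdump process per connection with a filter on the peer address and port. The function names, the file naming scheme and the interface argument are all hypothetical. The tcpdump options used are -n (no name resolution), -U (flush each packet to the file), -i (interface) and -w (write raw packets to a file).

```perl
use strict;
use warnings;

# Build the tcpdump argument list for one connection.
sub capture_command {
    my ($iface, $dir, $peer_ip, $peer_port) = @_;
    my $file = "$dir/$peer_ip-$peer_port.pcap";   # assumed naming scheme
    return ('tcpdump', '-n', '-U', '-i', $iface, '-w', $file,
            'host', $peer_ip, 'and', 'port', $peer_port);
}

# Fork a capture process and return its pid. The caller should send it
# SIGTERM and reap it once the session is over.
sub start_capture {
    my @cmd = capture_command(@_);
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec @cmd or die "exec tcpdump failed: $!";
    }
    return $pid;
}
```

A capture started after accept has returned misses the TCP handshake. Running one long-lived capture and splitting it per connection afterwards is the alternative if those first packets matter.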

What it caught in the first week

I ran it on three ports — 21 (FTP), 23 (Telnet), 110 (POP3) — on a public IP for a week. I did not advertise the box. Total interesting catch:

134 FTP login attempts, almost all from automated scanners trying common usernames (admin, anonymous, ftp, root). Only two of them sent a non-blank password, which suggests the scanners are mostly looking for boxes still running with default credentials.
48 Telnet sessions, many of which immediately tried BREAK or other escape sequences before being logged out.
6 POP3 login attempts, all using the same username/password pair (admin/admin), which suggests a single source running through a list.

None of this is a serious threat. All of it is data I would not otherwise have had. And the patterns — particularly the fact that the FTP scanners do not even try real passwords, just defaults — tell me something useful about what defenders should worry about: weak default credentials are the killer attack surface, not novel exploits.
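Tallies like these fall out of the structured log with very little code. A sketch of one way to count attempted usernames per service, assuming the log format shown earlier; tally_users is a hypothetical helper, not part of the honeypot.

```perl
use strict;
use warnings;

# Count attempted usernames per service.
# Each line has four fields: timestamp, ip:port, service, event.
sub tally_users {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # Limit the split to 4 so the event keeps its embedded spaces.
        my (undef, undef, $service, $event) = split ' ', $line, 4;
        next unless defined $event && $event =~ /^in:USER\s+(\S+)/i;
        $count{$service}{lc $1}++;
    }
    return \%count;
}

# Usage, reading the log on standard input:
#   my $c = tally_users(<STDIN>);
#   printf "%s %s %d\n", 'ftp', $_, $c->{ftp}{$_} for sort keys %{ $c->{ftp} || {} };
```

This only sees services whose login command is USER, which covers FTP and POP3 but not Telnet.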

Things I would change

Three things, which I will probably do over the next month:

One, the service emulation is too thin. A real attacker would notice within seconds that this is not a real FTP server. Adding a few more commands, even just HELP and STAT, would make the deception last longer.
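A table of canned replies is probably the cheapest way to add commands. The sketch below assumes each line has already had its CRLF stripped, as in the handler. The reply codes are the standard FTP ones (214 for help, 211 for status, 215 for system type), but the reply text is made up for illustration and is not copied from a real ProFTPD.

```perl
use strict;
use warnings;

# Illustrative reply text only.
my %REPLY = (
    HELP => '214 Direct comments to root.',
    STAT => '211 Server status OK.',
    SYST => '215 UNIX Type: L8',
    NOOP => '200 NOOP command successful.',
    QUIT => '221 Goodbye.',
);

# Map one command line to a reply, without the trailing CRLF.
sub ftp_reply {
    my ($line) = @_;
    my ($verb, $arg) = $line =~ /^(\S+)\s*(.*)$/;
    return '500 Command not understood.' unless defined $verb;
    $verb = uc $verb;
    return "331 Password required for $arg." if $verb eq 'USER' && length $arg;
    return '530 Login incorrect.'            if $verb eq 'PASS';
    return exists $REPLY{$verb} ? $REPLY{$verb} : '500 Command not understood.';
}
```

With this in place the handler loop shrinks to reading a line, calling ftp_reply, printing the result and logging its first three characters. Adding a command becomes a one-line change to the table.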

Two, the connection limiting is naive. Every connection costs a fork, so a determined attacker could exhaust my process slots simply by opening hundreds of simultaneous connections. Rate-limiting per source IP is the fix, and the kernel firewall is the right place to do it.
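The firewall rules depend on the platform, so they are not sketched here. What follows is a different and weaker technique: an in-process cap on simultaneous connections per source IP, which could sit behind the firewall as a second line of defence. All names and the limit of 5 are assumptions.

```perl
use strict;
use warnings;

my $MAX_PER_IP = 5;        # assumed limit
my (%ip_of_pid, %live);    # child pid -> source IP, source IP -> live children

# True if this source already has too many live connections.
sub over_limit {
    my ($ip) = @_;
    return ($live{$ip} || 0) >= $MAX_PER_IP;
}

# Call in the parent after a successful fork.
sub note_child {
    my ($pid, $ip) = @_;
    $ip_of_pid{$pid} = $ip;
    $live{$ip}++;
}

# Call once for each pid that waitpid returns.
sub note_exit {
    my ($pid) = @_;
    my $ip = delete $ip_of_pid{$pid};
    return unless defined $ip;              # not a child we were tracking
    delete $live{$ip} if --$live{$ip} <= 0;
}
```

In the accept loop the parent would check over_limit($client->peerhost) before forking and close the socket if it returns true. One caveat: if the reaping happens in a signal handler, a child that exits very quickly can be reaped before note_child has run, which leaves its count stuck. Blocking SIGCHLD around the fork, or reaping from the main loop instead, avoids that.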

Three, I want to add an alert mechanism. Right now I have to read the log to know if anything interesting happened. A small daemon that watches the log and emails me when a session is unusually long or contains unusual content would close the loop.
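The detection half of such a daemon can be sketched as a pass over the log that groups lines by ip:port and flags sessions that ran long or produced many events. This assumes the log format shown earlier. The thresholds and function names are hypothetical, and both the log watching and the emailing are left out.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Parse the log's UTC timestamp into epoch seconds; undef if malformed.
sub parse_ts {
    my ($ts) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $ts =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
        or return;
    return timegm($s, $mi, $h, $d, $mo - 1, $y);
}

# Return the ip:port keys of sessions that lasted longer than $max_secs
# or produced more than $max_events log lines.
sub unusual_sessions {
    my ($lines, $max_secs, $max_events) = @_;
    my (%first, %last, %events);
    for my $line (@$lines) {
        my ($ts, $peer) = split ' ', $line;
        my $t = defined $peer ? parse_ts($ts) : undef;
        next unless defined $t;               # skip lines that do not parse
        $first{$peer} = $t unless exists $first{$peer};
        $last{$peer}  = $t;
        $events{$peer}++;
    }
    my @flagged = grep {
        $last{$_} - $first{$_} > $max_secs || $events{$_} > $max_events
    } sort keys %first;
    return @flagged;
}
```

Flagging on unusual content would need a per-service notion of what is usual, which is a bigger design question than session length.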

What it has taught me already

The biggest lesson is structural. A honeypot is, in a real sense, just a fake service plus very good logging. The deception part is decorative. The logging discipline is where all the value lives. If I were redoing this I would put more time into the log format and less into making the FTP banner convincing.

The second lesson is that the threat landscape, viewed from a fresh box on a public IP, is mostly boring. The same scanners. The same default-credential probes. The same five or six exploit patterns. Sophisticated attacks exist; they just are not what hits a random IP, most of the time.

If I want to see more interesting traffic, I am going to need to advertise the box. Which is the next experiment.