Build your own log scanner: an evening's project

There are commercial log-analysis tools. They cost real money — sometimes tens of thousands of dollars per year. They do, in fairness, a great many things.

Eighty per cent of what they do can be replicated in an evening with Perl, its built-in regular expressions, and a sensible structure. I wrote one in a single evening, a fortnight ago. This post walks through how it was built and what I have learned from running it since.

The goal, narrowly defined

A log scanner needs to do four things:

  1. Read log files, including compressed historical ones, in the right time order.
  2. Match each line against a configurable set of patterns.
  3. Count or aggregate matches by useful properties (source IP, user, URL, etc).
  4. Produce a digest that can be emailed or read in five minutes.

This is not, by any stretch, a comprehensive monitoring system. It is the small, useful subset that catches the obvious problems.

The structure

The scanner is configuration-driven. The config is a simple text file: pattern, what to extract, what to call it.

# /etc/logscan.conf
failed_login   /Failed password for (\S+) from (\S+)/   user=$1 src=$2
404_response   / 404 \d+$/                              code=404
phf_probe      m{/cgi-bin/phf}                          probe=phf
long_url       m{GET \S{200,}}                          buffer_overflow

Three columns, separated by runs of at least two spaces (the regex column can itself contain single spaces): a name for the pattern, the regex, and a list of fields to extract. The Perl code reads the config, compiles the regexes, and applies them line by line.

The scanner itself is short:

use strict;
use warnings;

my %patterns;
open(my $cfg, '<', '/etc/logscan.conf') or die "logscan.conf: $!";
while (<$cfg>) {
    chomp;
    next if /^\s*(#|$)/;
    # Columns are separated by two or more spaces, because the regex
    # column may itself contain single spaces.
    my ($name, $re, $fieldstr) = split /\s{2,}/, $_, 3;
    next unless defined $re;    # skip malformed lines
    # Strip the /.../ or m{...} delimiters before compiling.
    $re =~ s{^m\{(.*)\}$}{$1} or $re =~ s{^/(.*)/$}{$1};
    $patterns{$name} = {
        regex  => qr/$re/,
        fields => [ split ' ', ($fieldstr || '') ],
    };
}

my %counts;
for my $log (@ARGV) {
    my $fh;
    # List form of the pipe open avoids the shell, so odd filenames are safe.
    if ($log =~ /\.gz$/) { open($fh, '-|', 'zcat', $log) or die "$log: $!" }
    else                 { open($fh, '<',  $log)         or die "$log: $!" }
    while (my $line = <$fh>) {
        for my $name (keys %patterns) {
            if ($line =~ $patterns{$name}{regex}) {
                $counts{$name}{total}++;
                # Capture groups are now in $1, $2 etc - record them
                # ... left as exercise
            }
        }
    }
}

for my $name (sort keys %counts) {
    printf "%6d  %s\n", $counts{$name}{total}, $name;
}

Forty-odd lines, including comments. Run it with ./logscan /var/log/messages /var/log/messages.1.gz and it spits out a tally per pattern.

This was the version I wrote in the first hour. It is not what the version looks like a fortnight later, but it is enough to start being useful.

What the current version adds

The naive version above counts. The version I run now does a few more things, all of which were responses to a real frustration:

It groups by extracted fields. If a pattern has named captures, the counts are kept per distinct value. So failed_login produces a per-user, per-source count, not just a total. This is the thing that lets me see "all 8,412 failed logins are for user root from a single source" rather than "8,412 failed logins".
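
For what it is worth, the exercise left open in the first-hour version is only a dozen lines. A sketch of the recording step, assuming the fields column has been parsed into entries like user=$1 (the by_value key is my own naming, nothing sacred):

# Inside the match loop, immediately after a successful match.
my @caps = ($1, $2, $3);               # save before any further matching
my @parts;
for my $f (@{ $patterns{$name}{fields} }) {
    if ($f =~ /^(\w+)=\$(\d+)$/) {     # e.g. 'user=$1'
        my ($label, $idx) = ($1, $2);
        push @parts, "$label=" . (defined $caps[$idx - 1] ? $caps[$idx - 1] : '-');
    }
    else {
        push @parts, $f;               # a bare tag like 'buffer_overflow'
    }
}
$counts{$name}{by_value}{ join ' ', @parts }++ if @parts;

The report loop then sorts keys %{ $counts{$name}{by_value} } by count and prints the top few values under each pattern's total.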

It bounds the report to a date range. Default: last 24 hours. Override with --since and --until.
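
Parsing the timestamps is the only fiddly part. A sketch, assuming classic syslog timestamps ("Feb  3 06:25:01") and with $since and $until standing in for the parsed command-line values:

use Time::Local;

my %mon;
my @abbr = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
$mon{ $abbr[$_] } = $_ for 0 .. 11;

sub line_time {
    my ($line) = @_;
    my ($m, $d, $H, $M, $S) =
        $line =~ /^(\w{3})\s+(\d+)\s+(\d+):(\d+):(\d+)/ or return;
    return unless exists $mon{$m};
    # Syslog omits the year; assume the current one
    # (good enough except around New Year).
    return timelocal($S, $M, $H, $d, $mon{$m}, (localtime)[5]);
}

The main loop then skips any line whose timestamp falls outside the $since..$until window.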

It supports threshold-only output. --min-count 10 excludes patterns that have fewer than 10 hits. This is the difference between a digest with 200 lines (most of which are single events) and a digest with 12 lines that highlight what is actually unusual.
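
In the report loop this is a one-line guard. A sketch using the core Getopt::Long module, with %counts carried over from the scanner above:

use Getopt::Long;
my $min_count = 1;
GetOptions('min-count=i' => \$min_count);

for my $name (sort keys %counts) {
    next if $counts{$name}{total} < $min_count;   # drop the single-event noise
    printf "%6d  %s\n", $counts{$name}{total}, $name;
}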

It runs from cron and emails me. Every morning at 6 AM, the previous 24 hours of logs are summarised and mailed to me. The reading takes 90 seconds.
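
The crontab entry is nothing exotic. Something along these lines, with illustrative paths and address:

0 6 * * * /usr/local/bin/logscan --min-count 10 /var/log/messages /var/log/messages.1.gz | mail -s "log digest" admin@example.com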

What it has actually caught

A fortnight of running this on my own modest infrastructure has caught:

  • Two days when failed login attempts to SSH spiked to over a thousand from a single IP. (A scanner running through a list of common passwords. Failed, harmlessly, because I disabled password authentication.)
  • A handful of GET requests with very long URLs, characteristic of buffer overflow attempts against IIS. (Not directed at me — I do not run IIS — but the scanner clearly does not check what server it is hitting first.)
  • One genuinely interesting case where a script kiddie was probing for vulnerable versions of about 30 different CGI scripts in turn from the same IP. The pattern of probes told me they were running a mass scanner; the timing told me they were going through a long list slowly to avoid rate-limiting.

None of these are emergencies. All of them are the kind of activity a sensible operator wants to be aware of. Without the scanner, I would not have been.

What this kind of tool is not good at

Three things, in increasing order of severity.

Real-time alerting. This is a daily-batch tool. If something needs to wake you up at 3 AM, you want a different system. The right architecture is a fast in-line filter (something like tail -F | grep) that pages on truly urgent patterns, plus this kind of digest for everything else.
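
A crude but serviceable version of that in-line filter, with an illustrative pattern and address (--line-buffered matters, or the pipe buffering delays the page):

tail -F /var/log/auth.log | grep --line-buffered 'Accepted .* for root' | \
    while read line; do echo "$line" | mail -s 'URGENT: root login' pager@example.com; done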

Behavioural baselines. The scanner cannot tell that yesterday was unusual unless you encode "unusual" as an explicit threshold. Most genuinely interesting attacks would not match a static threshold; they look normal in absolute terms but unusual compared to the baseline for that user/host/time. Real intrusion detection — the academic kind — has been working on this for years and is hard.

Cross-log correlation. Each log is processed independently. If an attacker is probing me on the web, then trying SSH, then sending mail spam, three different logs each show one phase. The scanner does not connect them. Doing so well requires a more sophisticated data structure than a per-pattern count.

For the level of threat I am facing, the scanner is enough. For something more serious, the next stop is real intrusion detection, with Snort on the network and a more sophisticated analysis layer above the logs.

A small philosophical aside

The interesting thing about writing this scanner was not the writing. It was the choosing.

Deciding what patterns to alert on, what thresholds to set, what fields to extract — these are the actual work. The Perl is incidental. The same exercise in any other language — Python, awk, even shell — would produce a tool of similar utility. The discipline is in figuring out what you actually want to know.

This is the part that the commercial tools do not give you. They come with hundreds of preset rules; the rules are mostly fine; deciding which of them to enable for your environment is still your job. The tool only saves you the writing of the regexes.

Which is, at the end of the day, the easy part.

