Most of what I do, day to day, is read log files. This is not glamorous and the security press does not write about it. It is, nonetheless, where most of the actual security work happens — including, weeks later, when an incident is being reconstructed.
The tools for this on a Unix box are essentially grep, awk, sed, and the shell. They are not new. They have not changed much in twenty years. They are, on the available evidence, the right tools for the job.
The shape of a logfile
Most Unix log files are line-oriented. One event per line. Whitespace-separated fields. A typical Apache access log line looks like:
203.0.113.7 - - [22/Aug/1998:12:01:33 +0000] "GET /index.html HTTP/1.0" 200 1623
IP, ident, user, timestamp, request, status, bytes. It reads like a sentence. The format is regular enough that simple field-extraction tools work on it without much fuss.
This is the killer property. Almost everything I want to do is some combination of: "find the lines matching X", "pull out field N from those lines", "count how often each value of field N appears".
The four-step idiom
The single most useful thing I have learned this year is the following pattern:
grep PATTERN logfile | awk '{print $FIELD}' | sort | uniq -c | sort -rn
This says: filter the lines you care about, extract the column you want, count distinct values, sort by count. It does not look like much. It is the foundation of every ad-hoc log analysis I have ever done.
For example, to find out which IPs hit my web server most often on a given day:
grep '23/Aug/1998' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
In a couple of seconds this prints a list of the top talkers. If one of them is anomalously high — say, fifty thousand requests from a single IP — that is something I should look at.
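If one address does stand out, the same idiom pivots to show what it was doing. Here I use the example address from the log line above; $7 is the requested path in that format:
grep '^203\.0\.113\.7 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head
If it is a scanner, the output is usually unmistakable: many distinct paths, few repeats.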
Specific things I keep an eye on
Over time I have built up a small mental list of patterns worth grepping for in access.log:
- Requests for /cgi-bin/phf, which is the canonical "is this a vulnerable web server" probe at the moment.
- Repeated 401 or 403 responses, which usually mean someone is trying credentials on a protected area.
- Long URLs containing .. segments, which are path traversal attempts.
- POSTs to scripts I do not have, which suggests someone is running a generic exploit kit against me.
None of these are sophisticated. All of them are visible in five seconds with the right grep.
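For concreteness, here is roughly how each of those checks looks as a one-liner. Field numbers assume the common log format from the example earlier — $1 the client address, $6 the quoted method, $7 the path, $9 the status — so adjust for your own format:
# phf probes
grep '/cgi-bin/phf' access.log
# who is accumulating 401s and 403s
awk '$9 == 401 || $9 == 403 {print $1}' access.log | sort | uniq -c | sort -rn
# path traversal attempts
grep '\.\./' access.log
# POST targets, to catch exploit kits firing blind
awk '$6 == "\"POST" {print $7}' access.log | sort | uniq -c | sort -rn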
The same idea applied to syslog
grep, awk, and friends do not care which logfile they are looking at. The same idiom works on syslog:
grep 'Failed password' /var/log/auth.log | awk '{print $9}' | sort | uniq -c | sort -rn
This pulls out the usernames most often tried in failed SSH (and similar) logins. The field number depends on the exact wording your daemon logs, so check one line by hand before trusting the column. If "root" is at the top of that list, someone is running a brute-force attempt and you should think about disabling root logins over the network.
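The other half of the picture is where the attempts come from. Since field positions shift with the message wording, counting backwards from the end of the line is often sturdier — a sketch, assuming lines that end "from ADDRESS port N ssh2":
grep 'Failed password' /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
The $(NF-3) picks the fourth field from the end of the line, which is the address under that assumption, whether or not the username portion shifted.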
What I am working towards
The ad-hoc one-liners are useful but they are not a system. I am slowly accumulating a small directory of shell scripts that wrap the patterns I run most often, with sensible defaults — date-bounding, top-N selection, output formatting. The eventual aim is to have a five-minute morning routine that prints a digest of yesterday's interesting log activity, automatically.
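To make that concrete, here is a sketch of what one of those wrappers might look like. The name, interface, and defaults are invented for illustration; this is not an existing tool:
#!/bin/sh
# topfield: wrap the grep | awk | sort | uniq -c | sort -rn idiom.
# Usage: topfield PATTERN FIELD LOGFILE [N]    (N defaults to 10)
pattern=$1
field=$2
log=$3
n=${4:-10}
grep "$pattern" "$log" | awk -v f="$field" '{print $f}' | sort | uniq -c | sort -rn | head -n "$n"
With that in place, the top-talkers example above shrinks to: topfield '23/Aug/1998' 1 access.log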
This is, I gather, what people in the academic intrusion-detection community are formalising. The idea of having a daemon that watches log files in real time and raises an alert when patterns appear — a sort of expert system for log lines — is at the centre of that emerging field. I am keeping an eye on it. The basic ideas, though, are right here in the Unix toolkit, and any time I spend learning to drive awk better will pay back for as long as I work with these systems.
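The toy version of such a daemon is itself a pipeline — a minimal sketch, not a real detector, with the pattern being whatever you care about that week:
tail -f /var/log/auth.log | awk '/Failed password/ { print "ALERT:", $0 }'
It falls over in all the ways you would expect — no state, no thresholds, nothing survives log rotation — which is presumably exactly what the academic work is trying to do properly.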