Last Tuesday I needed a logfile entry from about three weeks ago. The entry would have told me whether a particular incoming connection had succeeded or been refused. I went to find it. It was gone.
The logfile in question had been rotated, compressed, rotated again, and finally pruned by the standard log-rotation script that ships with most Linux distributions. The default retention was four weeks. The entry was outside that window.
This is a small disaster. I want to write about it because the lesson is bigger than "set retention longer".
The default policies and where they come from
Most Linux distributions ship with a logfile rotation tool — usually logrotate — configured with reasonable-sounding defaults. A typical default for /var/log/messages rotates the log weekly, keeps four weeks, compresses old rotations.
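Expressed as a logrotate stanza, that default looks roughly like this (an illustration of the shape, not any particular distribution's shipped file — paths and exact options vary):

```
/var/log/messages {
    weekly
    rotate 4
    compress
}
```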
This is fine for the original purpose: keep the log file from filling the disk. It is not fine if you ever want to investigate something more than a month old.
The default exists because log-rotation tools were designed in an era where storage was expensive and logs were rarely consulted retrospectively. Both of those conditions have changed. Storage is cheap. Investigation is increasingly the point.
What I have changed
For my own systems, I now have the following retention rules, which are about ten times the default:
- Mail logs: 12 months, compressed weekly.
- Web access logs: 12 months, compressed weekly.
- Auth logs: 24 months, compressed weekly.
- Kernel and syslog: 6 months, compressed weekly.
- Firewall drop logs: 12 months, compressed weekly.
This is a lot more storage than the defaults imply. On my home box it is about two gigabytes, which is significant on the disks I have. It is not significant on any system bought in the last two years. The cost-benefit is heavily in favour of keeping more.
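The arithmetic behind that claim is easy to sanity-check. A back-of-envelope sketch, where every daily volume is a made-up example (substitute figures from du on your own live logs):

```shell
# Back-of-envelope retention cost. All daily_kb values are hypothetical:
# compressed KB per day for each log class, times days of retention.
total_kb=0
for entry in "mail:2000:365" "web:5000:365" "auth:200:730" \
             "syslog:1500:180" "firewall:800:365"; do
    name=${entry%%:*}; rest=${entry#*:}
    daily_kb=${rest%%:*}; days=${rest#*:}
    kb=$((daily_kb * days))
    total_kb=$((total_kb + kb))
    echo "$name: $((kb / 1024)) MB"
done
echo "total: $((total_kb / 1024 / 1024)) GB"
```

The point of running the numbers is that even deliberately generous estimates land in single-digit gigabytes — noise on a modern disk.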
The values themselves are arbitrary; what matters is that I have decided what they should be, written it down, and tested that the actual rotation matches the policy.
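"Tested" can be as simple as logrotate's debug mode, which prints what it would do without doing it (-d performs no rotation; -s points at a scratch state file so the real state is untouched — adjust the config path to whatever your distribution uses):

```
logrotate -d -s /tmp/logrotate.test-state /etc/logrotate.d/rsyslog
```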
What logrotate does, mechanically
The logrotate config for a single log looks like:
/var/log/messages {
    weekly
    rotate 52
    compress
    delaycompress
    notifempty
    missingok
    sharedscripts
    postrotate
        /etc/init.d/syslog reload >/dev/null 2>&1 || true
    endscript
}
The options worth understanding:
- weekly: rotate once a week, regardless of size.
- rotate 52: keep 52 generations. Combined with weekly, this is 52 weeks.
- compress: compress rotated files. Combined with delaycompress below, everything but the most recent rotation ends up compressed.
- delaycompress: do not compress the just-rotated file yet; the daemon may still have an open handle on it.
- notifempty: do not rotate empty logs.
- missingok: do not error if the file is missing.
- sharedscripts: run the postrotate script once per matching file pattern, not once per file.
- postrotate: HUP the daemon so it reopens its log file. Without this, the daemon continues writing to the old (now renamed) file, and the new logfile remains empty.
The postrotate step is the one that bites you most. If you forget it, the rotation appears to work — the file gets renamed, a new empty file is created — but the daemon is still writing to the old name. You discover this a week later when you wonder why the new file is empty.
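The mechanism is easy to demonstrate without a daemon: an open file descriptor follows the inode, not the name, so a plain rename never redirects a writer. A minimal sketch in shell:

```shell
# Simulate a daemon that opens its log once and never reopens it.
cd "$(mktemp -d)"
exec 3>>app.log            # "daemon" opens its log and keeps the fd
echo "before rotate" >&3
mv app.log app.log.1       # "rotation": the file is renamed...
echo "after rotate" >&3    # ...but fd 3 still writes to the old inode
: > app.log                # new, empty file created under the old name
wc -l < app.log.1          # both lines landed in the rotated file
wc -l < app.log            # the new file stays empty until a reopen
exec 3>&-                  # close the fd: the reopen-on-HUP step, in effect
```

The postrotate HUP exists precisely to break this: the daemon closes the old descriptor and opens the path fresh, picking up the new inode.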
The thing I now do that is more important than retention
Off-host log shipping.
Log rotation on the original host is necessary but not sufficient. If the host is compromised, the attacker can rewrite or delete the local logs. The local logs are evidence; the local logs are also under the same control as the thing being investigated.
The answer is to ship the logs, in real time, to a different host. The receiving host should be one whose only job is logging: minimal services, minimal exposure, write-only access to the log store.
The ancient and reliable way to do this is syslog's remote forwarding. In /etc/syslog.conf:
*.* @logserver.example
This sends a copy of every syslog message to UDP 514 on logserver.example, in addition to whatever local logging happens. The logserver runs syslogd -r to accept remote messages and writes them to its own files.
The weakness is that this is UDP, which means messages can be lost in transit (and there is no integrity protection). For my home setup this is acceptable. For anything serious, you would want TCP-based syslog — syslog-ng is the obvious replacement.
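For reference, the receiving side of the TCP version in syslog-ng looks something like this (classic syslog-ng config syntax from memory; the names s_net and d_hosts and the destination path are placeholders, not anything standard):

```
# Accept TCP syslog on 514 and file messages per originating host.
source s_net {
    tcp(ip("0.0.0.0") port(514));
};
destination d_hosts {
    file("/var/log/remote/$HOST/messages");
};
log { source(s_net); destination(d_hosts); };
```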
What I lost, and what I learned
The specific entry I lost was a connection-attempt log line from a specific IP that I now suspect was associated with a probe my honeypot caught on a different port. With the original line I could correlate the activities. Without it, I have to assume the correlation but cannot prove it.
This is the kind of small loss that, on a serious investigation, would matter materially. The amount of data needed for a defensible incident reconstruction is a lot more than people think before they have done one. Three to six months is probably the floor.
The more general lesson, which is the one I am writing down to remember: defaults written for an old set of constraints persist into a new set of constraints. Every defaults file in any Unix system was written with assumptions about cost, threat model, and use case. Some of those assumptions are now wrong. The default logrotate config is one. The default firewall policy on most distributions is another. The default services running after install is a third.
Reviewing defaults is its own ongoing discipline.