Proper backups, with examples

Two years ago I had no backup discipline. Last year I had a bad one. This year, after one near-miss and a lot of reading, I think I have an adequate one. The progression is worth writing down because most operators I have spoken to are at one of the earlier stages and could probably benefit from skipping ahead.

Stage zero: no backups

The first six months of running a Linux box involved no backups at all. The thinking was, broadly, "the data is in places I could rebuild, the system is reproducible from the install media, and I do not have anything irreplaceable".

This was wrong on every count.

What actually lived on the box and could not be reproduced from elsewhere: my mail spool with two years of correspondence; my logfiles with every interaction the system had had; my custom firewall rules that had been built up incrementally; my SSH known_hosts and authorized_keys; my scripts in ~/bin that wrapped operational tasks I did frequently.

None of this was in CVS. Most of it I had not deliberately preserved. All of it was, in some real sense, intellectual capital that had taken months to build.

The near-miss came when I accidentally rm -rf'd my home directory. I had a moment of panic, then realised I had — by accident — copied the directory to a remote machine the previous week as part of an unrelated experiment. That copy was the entire reason I did not lose two years of work.

Luck. Pure luck. The next day I started Stage One.

Stage one: weekly tar archives

The first iteration was a weekly cron job that produced /var/backups/$(date).tar.gz containing the home directories, the customised parts of /etc, and the mail spool.
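
For the record, the job was little more than a crontab entry wrapping tar. What follows is a reconstruction from memory rather than the exact original; the paths and the Sunday-morning schedule are illustrative assumptions:

  # run every Sunday at 04:00; one full compressed archive per run
  # (in a crontab, % must be escaped as \%)
  0 4 * * 0  tar czf /var/backups/backup-$(date +\%Y\%m\%d).tar.gz /home /etc /var/mail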

This was a clear improvement on nothing. It also had several flaws that took me a while to notice:

  • The archives were on the same disk as the data. A disk failure would lose both.
  • Each weekly archive was a full snapshot. After three months, I had thirteen 800-MB files; the disk was filling up.
  • The archives were never tested for restoreability. I had no idea whether they could be opened, let alone whether the contents were complete.
  • The archive process did not include /etc/passwd, the databases behind the small applications I ran, or anything in /var/lib. It captured what I had thought to include rather than what was actually important.

Stage one would have saved me from the rm -rf problem only weakly: anything changed since the last weekly run, potentially most of a week's work, would still have been gone. It would not have helped at all against a disk failure.

Stage two: rsync to a remote machine

The second iteration moved the backups off-host. A nightly cron job ran rsync to push changes to a separate machine I had access to.

  rsync -aHv --delete --exclude='*.tmp' /home/ backup-host:/backups/$(hostname)/home/

The -a flag is archive mode (preserves permissions, ownership, timestamps, symlinks). The -H flag preserves hard links. The --delete flag removes files from the destination that are no longer in the source — keeping the backup in sync rather than monotonically growing.

This was a real improvement. The off-host part was the obvious win. The incremental nature meant the backup completed in minutes rather than hours after the first full sync.

It had, however, one terrible flaw: with --delete, the backup mirrors the source. If I deleted a file on the source and the backup ran before I noticed, the file was also deleted on the backup. The Stage Two backup protected against disk failure but not against operator error, which is the more common failure mode.

Stage three: rsnapshot, with versioning

The current iteration uses rsnapshot — a wrapper around rsync that maintains multiple generations of the backup using hard links.

The principle is clever. Each generation of the backup is a directory tree. Files that are unchanged between generations are hard-linked between the directories — so the disk space used is approximately the size of the data plus the deltas. If I want last Tuesday's version of a file, I look in the Tuesday backup directory; if the file existed and was unchanged that day, the hard link points to the same data as today's backup.
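
The mechanism can be demonstrated by hand with nothing more than cp and rsync. This is a sketch of the principle rather than rsnapshot itself, and the directory names are made up:

  # rotate: create a hard-linked copy of the newest snapshot
  cp -al snapshot.0 snapshot.1
  # sync the live data over snapshot.0: unchanged files keep sharing their
  # data with snapshot.1, changed files get fresh copies in snapshot.0 only
  rsync -a --delete /home/ snapshot.0/home/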

The configuration, sketched as an rsnapshot.conf fragment after this list, retains:

  • Hourly snapshots for the last 24 hours.
  • Daily snapshots for the last 7 days.
  • Weekly snapshots for the last 4 weeks.
  • Monthly snapshots for the last 6 months.
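
Expressed in rsnapshot.conf terms, that retention policy looks roughly like the fragment below. The snapshot_root and the backup points are assumptions about my layout rather than a configuration to copy verbatim, and the fields must be separated by tabs:

  snapshot_root  /var/backups/rsnapshot/
  interval       hourly   24
  interval       daily    7
  interval       weekly   4
  interval       monthly  6
  backup         /home/   localhost/
  backup         /etc/    localhost/

  # cron drives the rotation, e.g.
  0 * * * *   /usr/bin/rsnapshot hourly
  30 3 * * *  /usr/bin/rsnapshot daily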

Total disk usage is about three times the size of the live data, which is a price I am willing to pay for being able to look at the system as it was at any hour in the last day, any day in the last week, any week in the last month, and any month in the last six.

The restoration procedure is cp from the appropriate snapshot directory. There is no decoding step, no archive format to learn, no special tooling required. If I want a specific file as it was six hours ago, I just cp it from hourly.6/path/to/file.
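
Concretely, assuming the snapshot root and per-host layout sketched above, and with a made-up file name for illustration, a restore is a single command:

  cp -a /var/backups/rsnapshot/hourly.6/localhost/home/me/bin/rotate-logs ~/bin/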

What I have actually had to restore

In the last 12 months I have used the backups for restoration four times.

  • Once for a configuration mistake. I had broken sendmail by editing the wrong file. The previous day's /etc/mail/sendmail.cf came back from the daily snapshot, and a restart fixed it.
  • Once for an accidental delete. I had rm -rf'd a project directory I thought was elsewhere. The hourly snapshot from a few hours earlier restored it intact.
  • Once for what looked like corruption. A SQLite database had stopped working. I restored the previous hour's snapshot, found the same corruption, then went back to a daily snapshot from before the corruption first appeared. It turned out to be a write that had only partially completed; restoring from before that write fixed it.
  • Once for what was, in retrospect, a successful test. I deliberately deleted a file to confirm the backup worked. It did.

The restoration interface — cp from the snapshot — is simple enough that I do not worry about whether I will remember how to use it. The hard part is the discipline of making sure the snapshots themselves are working.

The discipline of testing the backups

This is the part I keep coming back to. A backup that has not been tested is not a backup. It is an unvalidated belief.

My current testing discipline is:

Weekly: a small test restore. Pick a random file from a backup directory and verify its contents against the live version (if it still exists) or against my expectations (if it does not). The exercise takes five minutes.
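
A sketch of what that five-minute exercise looks like in shell, under the same assumptions about the snapshot layout as above (shuf is a GNU convenience; a strictly portable version would pick the file some other way):

  # pick one random file from the newest daily snapshot and compare it
  # with the live copy, if the live copy still exists
  SNAP=/var/backups/rsnapshot/daily.0/localhost
  FILE=$(find "$SNAP/home" -type f | shuf -n 1)
  LIVE="/${FILE#$SNAP/}"
  if [ -e "$LIVE" ]; then
      # a mismatch is not automatically a problem: the file may simply
      # have changed since the snapshot was taken
      cmp "$FILE" "$LIVE" && echo "OK: $LIVE matches its snapshot"
  else
      echo "live copy is gone; inspect $FILE by hand"
  fi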

Monthly: a full simulated restore. Pick a non-trivial subdirectory, restore it to a scratch location, and verify the restored copy is functional — the application can be started against it, the config files parse, the data has the structure I expect.
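
The monthly version is equally unglamorous; roughly, with an application directory invented for the example:

  # restore one subtree into a scratch area, then exercise it by hand
  mkdir -p /tmp/restore-test
  cp -a /var/backups/rsnapshot/daily.0/localhost/var/lib/someapp /tmp/restore-test/
  # then point a test instance of the application at the restored copy,
  # check the config parses and the data has the expected structure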

Quarterly: a complete restore drill. Take a host out of service, wipe its disk (deliberately), and restore from backup. Verify the host comes back to the same state. This is more disruptive than the monthly version and I cannot do it on every host.

None of this is glamorous. All of it is the difference between having backups and having an unverified hope.

What I would change if starting over

Three things, looking back at the progression:

Start with rsnapshot or equivalent. Skip the tar-archive and plain-rsync stages. The versioned-snapshot approach is the right answer; the cost of getting there directly is low.

Identify what is actually irreplaceable on day one. Make a list. Reread the list weekly. The list will be longer than you expect; that is fine; you just need to know what you have.

Test the restore from day one. The first thing you should do after configuring backups is restore from them. Not three months later when you have grown to trust the configuration; immediately, before you trust anything.

The single most important thing the discipline has changed about my operational mindset is the confidence with which I make changes. I now apply firewall changes, software upgrades, and experimental configurations knowing that, if I break something, I have a recent snapshot to compare against. The discipline is not about disaster recovery primarily. It is about reducing the cost of being wrong.

Which is, in some real sense, what good operational discipline always is.

