I have been running MRTG — Tobi Oetiker's Multi Router Traffic Grapher — on my home network for about three months. It produces small line-graphs of traffic counters from any SNMP-speaking device. The graphs are not pretty by modern standards. They are, however, the most useful operational tool I have installed since I started.
I want to write about MRTG specifically, and about the discipline it has forced on me, a discipline that generalises beyond the tool itself.
What MRTG does
Mechanically, MRTG is small. A cron job runs every five minutes and polls SNMP counters from the devices you have configured. It writes the values into compact log files. A separate run regenerates HTML pages with embedded graphs from those logs.
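Concretely, the schedule is just two cron entries. A sketch, assuming a Vixie-style cron; the paths are illustrative and will differ on your system:

```
# Poll SNMP counters, update the log files, and regenerate graphs every five minutes
*/5 * * * *   /usr/bin/mrtg /etc/mrtg/mrtg.cfg
# Rebuild the HTML index page every fifteen minutes
2,17,32,47 * * * *   /usr/bin/indexmaker /etc/mrtg/mrtg.cfg > /var/www/mrtg/index.html
```

There is nothing resident in memory; if the machine is down for an hour, you simply get a gap in the graphs.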
The graphs show four time-scales by default — last 24 hours, last week, last month, last year. The most recent points are exact; the older points are aggregated. The result is a set of small images that show you, at a glance, how each measured value has behaved over time.
The original use case was network traffic on routers — bytes in and bytes out per interface. But MRTG can graph any value an SNMP agent exposes, which in practice is almost anything. People graph CPU usage, memory usage, disk usage, queue depth, mail volume, login attempts. The tool is general.
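For the classic per-interface case, a single mrtg.cfg stanza is enough. A minimal sketch — the hostname, community string, and interface index here are placeholders, not my real values:

```
WorkDir: /var/www/mrtg
# Interface 2 on the router, polled with read-only community "mycommunity".
# MaxBytes is the line speed in bytes/second (here 100 Mbit/s), used to
# clip bogus counter values and scale the graphs.
Target[router-eth0]: 2:mycommunity@192.168.1.1
MaxBytes[router-eth0]: 12500000
Title[router-eth0]: Router eth0 traffic
PageTop[router-eth0]: <H1>Router eth0 traffic</H1>
```

Graphing something other than an interface is the same pattern with an explicit OID in the Target line.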
What it has actually changed for me
Three things, in order of how often they bite.
I now know what 'normal' looks like for my network. Before MRTG, I had a vague sense that my outbound bandwidth peaked in the evenings. I knew this because I had occasionally noticed slow downloads. After MRTG, I can read off, for any hour of any day of the past month, exactly what the traffic was. This means anomalies are visible. A spike in outbound traffic at 3am on a Tuesday is now obvious, where before it would have been invisible.
I can correlate cause and effect across the network. When something feels slow, I can look at the graphs for the router, the file server, and the workstation simultaneously, and see whether the bottleneck is one of them or all of them or none of them. This kind of multi-point view is essentially impossible without persistent measurement.
I have a reasonable historical record for incident reconstruction. A month ago I had a brief mail outage. The graph showed exactly when it started, exactly when it ended, and that during the outage the SMTP queue depth had spiked while the inbound rate was normal. The cause was, on inspection, a broken DNS lookup that was holding up the queue. Without the graph I would have known there was an outage but not its shape.
The discipline the graphs force
This is the part that surprised me. Once you have graphs of a few measurements, the natural question is "what should normal look like?" — and answering it forces you to write down expectations explicitly.
For my outbound traffic on the modem link, normal is about 50 kbit/s sustained, with peaks to 250 kbit/s during downloads, and quiet periods near zero overnight. I have written this down. If I see traffic outside that envelope, I look more carefully.
For my mail volume, normal is about 30 messages per day inbound, 5 outbound, with occasional spikes when I receive a digest or send a long thread. I have written this down. A sudden spike to 500 inbound would tell me I had been added to a list, or something else worth investigating.
For my Snort alerts, normal is about 20 to 40 alerts per day, mostly low-severity scan-type events. I have written this down. A sudden spike to 1000 would tell me something serious is happening.
The act of writing down what "normal" looks like is what makes "abnormal" detectable. Without explicit expectations, every anomaly looks like business as usual. With explicit expectations, anomalies stand out.
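The written-down envelopes are simple enough to check mechanically. A minimal sketch in Python — the numbers are the ones from my notes above, and the function is hypothetical glue, not part of MRTG:

```python
# Written-down "normal" envelopes: (low, high) bounds for a measurement.
# These are my expectations from the text above; yours will differ.
ENVELOPES = {
    "outbound_kbit": (0, 250),          # sustained ~50, peaks to 250 kbit/s
    "mail_inbound_per_day": (0, 100),   # normal ~30; 500 means I'm on a list
    "snort_alerts_per_day": (0, 60),    # normal 20-40; 1000 means trouble
}

def out_of_envelope(name, value):
    """Return True if a measurement falls outside its written-down envelope."""
    low, high = ENVELOPES[name]
    return not (low <= value <= high)

# A 3am spike in outbound traffic now stands out immediately:
print(out_of_envelope("outbound_kbit", 800))        # True: look more carefully
print(out_of_envelope("mail_inbound_per_day", 28))  # False: business as usual
```

The point is not the code, which is trivial; it is that the envelope exists at all. Without the explicit numbers there is nothing to compare against.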
What this is not
There are a few things MRTG decisively is not, and they are worth spelling out, because vendor pitches in this space often blur the distinctions.
It is not real-time alerting. The graphs update every five minutes. If you need to know about a problem in 30 seconds, this is the wrong tool. The right tool is something on top of tail -F or a more sophisticated event-stream watcher.
It is not a database. The data is in compact log files that have been pre-aggregated for the time scales MRTG cares about. You cannot query it for "all five-minute samples in May where traffic exceeded 200 kbit/s". The data is there but the access patterns are limited.
It is not anomaly detection. MRTG shows you the data; deciding what is and is not anomalous is your job. There are tools that try to automate this — statistical baselines and so on — but they are still at an early stage.
The wider monitoring problem includes all of the above. MRTG is one piece. It happens to be the piece that gives the highest value for the lowest effort, in my experience.
A practical setup
For a small network, the SNMP and MRTG setup that has worked for me:
- Each device runs an SNMP daemon configured to allow read-only access from a specific address (the MRTG host) using a non-default community string. The default community is public; mine is something else, scoped to one IP.
- The MRTG host runs the polling cron every five minutes, the page regeneration every fifteen.
- The output HTML is served by Apache from a private virtual host accessible only from my LAN. The graphs are not on the public internet.
- A small index page links to the graphs grouped by category — network, hosts, applications.
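The read-only, single-address access in the first bullet takes one line for a net-snmp style daemon. A sketch — the community string and the MRTG host's address are placeholders:

```
# snmpd.conf: read-only access, non-default community, one allowed source.
# "s3cretstring" and 192.168.1.10 (the MRTG host) are placeholders.
rocommunity s3cretstring 192.168.1.10
```

Other SNMP daemons spell this differently, but the shape is the same: one community, one source address, read-only.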
Total disk usage for three months of data across about thirty graphs is under 50 MB. The CPU load is negligible. The setup time was an afternoon.
What I am moving towards
The natural next step beyond MRTG is something that does alerting on the same data — a process that reads the same SNMP counters, applies thresholds, and pages you when they are violated. Tools in this category exist. Big Brother is the one most people I know are running. The next-generation tool I keep hearing about, Nagios, is supposedly much more flexible.
For now, I am content with the rhythm of looking at graphs in the morning. The point of measurement, for a small operator, is not to react in real time to every wobble — it is to develop a sense, over time, of what the system normally does. The morning routine of glancing at graphs is enough to do that.
When I scale up — or when something I run becomes critical enough that the morning routine is too slow — I will graduate to alerting. Until then, MRTG is the right tool for the right level of operational maturity.