I have spent a substantial portion of the last year writing grep idioms and Perl scripts to extract structure from unstructured log files. The work is fine. The work is also a kind of after-the-fact archaeology — reverse-engineering structure that was never deliberately put into the files.
Three months ago I started writing applications that produce structured logs by design. The exercise has changed how I think about what a logfile is for. I want to write down the principles, because I think this is going to become a more important practice as systems get more complex.
What a structured log is
A traditional Unix log line is human prose. The Apache access log is the canonical example:
203.0.113.7 - - [22/Aug/1999:12:01:33 +0000] "GET /index.html HTTP/1.0" 200 1623
This is parseable — there is an implicit format — but the format is positional. Field one is the IP. Field two is the ident username (always "-" in practice). Field three is the user. The square-bracketed thing is the timestamp. And so on.
If the format changes — say, someone adds a Referer field at position seven — then every script that parses the file has to be updated. The format is fragile.
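To make the fragility concrete, here is roughly what a positional parse of that line looks like. This is a sketch of the general idea, not any particular tool's code; every numbered capture is a bet that the column order never changes:

#!/usr/bin/perl
# Positional parse of the Apache common log format. The meaning of each
# field lives entirely in its position; add or move a column and every
# capture after it silently grabs the wrong thing.
use strict;
while (<>) {
    my ($ip, $ident, $user, $ts, $request, $status, $bytes) =
        /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/ or next;
    my ($method, $path) = split ' ', $request;
    print "$ip $method $path $status\n";
}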
A structured log is one where each line carries explicit field names. A trivial form:
ts=1999-09-06T12:01:33Z ip=203.0.113.7 method=GET path=/index.html status=200 bytes=1623
Each field is name=value. A parser does not need to know the order. New fields can be added without breaking old parsers. Old fields can be removed without confusing new readers — they will just see the absence and proceed.
This is sometimes called logfmt in the new wave of writing about this. The idea predates the name.
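Reading it back is just as simple. A minimal sketch of an order-independent parser, using the quoting convention I describe later (double quotes around values with spaces, backslash-escaped quotes inside):

#!/usr/bin/perl
# Parse name=value lines into a hash. Field order does not matter;
# unknown fields are kept, missing fields are simply absent.
use strict;
while (<>) {
    my %f;
    while (/(\w+)=("(?:[^"\\]|\\.)*"|\S+)/g) {
        my ($k, $v) = ($1, $2);
        if ($v =~ s/^"(.*)"$/$1/s) { $v =~ s/\\(.)/$1/g }   # strip quotes, unescape
        $f{$k} = $v;
    }
    print "$f{path} -> $f{status}\n" if exists $f{path} && exists $f{status};
}

Run over the example line above, this prints /index.html -> 200, and it keeps printing the right thing no matter what order the fields arrive in.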
Why this matters more than it sounds
Three practical things change when you commit to structured logs.
Querying becomes uniform. Every log line has the same shape: a list of fields. A single tool can query any logfile. "Show me all events where status != 200" works whether the log is from Apache, my own application, or any other tool I have written. No per-format parsing.
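The query itself is a one-liner; something like this (the file names are invented):

perl -ne 'print if /\bstatus=(\d+)\b/ && $1 != 200' access.slog app.slog mail.slog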
Field consistency becomes enforceable. If I commit to using ip= for every IP-typed field, I can write tools that aggregate IPs across logs from different services. "Top ten IPs by failed-auth events across all my services" becomes a single query. With unstructured logs, this is per-tool work.
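As a sketch, with an event name (auth_failed) and a file layout that are mine, not any standard:

perl -ne 'print "$1\n" if /\bevent=auth_failed\b/ && /\bip=(\S+)/' */auth.slog \
    | sort | uniq -c | sort -rn | head -10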
Future fields are cheap. When I want to add a new piece of context — say, a request-id that ties together log lines from different services for a single request — I just start emitting req_id=... everywhere. No coordination required. Existing parsers continue to work; new parsers that care about the field can use it.
What I have actually done
For my own applications, I have a small Perl module that emits log lines in this format. The interface is:
use Log::Structured;
log_event(level => 'info', event => 'request_received',
          ip => $ip, method => $method, path => $path);
This produces:
ts=1999-09-06T12:01:33Z level=info event=request_received ip=203.0.113.7 method=GET path=/index.html
The ts and level fields are added automatically. The other fields come from the call.
The module is not sophisticated. It enforces a few rules about field-name format (lowercase, underscores, no spaces) and quoting (values with spaces or equals signs are wrapped in double quotes with embedded quotes escaped). That is essentially it.
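For the curious, the heart of it is a formatting routine along these lines. This is a sketch with invented names (format_line, log_event), not the module's actual source:

use strict;
use POSIX qw(strftime);

# Turn a hash of fields into one structured line. Names are lowercased
# and restricted to word characters; values containing whitespace,
# quotes or equals signs are double-quoted with embedded quotes and
# backslashes escaped. Field order does not matter to parsers, so
# sorted order is fine.
sub format_line {
    my %f = @_;
    my @parts;
    for my $k (sort keys %f) {
        my ($name, $v) = (lc $k, $f{$k});
        $name =~ s/[^a-z0-9_]/_/g;
        $v = '' unless defined $v;
        if ($v eq '' or $v =~ /[\s="\\]/) {
            $v =~ s/([\\"])/\\$1/g;
            $v = qq{"$v"};
        }
        push @parts, "$name=$v";
    }
    return join(' ', @parts) . "\n";
}

# log_event adds ts and level automatically, as described above.
sub log_event {
    my %f = @_;
    $f{ts}    ||= strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);
    $f{level} ||= 'info';
    print format_line(%f);
}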
For the existing services I run that I do not control — Apache, sendmail, BIND — I have started using a small parser that reads their native logs and produces a structured equivalent for downstream tools. Apache's combined log format becomes:
ts=... ip=... method=... path=... status=... bytes=... user_agent="..." referer="..."
The parser is per-source, but the output is uniform.
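A rough shape of the Apache half, to show how little there is to it. This is a sketch, not my actual converter; the rewrite of the timestamp into ISO 8601 is elided and the Apache date is simply quoted as-is:

#!/usr/bin/perl
# Apache combined log in, one structured line out. (Sketch only.)
use strict;
while (<>) {
    my ($ip, $user, $ts, $method, $path, $status, $bytes, $ref, $ua) =
        m{^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"}
        or next;
    s/"/\\"/g for $ref, $ua;                       # escape embedded quotes
    print qq{ts="$ts" ip=$ip user=$user method=$method path=$path },
          qq{status=$status bytes=$bytes referer="$ref" user_agent="$ua"\n};
}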
Where the discipline pays off
A few specific cases where the structured-log approach has saved me real time.
Cross-service correlation. Last week I needed to investigate a slow request. The user had reported it via email, with an approximate timestamp. The request had passed through Apache, my CGI script, and a backend database service. With structured logs, all three log files were filterable by the same time window and by the user's IP. The relevant lines from each service were in front of me in seconds. Without structured logs, the same investigation would have taken half an hour of grep-and-cross-reference.
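The mechanical part of that investigation looks roughly like this (the file names, the window, and the address are invented). Because the timestamps are ISO 8601, the window test is a plain string comparison:

perl -ne 'print "$ARGV: $_" if /\bip=203\.0\.113\.7\b/
    && /\bts=(\S+)/ && $1 ge "1999-09-06T11:55" && $1 le "1999-09-06T12:10"
' apache.slog cgi.slog db.slog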
Aggregate health metrics. Producing a daily "how is the service doing" digest is now trivial. awk -F= '{ for(i=1;i<=NF;i++) ... }' over a logfile of structured events produces aggregations by any field. The same idiom works for any service.
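Spelled out as a small reusable script rather than an elided one-liner, the digest step might look like this. A sketch, with an invented name (tally.pl):

#!/usr/bin/perl
# tally.pl -- count structured-log events by the value of one field.
# Usage: tally.pl status access.slog   (or any field name, any files)
use strict;
my $field = shift or die "usage: tally.pl FIELD [FILES...]\n";
my %count;
while (<>) {
    $count{$1}++ if /\b\Q$field\E=("(?:[^"\\]|\\.)*"|\S+)/;
}
printf "%8d  %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} } keys %count;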
Anomaly detection. Counting events of a specific type per unit time gives me a baseline. Structured logs make the count easy. The log scanner I built has become substantially simpler and faster since the logs it reads are uniform.
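The baseline itself is just a bucketed count. For example, failed-auth events per hour, where the event name is mine and the first thirteen characters of an ISO timestamp happen to be the hour bucket:

perl -ne 'print substr($1, 0, 13), "\n" if /\bevent=auth_failed\b/ && /\bts=(\S+)/' auth.slog | sort | uniq -c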
What this is not
A structured log is not a database. The query patterns are still scan-based; you read the file and filter. For high volumes — millions of events per day — you eventually need a real database. The progression in my experience has been: unstructured logs → structured logs → indexed structured logs → real database. Each step is a smaller change than it looks, because the structured-log step gives you a clean interface from which to feed the indexed and database layers.
A structured log is also not a replacement for the human-readable form. A line of ts=... ip=... method=... path=... status=... is harder for a human to skim than the Apache prose form. Most operators would, for casual reading, prefer the prose. The right answer in my experience is both: a service produces both a human-readable log (for ad-hoc reading) and a structured one (for analysis). The cost of producing both is small.
A small note on standards
There are a few partially-overlapping conventions for what structured logs should look like. The key=value form I am using is one. JSON logs (one JSON object per line) are another. CSV is a third. XML is, mercifully, almost never used for this.
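For comparison, the request event from earlier written as a JSON log would be one object per line, something like:

{"ts":"1999-09-06T12:01:33Z","level":"info","event":"request_received","ip":"203.0.113.7","method":"GET","path":"/index.html","status":200}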
For my purposes the key=value form strikes the easiest balance — readable enough by humans, parseable enough by scripts, low overhead in the writing. JSON is more rigorous and probably what I would use if my tooling were heavier. CSV is too rigid about field ordering for my taste.
The specific format matters less than committing to a format. The thing that produces the value is consistency. A logfile that is sometimes structured and sometimes prose is no better than one that is always prose, because every parser still has to handle both cases.
What I would do differently if I were starting over
If I were starting from scratch on a new system today, I would commit to structured logs from day one, with a small set of conventions:
- Every event has a ts (ISO 8601 UTC), a level, and an event (a short verb-like name).
- Every event has a service field naming the producing service.
- Common types — IP, HTTP method, user, request-id — have agreed-on field names that all services use.
- Logs are written to per-service files, then forwarded to a central place where they are merged for aggregate queries.
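Put together, a line that follows all four conventions might look like this (the service name and request-id value are invented for illustration):

ts=1999-09-06T12:01:33Z level=info event=request_received service=webfront req_id=8f3a21 ip=203.0.113.7 method=GET path=/index.html status=200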
This is more discipline than most operators apply to logging. It is, I am convinced, the right discipline. The cost is low, applied at design time. The benefit compounds for as long as the system runs.
The operators I respect most are the ones who treat their logs as a designed artefact rather than a side effect. I am slowly trying to become one of them.