Structured logs at scale

Two years of structured-log discipline has produced a corpus that is starting to raise scaling questions. The volume on my own setup is now about 200MB of structured logs per day, retained for a year; total storage is approaching 80GB.

This post is a walk through what works and what does not as the data grows.

What still works at this scale

Daily aggregation. A small set of awk/grep/sort/uniq pipelines that produce the morning digest of yesterday's events. Total compute time: a few minutes per day. Output is the kind of report I have been generating since 1999. The simplicity scales fine for daily windows.
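
Roughly the shape of one of these pipelines, with paths and field names as placeholders rather than my exact setup:

    # Count yesterday's firewall.drop events by source address, top 20.
    # (GNU date; files are assumed to be per-source, per-day, key=value lines.)
    yesterday=$(date -d yesterday +%Y-%m-%d)
    grep -F 'event=firewall.drop' "/var/log/structured/firewall.$yesterday.log" \
      | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^src=/) print substr($i, 5) }' \
      | sort | uniq -c | sort -rn | head -20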

Specific-pattern queries. Looking for a particular event class within a known time window. grep against the relevant file is fast enough; the pattern is easy to express. This works for any retention horizon as long as you know roughly when to look.
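
For example, with the event name, address and file layout as placeholders:

    # One event class, one known week: pick the right days' files and grep.
    # (203.0.113.0/24 is a documentation address range.)
    zgrep -F 'event=ssh.login.fail' /var/log/structured/auth.2024-03-1[0-7].log.gz \
      | grep -F 'src=203.0.113.7'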

Real-time alerting on top of tail -F. Watching the most recent events for specific patterns and alerting on them. The data volume is irrelevant; only the recent stream matters.
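
A minimal sketch of that loop, with the pattern and the alert command as placeholders:

    # Watch the live stream and alert on a specific pattern (GNU grep for
    # --line-buffered).
    tail -F /var/log/structured/firewall.current.log \
      | grep --line-buffered -F 'dst_port=22' \
      | while read -r line; do
          printf '%s\n' "$line" | mail -s "ssh probe" alerts@example.org
        done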

What is breaking

Cross-time-window analysis. Questions like "how often does this pattern occur across the entire year?" require scanning the full corpus. With 80GB of compressed logs, a full scan takes hours. This is too slow for interactive exploration.
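
The query is trivial to express; the cost is entirely in the scan. Something like:

    # "How often did this happen over the whole year?" means decompressing and
    # scanning everything; this is the part that takes hours.
    zcat /var/log/structured/*.log.gz | grep -cF 'event=snort.alert'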

Multi-source correlation. Joining events across multiple log sources (firewall + Snort + Apache + Sendmail) to follow a single "chain" of activity is awkward with text-file tooling. The data is structured per-line but the cross-source structure has to be reconstructed each time.
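
Reconstructing one chain by hand looks something like this, with the address and paths as placeholders:

    # Filter each source on the common key, then merge by timestamp. In practice
    # each source tends to name the client field differently, and the whole merge
    # has to be rebuilt for every new question.
    for src in firewall snort apache sendmail; do
        zgrep -hF '198.51.100.23' /var/log/structured/$src.2024-03-10.log.gz
    done | sort -k1,2    # assumes date and time are the first two fields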

Statistical analysis. Computing distributions, time-series statistics, or anomaly scores against the full corpus is impractical with awk alone; some of the queries I have attempted take days to run.

What I have tried

Three things, with mixed results.

Pre-computed indices. A nightly batch job that builds index files for common query patterns. It works: the indexed queries run about 100x faster. Cost: extra disk space and batch-job complexity.
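
The nightly job is not much more than an awk pass over yesterday's files; the layout here is a placeholder:

    # One index line per event class and source address, with a count, so the
    # common "how often did X hit us" questions become a grep against a small file.
    day=$(date -d yesterday +%Y-%m-%d)
    awk '{
        ev = ""; src = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^event=/) ev  = substr($i, 7)
            if ($i ~ /^src=/)   src = substr($i, 5)
        }
        if (ev != "") count[ev " " src]++
    } END {
        for (k in count) print k, count[k]
    }' /var/log/structured/*."$day".log > "/var/log/structured/index/$day.idx"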

Per-source structured columns. Restructuring the logs from key=value text to fixed-column tab-separated values, with the columns chosen to match common queries. Faster to query but loses some flexibility (new fields require schema changes). Mixed results.
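
The conversion itself is a one-pass awk job; the column set here is a placeholder, and choosing it up front is exactly where the flexibility goes:

    # key=value lines to fixed tab-separated columns. Assumes the first two
    # whitespace-separated fields are date and time.
    awk -v OFS='\t' '{
        split("", kv)
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0) kv[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
        print $1, $2, kv["event"], kv["src"], kv["dst"], kv["dst_port"]
    }' firewall.2024-03-10.log > firewall.2024-03-10.tsv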

Pushing some data into a relational database. Specifically the firewall.drop events, which are the highest-volume and most-queried. The database (PostgreSQL) handles them well; queries that took hours run in seconds. Cost: another moving part to maintain.
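
The shape of the PostgreSQL side, roughly; database, table and column names are placeholders:

    # Table for the highest-volume event class, loaded from a tab-separated
    # extract with matching columns.
    psql logs -c "
        CREATE TABLE IF NOT EXISTS firewall_drop (
            ts       timestamptz NOT NULL,
            src      inet        NOT NULL,
            dst      inet        NOT NULL,
            dst_port integer
        );"
    # \copy reads the file on the client side.
    psql logs -c "\copy firewall_drop FROM 'firewall_drop.tsv' WITH (FORMAT text)"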

What I am trying next

SQL as the analysis substrate for high-volume logs. Continuing the database experiment and expanding it to other high-volume sources. The schema discipline is more demanding than flat text files require, but the query power is much greater.
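
An example of the kind of query that becomes cheap, using the placeholder table from the sketch above:

    # Top talkers across the whole retained year.
    psql logs -c "
        SELECT src, count(*) AS drops
        FROM   firewall_drop
        WHERE  ts >= now() - interval '1 year'
        GROUP  BY src
        ORDER  BY drops DESC
        LIMIT  20;"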

Better cross-source correlation. A "trace ID" propagated through the system that ties together events for the same logical operation. This is operationally non-trivial — it requires modifying every service to propagate the ID — but enables clean cross-source queries.
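
Once the ID exists, a whole chain reduces to a single filter; the ID and field name here are placeholders:

    # Every service stamps the same trace ID on its events for one logical
    # operation, so the chain is one grep plus a timestamp sort.
    zgrep -hF 'trace=4f2a9c1e' /var/log/structured/*.2024-03-10.log.gz | sort -k1,2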

Sampling for statistics. For statistical questions where exactness is not required, sampling 1% of the data is much faster and produces approximately-correct answers. The discipline is to be explicit about when sampling is appropriate.
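
The sample itself is a one-line awk filter:

    # A rough 1% sample for questions where exactness is not required.
    # srand() seeds from the clock; rand() < 0.01 keeps roughly one line in 100.
    zcat /var/log/structured/firewall.*.log.gz \
      | awk 'BEGIN { srand() } rand() < 0.01' \
      > firewall.sample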

What this teaches

The discipline of structured logging works at small scale. At larger scale, the simple text-file model starts to break and structural changes are needed.

This is the standard pattern of operational tooling. Small things scale by accident; medium things scale by design; large things scale by replacing the architecture. I am at the boundary between small and medium; the structural changes are happening incrementally.

For anyone running a similar setup at similar scale: expect to hit similar limits. The migration to a database for high-volume sources is the natural next step.

For anyone running at much larger scale: the patterns are well-known to operations teams in larger organisations. The advice is similar but the toolchain is different — log aggregation systems, time-series databases, specialised analysis tools.

More as the year develops.

