Reading the netfilter source, properly

Following my migration to Linux 2.4 and iptables, I have spent two weeks reading the netfilter source carefully. The architecture I described from the design documents in 2000 holds; the implementation is, on the available evidence, well-disciplined.

This post is the writeup: the specific things that stood out, and what they imply.

The hook architecture

The core of netfilter is a small set of hook functions called from specific points in the network stack. The hooks are at:

  • NF_IP_PRE_ROUTING: after packet receipt, before routing decision.
  • NF_IP_LOCAL_IN: after routing, for packets destined locally.
  • NF_IP_FORWARD: for packets being routed through.
  • NF_IP_LOCAL_OUT: for locally-generated packets, before routing.
  • NF_IP_POST_ROUTING: after routing, before packet transmission.

At each hook, registered modules are called in priority order. Each module can let the packet continue, drop it, queue it for userspace processing, or steal it and take over responsibility for further handling.
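The dispatch described above can be modelled in a few lines of userspace Python. This is a sketch of the idea, not the kernel code: the verdict names mirror the kernel's NF_ACCEPT, NF_DROP, NF_QUEUE and NF_STOLEN constants, but the HookPoint class, the priority numbers and the two example callbacks are hypothetical names of my own.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"   # let the packet continue to the next module
    DROP = "drop"       # discard the packet
    QUEUE = "queue"     # hand the packet to userspace
    STOLEN = "stolen"   # the module has taken ownership of the packet


class HookPoint:
    """Illustrative model of one hook, e.g. NF_IP_PRE_ROUTING."""

    def __init__(self, name):
        self.name = name
        self._hooks = []  # list of (priority, callback)

    def register(self, priority, fn):
        self._hooks.append((priority, fn))
        # Lower priority numbers run first; the sort is stable, so
        # equal priorities keep their registration order.
        self._hooks.sort(key=lambda entry: entry[0])

    def run(self, packet):
        for _priority, fn in self._hooks:
            verdict = fn(packet)
            if verdict is not Verdict.ACCEPT:
                return verdict  # any other verdict ends the traversal
        return Verdict.ACCEPT


seen = []  # destination ports observed by the logging callback


def log_packet(packet):
    seen.append(packet["dport"])
    return Verdict.ACCEPT


def drop_telnet(packet):
    return Verdict.DROP if packet["dport"] == 23 else Verdict.ACCEPT


pre_routing = HookPoint("NF_IP_PRE_ROUTING")
pre_routing.register(100, drop_telnet)
pre_routing.register(-100, log_packet)  # registered later, but runs first
```

Calling pre_routing.run() with a packet dictionary walks the callbacks in priority order and stops at the first verdict that is not ACCEPT, which is the property that lets independent modules share one hook.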

The interface is minimal. The modules are independent. Reading the core code (net/core/netfilter.c plus the per-protocol hook code) is straightforward; the whole framework is a few hundred lines.

The conntrack module

The most interesting subsystem is ip_conntrack — the connection-tracking module that maintains state for active connections.

The state machine is per-connection. For each TCP connection (or each pseudo-connection for UDP/ICMP), the module tracks:

  • The tuple (protocol, src IP, src port, dst IP, dst port), kept for both the original and the reply direction.
  • The current TCP state (SYN_SENT, ESTABLISHED, FIN_WAIT, etc.).
  • The expected next-segment behaviour.
  • The timeout for cleanup.

Reading this module's source has clarified for me how the iptables state matches actually work. A packet matches state ESTABLISHED when it belongs to a conntrack entry that has already seen traffic in both directions. This is much more accurate than the TCP-flag-matching approach I used to use.
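To make the state match concrete, here is a deliberately simplified model of a tracking table. It is a sketch under stated assumptions, not a transcription of ip_conntrack: the names Tuple5, reply_of and ConnTable are hypothetical, and timeouts, the TCP state machine, and the RELATED and INVALID states are all left out.

```python
from collections import namedtuple

# Hypothetical tuple type for illustration only.
Tuple5 = namedtuple("Tuple5", "proto src sport dst dport")


def reply_of(t):
    """Return the tuple a reply packet would carry."""
    return Tuple5(t.proto, t.dst, t.dport, t.src, t.sport)


class ConnTable:
    def __init__(self):
        # Both directions of a connection point at the same entry.
        self._entries = {}

    def classify(self, t):
        entry = self._entries.get(t)
        if entry is None:
            # First packet of an unknown connection: create the entry
            # and index it under both the original and reply tuples.
            entry = {"original": t, "seen_reply": False}
            self._entries[t] = entry
            self._entries[reply_of(t)] = entry
            return "NEW"
        if t != entry["original"]:
            entry["seen_reply"] = True  # traffic in the reply direction
        return "ESTABLISHED" if entry["seen_reply"] else "NEW"
```

In this model a retransmitted first packet is still NEW, the first reply is ESTABLISHED, and so is everything after it in either direction. A flag-based rule cannot make that distinction, because it has no memory of what came before.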

The module also exposes its state via /proc/net/ip_conntrack. I have written a small monitoring script that reads this and produces a summary of active connections, which has become useful for general operational visibility.
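A minimal sketch of that kind of summary follows. It is not the script I actually run. It assumes each line starts with the protocol name, the protocol number and the remaining timeout, followed by the state name for TCP entries; check that against the output on your own kernel before relying on it. The function names and the sample lines are made up for illustration.

```python
from collections import Counter


def summarise(lines):
    """Count entries by protocol and, for TCP, by conntrack state."""
    by_proto = Counter()
    by_tcp_state = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        proto = fields[0]
        by_proto[proto] += 1
        if proto == "tcp":
            # Assumed layout: name, number, timeout, then the TCP state.
            by_tcp_state[fields[3]] += 1
    return by_proto, by_tcp_state


def summarise_file(path="/proc/net/ip_conntrack"):
    with open(path) as handle:
        return summarise(handle)


# Made-up sample lines in the assumed layout.
SAMPLE = [
    "tcp      6 431999 ESTABLISHED src=10.0.0.2 dst=192.0.2.1 sport=1025 "
    "dport=80 src=192.0.2.1 dst=10.0.0.2 sport=80 dport=1025 [ASSURED] use=1",
    "tcp      6 118 TIME_WAIT src=10.0.0.2 dst=192.0.2.1 sport=1024 "
    "dport=80 src=192.0.2.1 dst=10.0.0.2 sport=80 dport=1024 [ASSURED] use=1",
    "udp      17 28 src=10.0.0.2 dst=192.0.2.53 sport=1026 dport=53 "
    "[UNREPLIED] src=192.0.2.53 dst=10.0.0.2 sport=53 dport=1026 use=1",
]

by_proto, by_tcp_state = summarise(SAMPLE)
```

Keeping the parser separate from the file handling makes it easy to test against captured samples without needing the live file.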

The NAT module

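NAT (network address translation) is implemented as a netfilter module. The module hooks at PRE_ROUTING for destination NAT (rewriting the destination of incoming packets before the routing decision) and at POST_ROUTING for source NAT (rewriting the source of outgoing packets just before transmission). Locally generated packets can have their destination rewritten at LOCAL_OUT.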

The NAT decision is per-connection: once the first packet of a connection has been translated, all subsequent packets get the same translation. The translation is stored in the conntrack entry, so later packets cost only the conntrack hash lookup, which is constant-time on average, rather than another walk of the NAT rules.
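The decide-once, reuse-afterwards behaviour can be sketched as a dictionary keyed by connection. This is an illustration of the principle only: the NatTable class, its port-allocation scheme and the starting port are assumptions of mine, and the real module also handles the reply direction, port conflicts and entry expiry.

```python
class NatTable:
    """Illustrative source-NAT mapping keyed by connection tuple."""

    def __init__(self, public_ip, first_port=61000):
        self.public_ip = public_ip
        self._next_port = first_port  # hypothetical allocation scheme
        self._mappings = {}           # connection -> (new_src, new_sport)
        self.decisions = 0            # how often a new decision was made

    def _decide(self, conn):
        # Stands in for the rule walk, which runs once per connection.
        self.decisions += 1
        port = self._next_port
        self._next_port += 1
        return (self.public_ip, port)

    def translate(self, src, sport, dst, dport):
        conn = (src, sport, dst, dport)
        mapping = self._mappings.get(conn)  # average constant-time lookup
        if mapping is None:
            mapping = self._decide(conn)
            self._mappings[conn] = mapping
        new_src, new_sport = mapping
        return (new_src, new_sport, dst, dport)
```

Translating the same connection twice returns the same result and leaves the decisions counter unchanged, which is the predictability that matters operationally.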

Reading the NAT code has made me more comfortable with using NAT in serious deployments. The earlier ipchains masquerading (the MASQ target) was opaque and occasionally produced subtle bugs. The 2.4 NAT is documented, well-organised, and predictable.

The string-match module

This surprised me. The string match module — which lets a rule match on arbitrary substrings of the packet payload — is a minimal Boyer-Moore string searcher embedded in the kernel.

The security implication is that you can do basic content-based filtering without an external IDS. Rules like "drop packets containing /cgi-bin/phf" are now expressible. This is not a substitute for Snort — Snort has reassembly, normalisation, and a much richer rule language — but it is useful for specific high-priority blocks.
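For readers who have not met this family of algorithms, here is a short userspace sketch. It uses the Horspool simplification of Boyer-Moore (bad-character shifts only), so it is not a port of the kernel module's searcher. The function name horspool_find and the sample payload are my own.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each byte in the needle except the last, record the distance
    # from its last occurrence to the end of the needle. Bytes that do
    # not appear there allow a full shift of m.
    shift = {byte: m - 1 - i for i, byte in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift according to the haystack byte aligned with the
        # needle's final position.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


payload = b"GET /cgi-bin/phf?Qalias=x HTTP/1.0\r\n\r\n"
match_at = horspool_find(payload, b"/cgi-bin/phf")
```

The search works on one packet's payload at a time, so a pattern split across two segments will not match. That is exactly the gap that reassembly in Snort closes.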

What this changes about my mental model

Three things.

The framework is more capable than the documentation suggests. Reading the code reveals capabilities (specific match types, particular targets) that the userspace iptables documentation does not emphasise. Some of these are useful for specific operational needs.

The performance characteristics are knowable. The conntrack lookup is hash-based, the rule-walk is sequential, the string-matcher is Boyer-Moore. The performance can be reasoned about; bottlenecks are predictable.

Custom modules are tractable. Writing a netfilter module is similar to writing a Snort preprocessor — small interface, focused responsibility, achievable in a weekend if the requirement is bounded. I am thinking about a custom module for honeypot-style packet redirection.

What I am going to do

For my own infrastructure: continue with the iptables setup, with more confidence now that I understand what the framework is doing.

For my own learning: write a small netfilter module that does something specific to my honeypot use case — probably a per-source packet-rate accumulator that emits structured events when thresholds are exceeded. The exercise will deepen my understanding further.
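Before writing any kernel code I would prototype the accounting logic in userspace. The sketch below is one possible shape for it, assuming a sliding time window per source address. The RateAccumulator class, the event field names and the threshold values are all hypothetical, and a kernel module would need bounded memory and locking that this ignores.

```python
from collections import defaultdict, deque


class RateAccumulator:
    """Per-source sliding-window packet counter (userspace prototype)."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self._times = defaultdict(deque)  # source IP -> recent timestamps

    def observe(self, src, now):
        """Record one packet; return an event dict if over threshold."""
        times = self._times[src]
        times.append(now)
        # Forget packets that have fallen out of the window.
        while times and times[0] <= now - self.window:
            times.popleft()
        if len(times) > self.threshold:
            return {
                "event": "rate_exceeded",
                "src": src,
                "count": len(times),
                "window_seconds": self.window,
                "time": now,
            }
        return None
```

As written, this emits an event for every packet while a source stays over the threshold. A real module would probably want to suppress repeats, but that policy is easier to settle once the basic counting is known to be right.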

For my own writing: more of my detection-related posts will assume netfilter familiarity. The migration is now widespread enough among operators that the assumption is reasonable.

More as the year develops.

