Reading the kernel network stack

Last weekend I sat down with the source for Linux 2.2.1 and tried to read the TCP stack end to end. I have been writing about networks for a year on the assumption that I understood TCP because I had read Stevens, and I wanted to find out whether that assumption was earned.

It was not. But I learned more in one evening of source-reading than in three months of anything else I have done in this field.

What I expected

I expected the TCP code to be a tangled mass of state-machine logic, full of edge cases and historical patches, with security-critical logic spread across many files in non-obvious ways. The reputation of kernel networking is that it is byzantine.

What I found

The state machine itself is implemented essentially in two files: net/ipv4/tcp.c and net/ipv4/tcp_input.c. They are large, several thousand lines between them, but they are organised around a small number of named functions that correspond, in many cases almost step for step, to the descriptions in the RFCs.

For instance, tcp_rcv_state_process() handles incoming segments based on which state the connection is in. It is structured as a switch on the connection state: TCP_LISTEN, TCP_SYN_RECV, TCP_ESTABLISHED, and so on. Each branch is short. Each branch refers, in comments, to the section of RFC 793 that describes the corresponding behaviour.
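The shape described above can be sketched in a few lines of C. This is a toy, not the kernel's code: the enum, the struct, and the function name toy_rcv_state_process() are hypothetical, only three segment flags are modelled, and no replies are actually sent. It only illustrates the "switch on state, one short branch per RFC paragraph" structure.

```c
/* Hypothetical, simplified state set; the real kernel enum has more members. */
enum toy_state {
    TOY_LISTEN,
    TOY_SYN_SENT,
    TOY_SYN_RECV,
    TOY_ESTABLISHED,
    TOY_CLOSED
};

/* The only header flags this toy looks at. */
struct toy_segment {
    int syn;
    int ack;
    int rst;
};

/*
 * Return the next state for a connection in `state` that receives `seg`.
 * Branch comments point at RFC 793 section 3.9, "SEGMENT ARRIVES".
 */
enum toy_state toy_rcv_state_process(enum toy_state state,
                                     const struct toy_segment *seg)
{
    switch (state) {
    case TOY_LISTEN:
        /* "If the state is LISTEN": ignore RST, refuse ACK, accept SYN. */
        if (seg->rst || seg->ack)
            return TOY_LISTEN;
        if (seg->syn)
            return TOY_SYN_RECV;
        return TOY_LISTEN;

    case TOY_SYN_SENT:
        /* "If the state is SYN-SENT": a RST only counts if it carries an ACK. */
        if (seg->rst)
            return seg->ack ? TOY_CLOSED : TOY_SYN_SENT;
        if (seg->syn && seg->ack)
            return TOY_ESTABLISHED;
        if (seg->syn)
            return TOY_SYN_RECV;   /* simultaneous open */
        return TOY_SYN_SENT;

    case TOY_SYN_RECV:
        /* Toy simplification: the RFC returns a passive open to LISTEN on RST. */
        if (seg->rst)
            return TOY_CLOSED;
        if (seg->ack)
            return TOY_ESTABLISHED;
        return TOY_SYN_RECV;

    case TOY_ESTABLISHED:
        if (seg->rst)
            return TOY_CLOSED;
        return TOY_ESTABLISHED;

    case TOY_CLOSED:
        return TOY_CLOSED;
    }
    return state;
}
```

Feeding a bare SYN to a connection in TOY_LISTEN returns TOY_SYN_RECV, and a following ACK returns TOY_ESTABLISHED, which is the passive-open half of the three-way handshake.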

Surprising as it sounds, this is easier to read than most application code I work with daily. The discipline of "this branch implements section 3.7, paragraph 3 of the RFC" is one I rarely see in user-space code.

The security-relevant bits

The code that matters most for security is the part that handles connection state transitions, particularly the SYN-handling code.

The tcp_v4_conn_request() function, in net/ipv4/tcp_ipv4.c, is where an incoming SYN is processed. It allocates a structure for the half-open connection, records the sender's details in it, and replies with a SYN-ACK. This is the function whose behaviour determines how the kernel reacts to a SYN flood.

Reading this function in detail explains a few things that I had previously understood only second-hand. The kernel maintains a SYN queue: a list of half-open connections waiting for the third packet of the handshake. The queue is bounded. When it fills up, new SYNs are dropped. This is the precise mechanism by which a SYN flood denies service: the attacker fills the queue with SYNs from spoofed source addresses, the legitimate SYNs cannot get in, and the service appears unreachable.
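The queue-exhaustion mechanism is easy to model. The following is a minimal sketch in C, assuming a fixed-size array and a deliberately tiny limit; the names (toy_syn_queue, toy_syn_enqueue, toy_syn_complete, TOY_SYN_BACKLOG) are hypothetical and the kernel's own data structures and timeout handling are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SYN_BACKLOG 4   /* hypothetical limit, tiny on purpose */

/* What the toy remembers about one half-open connection. */
struct toy_open_request {
    uint32_t saddr;
    uint16_t sport;
};

struct toy_syn_queue {
    struct toy_open_request slots[TOY_SYN_BACKLOG];
    size_t len;              /* half-open connections currently queued */
    unsigned long dropped;   /* SYNs refused because the queue was full */
};

/* Handle an incoming SYN. Returns 1 if queued, 0 if dropped. */
int toy_syn_enqueue(struct toy_syn_queue *q, uint32_t saddr, uint16_t sport)
{
    if (q->len == TOY_SYN_BACKLOG) {
        q->dropped++;        /* this is the denial of service */
        return 0;
    }
    q->slots[q->len].saddr = saddr;
    q->slots[q->len].sport = sport;
    q->len++;
    return 1;
}

/* Handle the final ACK: free the matching slot. Returns 1 if one was found. */
int toy_syn_complete(struct toy_syn_queue *q, uint32_t saddr, uint16_t sport)
{
    for (size_t i = 0; i < q->len; i++) {
        if (q->slots[i].saddr == saddr && q->slots[i].sport == sport) {
            q->slots[i] = q->slots[q->len - 1];   /* fill the hole */
            q->len--;
            return 1;
        }
    }
    return 0;
}
```

Spoofed SYNs never call toy_syn_complete(), because the spoofed host never sends the final ACK. Four of them fill the toy queue, and the fifth SYN, from a legitimate client, is the one that gets dropped.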

The defensive code I was reading also implements SYN cookies, Dan Bernstein's clever way of defeating the queue-exhaustion attack. With SYN cookies, the kernel does not need to allocate state for an incoming SYN; instead, it cryptographically constructs the initial sequence number for the SYN-ACK such that the eventual ACK can be verified without remembering the original SYN. The state goes onto the wire instead of into kernel memory. In Linux, cookies are a fallback that takes over once the queue is full, and only if they have been enabled, rather than a replacement for the queue.
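The idea can be sketched in C under some loud assumptions. The bit layout below (5 bits of a coarse time counter, 3 bits of MSS index, 24 bits of hash) follows Bernstein's published description, not the arithmetic in the kernel's cookie_v4_init_sequence(), which differs. The mixing function toy_mix() is a stand-in that is not cryptographically secure; a real implementation needs a keyed cryptographic hash. All names are hypothetical.

```c
#include <stdint.h>

/* Everything the server can re-read from the final ACK packet. */
struct toy_flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint32_t client_isn;
};

/* NOT cryptographic: a placeholder so the sketch is self-contained. */
static uint32_t toy_mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 16777619u;
    h ^= h >> 15;
    return h;
}

/* 24-bit hash over the flow, a server secret, and the time counter t. */
static uint32_t toy_hash(const struct toy_flow *f, uint32_t secret, uint32_t t)
{
    uint32_t h = 2166136261u;
    h = toy_mix(h, secret);
    h = toy_mix(h, f->saddr);
    h = toy_mix(h, f->daddr);
    h = toy_mix(h, ((uint32_t)f->sport << 16) | f->dport);
    h = toy_mix(h, f->client_isn);
    h = toy_mix(h, t);
    return h & 0x00ffffffu;
}

/* Build the server's initial sequence number; nothing is stored. */
uint32_t toy_cookie_make(const struct toy_flow *f, uint32_t secret,
                         uint32_t now, unsigned mss_idx)
{
    return ((now & 0x1fu) << 27)
         | ((uint32_t)(mss_idx & 0x7u) << 24)
         | toy_hash(f, secret, now);
}

/*
 * Validate a cookie recovered from the final ACK (its ack number minus one).
 * Returns the MSS index (0..7) on success, -1 if stale or forged.
 */
int toy_cookie_check(const struct toy_flow *f, uint32_t secret,
                     uint32_t now, uint32_t cookie, uint32_t max_age)
{
    uint32_t age = (now - (cookie >> 27)) & 0x1fu;   /* counter ticks elapsed */

    if (age > max_age)
        return -1;
    if ((cookie & 0x00ffffffu) != toy_hash(f, secret, now - age))
        return -1;
    return (int)((cookie >> 24) & 0x7u);
}
```

The server keeps only the secret and a clock. Everything else it needs to rebuild the connection (addresses, ports, the client's sequence number, and a coarse MSS) either arrives again in the ACK or is packed into the cookie itself.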

Reading the SYN cookie generation code, the function cookie_v4_init_sequence() in net/ipv4/syncookies.c, was the moment when I properly understood why this technique works. It is genuinely clever, and the code is short enough to be read in five minutes. I had read about SYN cookies in a security article a year ago and thought I understood them. I did not. There is no substitute for reading the source.

The bit I had not appreciated

The network stack does not exist in isolation. It is woven into the kernel's general memory management, scheduling, and locking. Reading the TCP code led me, repeatedly, into the management code for sk_buff, the structure that represents a packet through its entire lifetime in the kernel.

The sk_buff is interesting because it is shared across nearly every networking subsystem. The same buffer is touched by the link-layer driver, the IP layer, TCP, the socket layer, and the application via the socket interface. Every one of those touches is an opportunity for a bug: a use-after-free, an off-by-one in a length calculation, a missing reference count. Several historical kernel security bugs have been in this management layer specifically.
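The three bug classes named above map onto three small pieces of bookkeeping. The sketch below is a toy packet buffer in C, loosely inspired by the sk_buff idea of a data pointer moving inside a fixed allocation; the struct and every function name (toy_buf_alloc, toy_buf_hold, toy_buf_release, toy_buf_append, toy_buf_pull) are hypothetical and much simpler than the kernel's.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_buf {
    unsigned char *head;   /* start of the allocation */
    unsigned char *data;   /* start of the valid bytes */
    size_t len;            /* number of valid bytes */
    size_t size;           /* bytes allocated at head */
    int users;             /* reference count */
};

struct toy_buf *toy_buf_alloc(size_t size)
{
    struct toy_buf *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->head = malloc(size ? size : 1);
    if (!b->head) {
        free(b);
        return NULL;
    }
    b->data = b->head;
    b->len = 0;
    b->size = size;
    b->users = 1;
    return b;
}

/* Each layer that keeps a pointer takes a reference first. */
struct toy_buf *toy_buf_hold(struct toy_buf *b)
{
    b->users++;
    return b;
}

/* Drop a reference. Returns 1 if this was the last one and b was freed. */
int toy_buf_release(struct toy_buf *b)
{
    if (--b->users > 0)
        return 0;
    free(b->head);
    free(b);
    return 1;
}

/* Reserve n bytes at the tail. NULL if that would overrun the allocation. */
unsigned char *toy_buf_append(struct toy_buf *b, size_t n)
{
    size_t used = (size_t)(b->data - b->head) + b->len;
    unsigned char *tail;

    if (n > b->size - used)   /* the length check an off-by-one gets wrong */
        return NULL;
    tail = b->data + b->len;
    b->len += n;
    return tail;
}

/* Strip an n-byte header from the front. NULL if fewer than n bytes remain. */
unsigned char *toy_buf_pull(struct toy_buf *b, size_t n)
{
    if (n > b->len)
        return NULL;
    b->data += n;
    b->len -= n;
    return b->data;
}
```

A forgotten toy_buf_hold() means some other layer's toy_buf_release() frees the buffer while a pointer to it is still live, which is the use-after-free. A wrong comparison in toy_buf_append() or toy_buf_pull() lets a length field taken from the packet walk the pointers outside the allocation.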

This is also the part that explains why kernel-level vulnerabilities are catastrophic. A bug in sk_buff handling can be triggered by malformed packets arriving on the wire, before any user-space code has had a chance to filter them. The bug runs in kernel context, with full privileges. Once kernel memory has been corrupted, there is usually no recovery short of rebooting.

What I am taking away

Three things.

One, the kernel is readable. I had built up an unjustified mental block about reading kernel source. The state-machine functions in tcp_input.c are easier to follow than most code I write.

Two, RFCs are useful. I had treated them as reference material. They are reference material, but for someone reading the source, they are also a reading guide. "This function implements RFC 793 section X" tells you both what the function does and what authoritative description to consult when something is unclear.

Three, the historical bugs make more sense after reading the code. Every time I read a Bugtraq post that says "a bug in kernel code path X allows Y", I now have a much better mental model of where path X actually is and why the bug is hard to spot. That makes future Bugtraq reading more valuable, not less.

I am going to keep doing this. The next thing on the reading list is netfilter, the new framework being built in the 2.3 development series to replace ipchains. If I am going to be writing firewall rules for the next decade, knowing the actual code that interprets them is going to pay off.