Linux capabilities, slowly becoming usable

POSIX capabilities arrived in the Linux kernel with version 2.2, which I wrote about in May. The kernel-side support has been there for nearly a year. The user-space tooling has been catching up slowly, and is now at the point where capabilities are usable, though still rough, for careful small-scale use.

This post is a primer on what capabilities are, why they matter, and how I have started using them in my own setup.

The problem capabilities solve

In classical Unix, the security boundary is binary. A process is either root (UID 0), in which case it can do essentially anything, or it is not root, in which case there are many things it cannot do. There is no middle.

This binary model is the source of an enormous amount of pain. A process that needs one root-only privilege — say, the ability to put a network interface into promiscuous mode — has to run as full root, with all the other root privileges enabled, even though it does not need them.

The consequence: if the process has a bug — a buffer overflow, a path traversal, anything that lets an attacker influence its execution — the attacker inherits full root, regardless of how minimal the legitimate privileges of the process actually were.

Capabilities slice root into pieces. There is a defined set of named privileges: CAP_NET_BIND_SERVICE (bind to ports below 1024), CAP_NET_RAW (open raw sockets), CAP_NET_ADMIN (configure network interfaces), CAP_SYS_ADMIN (a catch-all for various system administration tasks), CAP_DAC_OVERRIDE (override file permission checks), and about twenty others.

A process can hold any subset of these capabilities. A process holding only CAP_NET_RAW can open the raw and packet sockets a sniffer needs but cannot read anyone else's files. A process holding only CAP_NET_BIND_SERVICE can bind to port 80 but cannot do anything else of consequence.

The binary distinction between root and non-root gives way to a richer space: privilege can be granted at exactly the level needed.
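To make "any subset" concrete: a capability set is just a bitmask, one bit per capability number. The following is a minimal sketch, assuming a kernel that reports the effective set as a hexadecimal CapEff line in /proc/self/status. The function names mask_has_cap and read_effective_mask are my own, not part of any library.

```c
#include <stdio.h>
#include <linux/capability.h>   /* CAP_* capability numbers */

/* Hypothetical helper: 1 if capability number cap is set in mask, else 0. */
int mask_has_cap(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}

/* Hypothetical helper: read this process's effective set from
 * /proc/self/status (assumes a "CapEff:" line in hex).
 * Returns 0 on success, -1 if the file or the line is missing. */
int read_effective_mask(unsigned long long *mask)
{
    char line[256];
    int found = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (sscanf(line, "CapEff: %llx", mask) == 1) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A caller would run read_effective_mask(&m) and then ask, for example, mask_has_cap(m, CAP_NET_RAW). For a process that has dropped everything but CAP_NET_RAW, that should be the only bit that comes back set.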

What this enables, if it is used

The immediate application is to network daemons. A web server, for instance, needs to bind to port 80 (privileged) and that is essentially all it needs. Under the binary model it runs as root for the privileged setup, then drops to a non-root user; under capabilities, it could be granted only CAP_NET_BIND_SERVICE and could perform the bind without ever being root.

The same applies to tcpdump, which needs only CAP_NET_RAW, and to traceroute, ping, and a long list of network utilities. Each of these has historically been setuid root or run directly by root, with all the risk that implies. With capabilities, each could be granted exactly the privilege it needs and no more.

This would, in theory, eliminate a large category of privilege-escalation vulnerabilities. A bug in a binary that has only CAP_NET_RAW is dramatically less dangerous than the same bug in a binary that has full root.

Why the transition is slow

The kernel support has been there since 2.2. The user-space transition has been slow for several reasons.

File capabilities are not yet supported. In the current Linux model, capabilities are properties of processes, not of files. A binary cannot be marked "runs with CAP_NET_RAW only" the way it can be marked setuid. The capabilities have to be granted by an explicit caller. This means the current setuid-root pattern has no clean drop-in replacement.

File capabilities are coming — there is ongoing work to add them — but the current 2.2 kernel does not have them. So the user-space tooling cannot yet use them.

The capabilities API is rough. The system calls for setting and getting capabilities — capset, capget — are usable but not pleasant. The constants are defined in headers that have changed across kernel versions. The library support (libcap) is improving but still requires care.
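To give a feel for the roughness, here is a sketch of reading the effective set through the raw capget call, without libcap. It assumes the kernel headers provide <linux/capability.h> with the __user_cap_header_struct and __user_cap_data_struct types, and that no ABI version uses more than two 32-bit words per set. Rather than hard-coding a version constant, it asks the kernel which version it prefers. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel which capability ABI version
 * it prefers. Passing an invalid version (0) makes the kernel write
 * its preferred version back into the header. The return value of
 * the probe is ignored; only the header is inspected. */
unsigned int preferred_cap_version(void)
{
    struct __user_cap_header_struct hdr;

    hdr.version = 0;
    hdr.pid = 0;
    (void)syscall(SYS_capget, &hdr, NULL);
    return hdr.version;
}

/* Hypothetical helper: copy the calling process's effective set
 * into eff[0..1]. Assumes at most two 32-bit words per set.
 * Returns 0 on success, -1 on failure. */
int read_effective(unsigned int eff[2])
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[2];

    memset(data, 0, sizeof data);
    hdr.version = preferred_cap_version();
    hdr.pid = 0;                        /* 0 means the calling process */
    if (hdr.version == 0 || syscall(SYS_capget, &hdr, data) == -1)
        return -1;
    eff[0] = data[0].effective;
    eff[1] = data[1].effective;
    return 0;
}
```

The version handshake, the untyped syscall, and the hand-sized buffer are exactly the details libcap exists to hide, which is the argument for using the library wherever it is available.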

Most daemons have not been updated. Apache, sendmail, BIND, and the other major daemons still use the start-as-root-then-drop-privileges pattern. They could use capabilities; they do not. Updating them is a non-trivial change to mature codebases that few maintainers want to make speculatively.

Distribution support is minimal. No major distribution ships binaries with capability information. The infrastructure to mark binaries as "this should run with capability X" does not exist in the package management tools.

What can actually be done today

A few things, with care.

Drop unnecessary capabilities at process start. A daemon that does not need CAP_DAC_OVERRIDE can drop it shortly after starting up. The relevant calls are in <sys/capability.h>. The discipline is to enumerate what you actually need and drop the rest.

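The call sequence, roughly (link with -lcap):

#include <sys/capability.h>

/* Illustrative helper: reduce the calling process to CAP_NET_BIND_SERVICE; 0 on success, -1 on error. */
int keep_only_bind(void)
{
    cap_value_t keep[] = { CAP_NET_BIND_SERVICE };
    cap_t caps = cap_get_proc();              /* current capability sets */
    int failed = caps == NULL
        || cap_clear(caps) == -1              /* empty all three sets */
        || cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET) == -1
        || cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET) == -1
        || cap_set_proc(caps) == -1;          /* apply to this process */
    if (caps != NULL)
        cap_free(caps);
    return failed ? -1 : 0;
}

Every one of these calls can fail, and a failed drop must be treated as fatal: a daemon that carries on after a failed cap_set_proc is still running with everything it started with. On success, the result is a process that has only the bind-to-privileged-ports capability, regardless of what it could have done before.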

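Run a wrapper that drops privileges. For daemons you do not control, a wrapper program can be written. It starts as root, sets up its own capabilities, drops to a non-root user, and execs the daemon binary. The daemon then runs with only the capabilities the wrapper retained. Check the result rather than assuming it: the kernel recomputes capabilities at exec, so confirm what the daemon actually holds by reading the CapPrm and CapEff lines in /proc/<pid>/status.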

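Use prctl to keep a capability across the UID change. Normally, when a process changes from UID 0 to a non-root UID with setuid(), the kernel clears all of its capabilities. Calling prctl(PR_SET_KEEPCAPS, 1) beforehand tells the kernel to leave the permitted set intact across that change. The effective set is still cleared, so after the setuid() the process has to raise the capability it wants back into its effective set, and should drop everything else from the permitted set. This is the mechanism by which a daemon can start as root, drop to a regular user, and still retain a single capability.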
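The order of operations matters here, so a sketch may help. It is written against the raw capset call to stay free of library dependencies (the libcap equivalents are cap_set_flag and cap_set_proc), and it makes the same assumptions as any raw-syscall code: the header types from <linux/capability.h> and at most two 32-bit words per set. The names single_cap and drop_root_keep_cap are hypothetical. The second function is meant to be called by a process that is still root.

```c
#define _GNU_SOURCE
#include <grp.h>
#include <linux/capability.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill data[] so that exactly one capability
 * is permitted and effective, and nothing is inheritable. */
void single_cap(int cap, struct __user_cap_data_struct data[2])
{
    memset(data, 0, 2 * sizeof data[0]);
    data[cap >> 5].permitted = 1u << (cap & 31);
    data[cap >> 5].effective = 1u << (cap & 31);
}

/* Hypothetical helper: switch to uid/gid while keeping one capability.
 * Must be called as root. Returns 0 on success, -1 on any failure;
 * the caller should treat failure as fatal. */
int drop_root_keep_cap(uid_t uid, gid_t gid, int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[2];

    /* Probe the kernel's preferred ABI version (return value ignored). */
    hdr.version = 0;
    hdr.pid = 0;
    (void)syscall(SYS_capget, &hdr, NULL);
    if (hdr.version == 0)
        return -1;

    /* 1. Ask the kernel to keep the permitted set across the UID change. */
    if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1)
        return -1;

    /* 2. Drop groups and IDs while we still have the privilege to. */
    if (setgroups(0, NULL) == -1 || setgid(gid) == -1 || setuid(uid) == -1)
        return -1;

    /* 3. Shrink permitted to the one capability and raise it again
     *    in the effective set, which the UID change cleared. */
    single_cap(cap, data);
    if (syscall(SYS_capset, &hdr, data) == -1)
        return -1;

    /* 4. The flag has done its job; switch it off again. */
    return prctl(PR_SET_KEEPCAPS, 0, 0, 0, 0);
}
```

A daemon would call something like drop_root_keep_cap(daemon_uid, daemon_gid, CAP_NET_BIND_SERVICE) once at startup, before touching the network, and exit if it returns -1.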

What I have actually done

For my own infrastructure, modest experiments:

A capability-aware tcpdump wrapper. A small C program that takes the tcpdump arguments, sets up a process with only CAP_NET_RAW, and execs tcpdump. This lets me run tcpdump as a non-root user without making the binary setuid root. The wrapper itself is setuid root, but it is a tiny program (twenty lines), which is much easier to audit than the full tcpdump.

A web server with no setuid. A small Apache replacement (intended only for testing) that starts non-root, requests CAP_NET_BIND_SERVICE from a daemon I have written, and binds to port 80 with the granted capability. This is more of a proof-of-concept than a production tool. The Apache project is not going to adopt this pattern soon.

Where this is going

My expectation is that the next two or three years will see:

File capabilities arriving in mainstream kernels.
Major daemons being updated to use them.
Distributions starting to ship binaries with appropriate capability information.
The setuid-root binary becoming a relic, like raw root logins.

This is going to take a while. Each step depends on the previous and on the willingness of maintainers to make non-trivial changes. The improvement is structural and the timescale is years.

For an operator paying attention, the right thing to do today is to understand the model so that, when it matures, you can adopt it quickly. The user-space tooling is rough but the principles are not. Reading the capabilities(7) manual page and writing a small experiment yourself is the cheapest way to develop the understanding.

The binary root model was the right answer in 1979. Twenty years on, we are overdue for the next answer. Capabilities are it. The transition is slow but inevitable.