Path-based attacks against web servers

Last week's log scanner caught a series of probes against my web server that all had a particular shape: URLs containing ../ segments, URLs ending in ..%2f..%2f, URLs with %00 in the middle, URLs with backslashes where the server expected forward slashes.

This is the family of attacks called path traversal or directory traversal. They are old. They are still everywhere. And the pattern of what produces them is consistent enough that it is worth writing down.

The fundamental mismatch

A web server takes a URL — a string — and turns it into a request for a file. The file is somewhere on the local filesystem. The mapping from URL to filename is the security boundary.

For most servers, the mapping is built by:

  1. Take the document root, e.g. /var/www/html.
  2. Append the path component of the URL, e.g. /about/contact.html.
  3. The resulting filename is /var/www/html/about/contact.html.

Step 2 is where the trouble lives. If the URL contains something like ../../../etc/passwd, the literal concatenation produces /var/www/html/../../../etc/passwd. The operating system, when asked to open this filename, follows the .. segments to resolve them. The result is /etc/passwd. The web server cheerfully reads it.
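The concatenation and the operating system's resolution can be reproduced in a few lines of Python. This is only a sketch of the mapping described above; os.path.normpath stands in for the kernel's .. resolution:

```python
import os.path

document_root = "/var/www/html"
url_path = "../../../etc/passwd"

# Step 2, taken literally: append the URL path to the document root.
naive = document_root + "/" + url_path
# naive is now "/var/www/html/../../../etc/passwd"

# What the operating system effectively opens after following the
# .. segments. normpath performs the same resolution lexically.
resolved = os.path.normpath(naive)
print(resolved)  # -> /etc/passwd
```

The document root has vanished entirely from the resolved path, which is the whole attack.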

This is path traversal in its simplest form. It has been documented for at least 15 years. It is still behind an enormous fraction of vulnerability-scanner probes and exploit attempts.

The variants you have to handle

The naive defence is: reject any URL containing ../. This works against the simplest attacker. It does not work against any of the following:

URL-encoded variants. %2e%2e%2f is the encoded form of ../. Many web servers check for ../ first and decode the URL afterwards, which is the wrong order: the check runs before the encoded form has been turned into ../, so it slips through.

Double-encoded variants. %252e%252e%252f is the encoding of the encoding. If the server decodes once and then checks, this passes. If the server decodes twice (which some do, after passing through certain CGI gateways), the attack succeeds.
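The double-encoding trick is easy to demonstrate with any standard URL decoder; Python's urllib.parse.unquote serves here, with the strings taken from the text:

```python
from urllib.parse import unquote

probe = "%252e%252e%252f"   # the double-encoded form of ../

once = unquote(probe)       # one decode pass, as the server performs
print(once)                 # -> %2e%2e%2f  (no literal ../ yet, so a
                            #    check at this layer lets it through)

twice = unquote(once)       # a second pass, e.g. in a CGI gateway
print(twice)                # -> ../        (the payload appears)
```

A check between the two passes sees only %2e%2e%2f and is satisfied; the filesystem sees ../.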

Mixed separators. On Windows-based web servers, ..\..\..\etc\passwd is equivalent to ../../../etc/passwd. Code that checks only for forward slashes misses the backslash variant.

Null-byte truncation. In some languages — notably C — strings end at the first null byte. A URL like safe.html%00../../../etc/passwd may be checked as safe.html (with the null byte stopping the comparison) but opened as the longer string. The defence and the action use different string boundaries.

Unicode tricks. Some Unicode characters are visually similar to slashes, or can be normalised into slashes by some libraries but not others. The 2001 IIS Unicode bug is the most famous example, and was, technically, a path-traversal variant by way of a normalisation mismatch.

The right way to defend

None of the above is fixed by adding more patterns to the blacklist. That game is unwinnable. The right architecture is:

  1. Decode the URL fully, with all encodings and normalisations applied, in one explicit step. Whatever your library does for this, do it once and inspect the output.
  2. Resolve the path against the document root as the operating system would, including following .. segments and symbolic links. The result is the actual filename that would be opened.
  3. Check that the resulting filename is still inside the document root. This is a string-prefix check on the resolved path, not on the original URL.

The third step is the one most defenders skip. They check the URL. They do not check the result of resolving the URL. Path traversal is fundamentally a property of the resolved path, so that is the only check that works.

In pseudocode:

url_path = decode_fully(url_path)
full_filename = canonicalise(document_root + '/' + url_path)
if not full_filename.starts_with(document_root + '/'):
    refuse the request

The canonicalise step does the symlink-following, the .. resolution, the case-folding on case-insensitive filesystems, and so on. The starts_with step is the actual defence.
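A runnable version of the pseudocode, in Python. The name safe_resolve is mine, unquote stands in for decode_fully, and os.path.realpath plays the role of canonicalise (it follows symlinks and collapses .. segments); the startswith line is the defence itself:

```python
import os
from urllib.parse import unquote

def safe_resolve(url_path, document_root="/var/www/html"):
    # Step 1: decode fully, in one explicit place. Decoding to a
    # fixpoint is one possible policy; the essential property is that
    # no further decoding happens after this point.
    decoded = unquote(url_path)
    while unquote(decoded) != decoded:
        decoded = unquote(decoded)

    # Step 2: resolve the path as the operating system would.
    root = os.path.realpath(document_root)
    candidate = os.path.realpath(os.path.join(root, decoded.lstrip("/")))

    # Step 3: the actual defence -- a prefix check on the RESOLVED path,
    # not on the original URL.
    if candidate != root and not candidate.startswith(root + os.sep):
        raise PermissionError("path escapes the document root")
    return candidate
```

Note that the document root itself is canonicalised before the comparison; otherwise a symlinked root would defeat the prefix check.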

The Apache case, specifically

Apache 1.3, which is what I run, does this correctly out of the box for static content. The DocumentRoot directive and <Directory> blocks together prevent files outside the configured tree from being served, regardless of the URL path used to ask for them.

The danger area is CGI scripts. Apache does not protect CGI scripts from doing this badly themselves. If a CGI script takes a filename from the URL and opens it, all of the above failure modes are the script's responsibility to handle. They almost never do.

This is why the PHF bug was so devastating in its day. PHF was a CGI script that, due to a different but related class of bug, allowed arbitrary command execution on the server. The CGI was the gap; Apache's static-file safety was bypassed.

If you write CGI scripts, treat every filename-like input from the user as hostile. Pass it through your equivalent of canonicalise, then check the result is in the directory you expect. There is no shortcut.
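A sketch of what that discipline looks like inside a script, in modern Python rather than the Perl of the era. TEMPLATE_DIR, template_path, and the file parameter are all hypothetical names for illustration:

```python
import os
from urllib.parse import parse_qs

TEMPLATE_DIR = "/var/www/templates"  # hypothetical directory the script serves from

def template_path(query_string):
    # parse_qs URL-decodes the parameter values once; do not decode again.
    params = parse_qs(query_string)
    requested = params.get("file", [""])[0]

    # Canonicalise, then check the RESULT -- never the raw input.
    root = os.path.realpath(TEMPLATE_DIR)
    full = os.path.realpath(os.path.join(root, requested.lstrip("/")))
    if not full.startswith(root + os.sep):
        raise PermissionError("requested file is outside the template directory")
    return full
```

The shape is identical to the server-level defence; the only difference is that in a CGI script nobody else will do it for you.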

What the probes I caught looked like

From my logs last week, in chronological order from the same source IP:

GET /../../../../etc/passwd
GET /%2e%2e/%2e%2e/%2e%2e/etc/passwd
GET /..%2f..%2f..%2fetc%2fpasswd
GET /cgi-bin/phf?Qalias=x%0acat%20/etc/passwd
GET /cgi-bin/test.cgi/../../etc/passwd

All within about thirty seconds. This is a scanner running through a fixed list of known probes. Apache responded with 404s to the static-file ones (it correctly resolved the paths and refused). It would have responded the same way to the CGI ones if I had any of those scripts; I do not, so they all 404ed.

The interesting feature of the list is its predictability. The scanner is not exercising creativity; it is running a known recipe. Which means a small set of Snort rules covers most of what is in the wild.

The general lesson

Path traversal is a specific class of bug, but it is an instance of the larger pattern: an input that travels through multiple layers of decoding, where the security check is in one layer and the dangerous action is in another. Defence requires aligning the layer of the check with the layer of the action.

This pattern shows up everywhere. It will show up next year and the year after, in different forms. Anyone who builds an intuition for it from path traversal will recognise it when it shows up as SQL injection, command injection, header injection, deserialisation bugs, or any of the future categories that have not been named yet.

The specific bugs are temporary. The pattern is forever.
