CGI vulnerabilities and the joy of reading the source

I have been writing a small CGI app this week to teach myself how Apache and Perl talk to each other. Like every other CGI app in the world, mine started life full of subtle, exploitable mistakes.

I want to walk through two of them. They are both the kind of bug that has been showing up on Bugtraq in commercial products this whole year. Reading my own bad code teaches the lesson better than reading someone else's.

Mistake one: passing user input to the shell

The app had a feature where the user could enter a hostname and the page would print the whois information for it. My first version of this looked, more or less, like:

my $host = $cgi->param('host');
my $output = `whois $host`;
print $output;

This is wrong. Perl's backtick operator interpolates $host into one string and hands that string to the shell. The shell knows nothing about a boundary between the program name whois and its argument $host; it simply parses the whole string. If the user passes a $host value of example.com; cat /etc/passwd, the shell happily runs both commands, and the output of the second one is included in the page.

This is so old a class of mistake that there is a name for it: shell injection. It is the same family of bug that hits commercial products with weary regularity. The fix is to never pass user input through the shell. Ever.

The corrected version uses a list-form invocation:

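my $host = $cgi->param('host');
# Accept only hostname characters; this also keeps a leading "-" from being read as a whois option.
die "invalid hostname\n" unless defined $host && $host =~ /\A[A-Za-z0-9][A-Za-z0-9.-]*\z/;
# List form: whois is executed directly, with $host as a single argument.
open(my $fh, '-|', 'whois', $host) or die "cannot run whois: $!";
my $output = do { local $/; <$fh> };   # slurp all of the output
close $fh;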

This runs whois directly, with $host as an argument list element. The shell is not involved. The semicolon trick stops working.
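
The difference is easy to see on a Unix-like machine with a harmless stand-in. What follows is only a sketch: echo takes the place of whois so that nothing is actually looked up, the hostile value is hard-coded, and the sub names run_via_shell and run_via_list are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical hostile input; in the CGI script this would come from $cgi->param('host').
my $host = 'example.com; echo INJECTED';

# Shell form: the whole string goes to the shell, which treats ';' as a command separator.
sub run_via_shell {
    my ($arg) = @_;
    return scalar `echo $arg`;
}

# List form: echo is executed directly and $arg arrives as one single argument.
sub run_via_list {
    my ($arg) = @_;
    open(my $fh, '-|', 'echo', $arg) or die "cannot run echo: $!";
    my $out = do { local $/; <$fh> };   # slurp all of the output
    close $fh;
    return $out;
}

print "shell: ", run_via_shell($host);
print "list:  ", run_via_list($host);
```

In the shell form the second echo runs as its own command, so INJECTED comes back on a line of its own. In the list form the semicolon is just another character in the argument, and the whole string is echoed back on one line.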

Mistake two: trusting the filename the user sends

The second feature was a way for the user to view the contents of a small text file. I had the bright idea of letting the user choose which file:

my $name = $cgi->param('file');
open(my $fh, '<', "/var/www/notes/$name") or die;

The attacker passes a $name of ../../../etc/passwd. The script obligingly opens /var/www/notes/../../../etc/passwd, which the operating system kindly resolves to /etc/passwd, and the contents are read straight back to the attacker.

This is path traversal. Again, an old, well-understood mistake. Again, the fix is at the boundary: do not trust the user to supply a filename that may contain path separators. Reject any name that contains anything other than letters and digits. Or, better, look up the user-provided string in a table of allowed names, and reject anything that is not in the table.
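
A lookup table of that kind might look like the following. It is a minimal sketch, not what the app necessarily ends up with: the names todo and readme, the file paths, and the sub note_path are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical table: names a user may ask for, mapped to files chosen by the author.
my %NOTE_FILE = (
    todo   => '/var/www/notes/todo.txt',
    readme => '/var/www/notes/readme.txt',
);

# Return the path for an allowed name, or undef for anything else.
sub note_path {
    my ($name) = @_;
    return unless defined $name && exists $NOTE_FILE{$name};
    return $NOTE_FILE{$name};
}

for my $name ('todo', '../../../etc/passwd') {
    my $path = note_path($name);
    print "$name => ", (defined $path ? $path : 'rejected'), "\n";
}
```

The point of the design is that the user's string is never pasted into a path at all. It is only used as a hash key, so the only paths the script can ever open are the ones written into the table.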

Why CGI gets these wrong so often

The interface is, structurally, hostile. The web server takes a chunk of network input and hands it to your script as environment variables and standard input. From the script's point of view, everything is user-controlled. There is no border that automatically cleans things on the way in.
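
To make that concrete, here is a rough sketch of where the raw data arrives. QUERY_STRING, HTTP_COOKIE and CONTENT_LENGTH are the standard CGI variable names; the sub raw_cgi_input and the shape of the hash it returns are my own invention for this example, and a real script would normally let CGI.pm do this work.

```perl
use strict;
use warnings;

# Collect the raw request data the way a CGI script receives it.
# $env is a hash ref standing in for %ENV; $in is a filehandle standing in for STDIN.
sub raw_cgi_input {
    my ($env, $in) = @_;
    my %raw = (
        query  => defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '',
        cookie => defined $env->{HTTP_COOKIE}  ? $env->{HTTP_COOKIE}  : '',
        body   => '',
    );
    my $len = $env->{CONTENT_LENGTH};
    if (defined $len && $len =~ /\A[0-9]+\z/ && $len > 0) {
        read($in, $raw{body}, $len);   # the POST body arrives on standard input
    }
    return \%raw;
}

# Every value in %$raw was chosen by whoever sent the request.
my $raw = raw_cgi_input(\%ENV, \*STDIN);
print "Content-Type: text/plain\n\n";
print "$_ = [$raw->{$_}]\n" for sort keys %$raw;
```

Nothing in this path checks or cleans anything. The strings that land in the hash are byte for byte what the client chose to send.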

For a developer used to local programs, where input mostly comes from a file you wrote yourself, this is a complete inversion. It takes practice to internalise it.

The discipline I am trying to learn

I am trying to internalise three rules.

One: every value that came from outside the program is suspect until proven otherwise. Form fields, URL parameters, headers, even cookies you set yourself, since the client can send back anything it likes.

Two: when you pass that value to another program, do not let the shell interpret it. Pass it as a list element to a list-form invocation.

Three: when you pass that value to a filesystem call, never let it contain path separators or .. components. Validate it positively against a whitelist.

The simplicity of these rules is itself a bit of a trap. They are easy to nod along to. They are not easy to apply consistently to every line of code in a working application. Practice, including the practice of getting them wrong on my own machine, is, I think, the only way to make them habit.