robots.txt Mining Script for the Lazy

by KellyKeeton.com

Hackers are lazy.

I am; I like to have a tool to do everything for me.

How often do you troll a hacker BBS and find the post "HELP MUST GET WORKING IN WINDOWZ"?

No doubt from a script kiddy who has no idea, nor will he take the time to look up, what a compiler and make are used for.  This'll be followed up by the proverbial reply from "DarkLord" (you know, the guy with the 3000 post count) who locks the thread with a "learn to Google" reply.

Sure, there is good reason to make people get smarter and use tools, but, then again, who cares?

I think it must all be an ego thing - I was that dumb kid some years ago, asking how to get some tool to work in Windows, knowing little more than how to break it.

What I'm here to do today is help the script kiddies hack on web servers.

The world has taken me to penetration testing, using the big, cool boy tools.

Nessus is a good place to start (if you didn't know) and, yes, it runs on Windows.

However, something that always bugged me about Nessus reports was the little line "Server contains a robots.txt please examine for further detail."

I don't want to go examine it, that's why I'm using this automated tool in the first place.  I'm lazy, get on with it!

Now, a quick little history lesson.

If you didn't know, robots.txt was (and is) a file that sets rules for user agents visiting a site, specifically where not to look.  It's aimed mostly at search engines - people didn't want them indexing the entire site and spitting out content that is dynamic or, in the case of 2600 readers, content that is private, confidential, or otherwise shouldn't be on the web publicly.
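For reference, robots.txt is just plain text.  A made-up example (the host and paths here are hypothetical) looks something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /download/

User-agent: Googlebot
Crawl-delay: 10

Sitemap: http://example.com/sitemap.xml

Everything a well-behaved crawler is told to skip is listed right there in the Disallow lines, which is exactly what makes the file interesting.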

A practice that is not as prevalent as it was back in the good old days is to hide folders from Google, etc. with robots.txt.  Yes, people would stoop to such levels as that.  So first, why is this so horrible?  Sure, Google is friendly and they play by the rules.  But who is to say that the hackoogle search engine won't just pop up, say F.U. to robots.txt, start scouring the domain for anything tasty, index it, and let people search for juicy "nuggets"?

Back to the 31337 web site operators: how is this robots.txt good for them?

Well, those people that put /CVS into it might be leaving the world a free copy of their code.

My personal favorites are smaller software firms that put /download, /ftp, or /registered into the robots.txt file.

These are great places to start mining around for default pages that will let you download full copies of an application without paying for it.  Not like anyone here would do that.

The basics of looking at a robots.txt are very simple.

Browse to 2600.com/robots.txt and any web browser will pull back the TXT file.
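(If you'd rather not even open a browser, any plain HTTP fetcher will dump the same file to your terminal, for example:

wget -q -O - http://2600.com/robots.txt

or

curl -s http://2600.com/robots.txt

Nothing fancy, just the raw text.)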

Cool.

Well, again, this is nice, but you must then cut-and-paste the results into the URL bar to see the goodies, or hit the Back button, or Tab all over.  Who needs that?

I have come to the rescue of the script kiddy - I recently broke my ankle and, after getting frustrated with the motorcycle missions 40% of the way into GTA-IV, I wrote this script.

It's very simple, just putting HTML wrappers on things, but I hope to make the day much simpler for someone somewhere.

#!/bin/bash
# robotReporter.sh -- a script for creating web server robots.txt clickable
# reports
# by KellyKeeton.com (c)2008
version=.06
# Don't forget to chmod 755 robotReporter.sh or there will be no 31337 h4x0r1ng

if [ "$1" = "" ]; then # Deal with command line nulls
    echo
    echo "robotReporter$version - Robots.txt report generator"
    echo "will download and convert the robots.txt"
    echo "on a domain to an HTML clickable map."
    echo
    echo "Usage: robotReporter.sh example.com -b"
    echo
    echo "-b keep original of the downloaded robots.txt"
    echo
    exit
fi

wget -m -nd "http://$1/robots.txt" -o /dev/null # Download the robots.txt file

if [ -f robots.txt ]; then # If the file is there, do it
    if [ "$2" = "-b" ]; then # Keep a copy of the original robots.txt file
        cp robots.txt "robots_$1.html"
        mv robots.txt "robots_$1.txt"
        echo "###EOF Created on $(date +%c) with host $1" >> "robots_$1.txt"
        echo "###Created with robotReporter $version - KellyKeeton.com" >> "robots_$1.txt"
    else
        mv robots.txt "robots_$1.html"
    fi

    # HTML generation using sed
    sed -i "s/#\(.*\)/ \r\n#\1<br>/" "robots_$1.html" # parse comment lines
    sed -i "/Sitemap:/s/: \(.*\)/ <a href=\"\1\">\1<\/a><br>/" "robots_$1.html" # parse the Sitemap lines
    sed -i "/-agent:/s/$/<br>/" "robots_$1.html" # parse User-agent lines
    sed -i "/-delay:/s/$/<br>/" "robots_$1.html" # parse Crawl-delay lines
    sed -i "/llow:/s/\/\(.*\)/ <a href=\"http:\/\/$1\/\1\">\1<\/a> <br>/" "robots_$1.html" # parse all Dis/Allow lines

    echo "<br>Report ran on $(date +%c) with host <a href=\"http://$1\">$1</a><br>Created with robotReporter $version - <a href=\"http://www.kellykeeton.com\">KellyKeeton.com</a>" >> "robots_$1.html"
    echo "report written to $(pwd)/robots_$1.html"
else # wget didn't pull the file
    echo "$1 has no robots.txt to report on."
fi

Code: robotReporter.sh
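To give you an idea of what those sed lines actually do: say you ran the script against example.com (a stand-in host, just for illustration) and its robots.txt contained the line

Disallow: /download/

The report would turn it into something roughly like

Disallow: <a href="http://example.com/download/">download/</a> <br>

so every Allow and Disallow path becomes a link you can just click instead of pasting into the URL bar.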


Example Usage:

$ ./robotReporter.sh

robotReporter.06 - Robots.txt report generator
will download and convert the robots.txt
on a domain to an HTML clickable map.

Usage: robotReporter.sh example.com -b

-b keep original of the downloaded robots.txt

$ ./robotReporter.sh 2600.com -b
report written to robots_2600.com.html