Robots and Spiders

by StankDawg (

Everyone uses search engines, but have you ever wondered how they choose which pages to list and which not to list?  You've all heard stories of private pages that got listed when they weren't supposed to be.  What stops these search engines from digging into your personal information?  Well, without going into a lecture on why you should never store personal information on a publicly accessible web site, let's talk about how search engines work.

The World Wide Web was named such because of the cliché that all of its pages are linked to each other like a spider's web.  A search engine starts by looking at one page and follows all of the links on that page until it has gathered all of the information into its database.  It then follows off-site links and does the same thing at every site linked from that original site.  This is really no different from a user sitting at home surfing the web, except that it happens at an incredibly high speed, acting as an agent for the search engine.  Thanks to this automation, it can quickly build and update its database, doing the same repetitious job over and over like a robot.  In this case, that job is to build a database of web sites.  For these reasons, the actual program that does the work of crawling across the World Wide Web is called an "agent", a "spider", or, most commonly, a "robot".
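The core of any spider is just that loop: fetch a page, pull out its links, add them to the queue, repeat.  The link-extraction half can be sketched with Python's standard HTML parser (the page content here is a made-up string standing in for a freshly fetched page):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A stand-in for a page the spider just fetched:
page = ('<html><body>'
        '<a href="/about.html">About</a> '
        '<a href="http://other.example/">Off-site</a>'
        '</body></html>')

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # both the on-site and the off-site link
```

A real spider would then fetch each of those URLs in turn, which is exactly how it ends up wandering from site to site.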

"Isn't that a good thing?"  Well, it can be.  There are many good reasons for using robots.  Obviously, it is very handy to have search engines to find things in the vast online world.  It is sometimes difficult to find documents even on your own site!  Robots are not only for going out and gathering data; they can also be personalized and customized for your own site.  One site can easily grow to thousands and thousands of pages, sometimes more.  It is very difficult to find and maintain documents on a site of that size.  A robot can do that work for you.  It can report broken links and help you fill in holes or fix errors on your site.

"That's great, I want one!"  Well, before you go jumping into something, think it through.  There are also many drawbacks to using a spider.  First, you have to write the spider engine efficiently so that it does not overload your server, and make it smart enough that it does not start crawling other people's sites and overloading their servers.  If everyone had an agent out there crawling through everyone else's links, the web would grind to a halt!  The most important problem, however, is what I mentioned in the opening.  A spider will follow a link to anything it sees linked from another page.  That means if you have a link to a personal email, suddenly it isn't personal.  Your company's financial documents may be on there somewhere.  Did you have some naughty pictures that you took, where only your husband or wife knew the link?  Can you say "oops"?

This raises a big concern over privacy, and rightfully so.  Never put anything on the Internet that you don't want people to see.  That is general advice you should follow regardless of spiders.  But you may have read stories about companies whose internal records suddenly turned up floating around on the Internet.  Blame hackers?  Maybe you should blame robots, and the administrators who do not know how to control them.  All it takes is one site to start the robot, and it begins to follow whatever links it is programmed to follow.  Some employees may link to internal documents.  Some databases may allow spiders to query them.  You never know who may be linking to what, and without a well-designed web site, you may have just taken your top secret project and shared it with the world.

So you see, there are some good things and some bad things.  Luckily, there are ways that you can control robots and hopefully limit the bad things.  There is a standard exclusion file called "robots.txt": a simple ASCII text file that tells any robot visiting your site what it can and cannot access.  Here is a sample file:

# robots.txt file for
# last updated: 09/06/2003 by StankDawg
# WTF R U Doing here? R U A ROBOT?
# R U A SPIDER? R U 31337?

User-agent: *
Disallow: /incoming/
Disallow: /downloads/
Disallow: /webstat/
Disallow: /pub/

User-agent: Hackers-go-away
Disallow: /T0pS3cr3t/

User-agent: They-will-never-find-this-one
Disallow: /h1dd3n/

You will notice that there are comments (starting with the "#" sign) and two other important fields.  Proper use of these fields can limit most search engines and spiders that honor the exclusion file.

The first field is the "User-agent" string.  Every visitor to your web site, human or otherwise, is using a piece of software.  For humans, it is a web browser like Mozilla, Firefox, Konqueror, or dozens of others.  The name of this agent is sent with every page request.  If you look at the raw log files from your web server, you can see who visited your site and what agent they used.  The majority of them will be Internet Explorer, since most surfers use the Windows operating system, but you can find some interesting types of clients in your logs.  Well, since robots are programs too, they also have an agent string.  In the robots.txt file (which must reside in the root directory of your web server) you can single out any agent and block it.
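If you are curious what those agent strings look like, here is a quick sketch that pulls them out of a couple of invented log lines in Apache's "combined" log format, where the agent is the last quoted field (the addresses and requests here are made up for illustration):

```python
import re
from collections import Counter

# Invented sample lines in Apache "combined" log format;
# the final quoted field on each line is the User-agent string.
log_lines = [
    '10.0.0.1 - - [06/Sep/2003:12:00:00 -0500] "GET / HTTP/1.0" 200 512 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"',
    '10.0.0.2 - - [06/Sep/2003:12:00:01 -0500] "GET /robots.txt HTTP/1.0" 200 128 '
    '"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

agents = Counter()
for line in log_lines:
    match = re.search(r'"([^"]*)"\s*$', line)  # grab the last quoted field
    if match:
        agents[match.group(1)] += 1

for agent, count in agents.most_common():
    print(f"{count}  {agent}")
```

Notice the second "visitor" announces itself as Googlebot and asks for robots.txt first, which is exactly what a well-behaved robot does.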

The second field, "Disallow", names the actual file or directory that you do not want accessed.  Both "User-agent" and "Disallow" must be followed by a ":" and then the data that specifies what you want done.  If you want to stop the agent called "googlebot" from accessing the file called "privatestuff.html", you would code the following lines:

# This is a comment above the sample code.
User-agent: googlebot
Disallow: /privatestuff.html
Disallow: /images/mysexypics/

As you can see, the syntax is very simple.  What you need to do is think about which things you want kept hidden from which agents.  If you want to hide several different files or directories, you use multiple "Disallow" lines.  In the example above, I also block access to the entire directory called "/images/mysexypics/", which could have been very embarrassing!  Be careful, though: this only blocks one agent!  In practice, people usually do not distinguish one agent from another.  If something is to be kept hidden, it should be hidden from all agents, not just "googlebot" as in the example above.  One way of doing this is to repeat the rules under multiple "User-agent" lines, but that list is never complete; there are always new spiders coming out that will not be on it unless you constantly update it.  The better way is to simply use the wildcard "*", which tells all agents to follow the subsequent "Disallow" lines.  Along the same lines, you can tell robots to ignore your entire site by using a "Disallow" value of "/", which stops the robot from looking at anything!  (Note that you cannot use a "*" wildcard in the "Disallow" field; you must specify a path.)

# This is a global "stop all robots" example
# Note that comments can be put anywhere on a line,
# and not just above the fields.  They can come after the string.
User-agent: *     # This string stops ALL robots from going into...
Disallow: /       # ANY of the directories

An alternative to using the robots.txt file is to use a special "meta" tag in your HTML.  Some people may not be able to create a robots.txt file for one reason or another.  Instead, you can add a <meta> tag to the HTML of every page that you code.  The meta tag name is simply "robots".  This meta tag allows or disallows robots by using keywords such as "all" to let the page be included in a search engine or "none" to stop it from being added to a search engine.  There are other options as well, but these should suffice for most users.
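For example, a page that should stay out of the indexes entirely might carry a tag like this in its <head> (the page title here is just a placeholder):

```html
<head>
  <title>Private stuff</title>
  <!-- "none": do not index this page and do not follow its links -->
  <meta name="robots" content="none">
</head>
```

Unlike robots.txt, this has to be repeated on every single page you want excluded, which is why most people prefer the central file when they can use it.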

Now here is the catch...  (There is always a catch.)  The key word is "honor", which I mentioned earlier.  While most commercial search engines will currently honor your robots.txt file, there is no requirement that they do.  It is an optional standard that is not enforced by any agency.  That's right; it's on the honor system.  I am sure there will come a day when the search engine competition becomes so fierce that engines begin to index all pages, regardless of exclusion requests, to gain an "advantage" over other search engines.  Also, you have to realize that anyone can write a spider or a robot!  Since honoring your exclusion requests is optional, they may still waltz right through your site, ignoring all of your "do not enter" signs.  This is why I said earlier that you should never, ever put really personal, private, or valuable information in a publicly accessible location.  There are many better ways to keep your files safe than robots.txt anyway.
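To see what "honoring" the file actually involves: a well-behaved robot fetches robots.txt, parses it, and checks every URL against the rules before requesting it.  Python's standard library even ships a parser for the format, so the polite version can be sketched in a few lines (the rules are fed in as a string here rather than fetched over the network, just for illustration):

```python
from urllib import robotparser

# The same kind of rules a site might serve at /robots.txt:
rules = """\
User-agent: *
Disallow: /incoming/
Disallow: /downloads/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A polite spider asks before every single fetch:
print(parser.can_fetch("googlebot", "/incoming/secret.txt"))  # False: excluded
print(parser.can_fetch("googlebot", "/somepage.html"))        # True: allowed
```

Nothing forces a crawler to make that can_fetch() call, of course; a rude one simply skips it and grabs everything, which is exactly the honor-system problem.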

Finally, you should also realize that just because these files are intended for robots (or programs) to look at, that doesn't mean humans cannot look at them as well.  I have found many, many backdoors and "hidden" entrances simply by looking at a site's robots.txt file.  You have full permission to poke around my robots.txt files, and maybe you will find some interesting super secret 31337 stuff!

Further Reading:

Shoutz:  As always... my home-dawgs in the DDP, Zearle, Saitou, people who are willing to read and learn, whoever invented the new Reese's "Big Cup," and people who try to use robots.txt files as a substitute for security.