Writing Bots for Modern Websites

by Michael Morin

Writing "bots" for crawling or manipulating websites used to be as simple as requesting HTML pages from a web server and parsing the HTML.

However, modern sites (or "web applications") often require JavaScript to function.  Instead of trying to integrate JavaScript into your bot, you can use Watir (pronounced "water"), a Ruby library for controlling web browsers.

Watir is available on all major platforms and its various flavors (which include Watir, FireWatir, SafariWatir, and Watir-Webdriver) can control all the major browsers.

You'll need a working Ruby installation with a C compiler.  I recommend Ruby Version Manager (RVM) on Linux or OS X, or RubyInstaller with the DevKit on Windows.  You can then use the gem command to install a flavor of Watir.
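
For example, to grab the Watir-Webdriver flavor used in the examples below, the install is just:

gem install watir-webdriver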

Another thing you'll need is a browser with a good DOM Inspector, like Firefox 4, Firefox with Firebug, or Chrome.  "View Source" isn't going to work here.

Once you get up and running, using Watir is pretty easy.

This example program will open up Google and search for Watir:

example-1.rb:

require 'rubygems'
require 'watir-webdriver'

# Launch a Firefox window under Watir's control
b = Watir::Browser.new :firefox
b.goto 'google.com'

# Fill in the search box (name="q") and click the search button
b.text_field(:name,'q').set 'Watir'
b.button(:name,'btnG').click

That's not too exciting though.  Let's open up Digg.com (which, like it or not, uses a lot of JavaScript), log in, go to the top news stories, and Digg the top one.

example-2.rb:

require 'rubygems'
require 'watir-webdriver'

b = Watir::Browser.new :firefox
b.goto 'digg.com'

# Open the login form (it's loaded via Ajax, so wait for it to show up)
b.link(:text,'Login').click
sleep 1 until b.text.include? 'Login to Digg'
b.text_field(:name,'ident').set 'your username'
b.text_field(:name,'password').set 'your password'
b.button(:text,'Login').click
sleep 1 # May need kajiggering

# Go to the top news stories and Digg the first one
b.link(:text,'Top News').click
b.divs(:class,'story-item').first.link(:text,'digg').click

You can see here why this can be so tricky.

When you go to Digg and click "Login", you get a new login form in the middle of the page that wasn't part of the original HTML returned by the first HTTP request.  This is referred to as "Ajax."

The server is returning new bits of HTML and the page is inserting them into the DOM tree.  This is what makes writing bots without JavaScript so hard these days.

You can also see some challenges in writing bots with Watir.

It just takes some kajiggering, like sleeping at certain points and waiting for some text to appear on the page.  Trial-and-error is in order here and you'll get a feel for when waiting is needed.  Each site acts differently, and sometimes you just have to try putting in different wait times and looking for different text to show up in the body.
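
If you find yourself sprinkling that logic everywhere, pull it into a little helper.  This is just a sketch (wait_for_text and its timeout are my own invention, not part of Watir):

# Wait until some text shows up in the page, or give up after a timeout.
def wait_for_text(browser, text, timeout = 30)
  timeout.times do
    return true if browser.text.include? text
    sleep 1
  end
  false
end

wait_for_text(b, 'Login to Digg') or abort 'Login form never showed up'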

With just this short intro, you should be well on your way to creating your own bots.  Aside from the kajiggering, it's easy to do.  Using Watir for bots will work on any site, no matter what obfuscation and countermeasures they use.  If you can go there and click on these things yourself, there's very little they can do to stop these bots.

Here are some other things to think about.

Bots often try to hide themselves by passing realistic user agents and other headers, but they can be found by examining server logs.  It's pretty suspicious if all one user does is log in, go to the top news, and immediately vote the top link up.

You can hide a bot by having it act more like a human.  Wait random times to simulate reading, click on other links (that it makes sense to click on), wait some more, then perform the task needed.  That would be extremely difficult to detect.
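
A rough sketch of the idea, again with Digg (the link text and timings are just illustrative guesses):

# "Read" the front page for a while
sleep 20 + rand(40)

# Wander through a link or two a real user might click
b.link(:text,'Upcoming').click
sleep 10 + rand(20)

b.link(:text,'Top News').click
sleep 10 + rand(20)

# Now do the actual task
b.divs(:class,'story-item').first.link(:text,'digg').click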

This still doesn't get around CAPTCHAs (those annoying scrambled letters).  However, those usually only appear on registration forms.

Depending on the site, this may or may not be a problem.  There are also some libraries around that can read these.  However, they're usually purpose-built for certain sites and won't work on the really good ones anyway.

By itself, Watir won't work with other technologies embedded in the page such as Flash, Java or Silverlight.  There are some projects such as flashwatir to solve this, but support is pretty thin.  They may or may not work for you.

You can get and store the entire text of a web page in its current state by using the "text" method.  This can be used to store entire pages for mirroring purposes, or be parsed more carefully with libraries such as Nokogiri.
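
For example, something like this snapshots the rendered page and hands the current DOM to Nokogiri (I'm using the html method here since Nokogiri wants markup rather than plain text, and the CSS selector is just a guess at the story markup):

require 'nokogiri'

# Save the visible text of the page as it stands right now
File.open('snapshot.txt', 'w') { |f| f.write b.text }

# Parse the current DOM and pull out whatever you're after
doc = Nokogiri::HTML(b.html)
doc.css('h3 a').each { |a| puts a.text }  # selector is just a guess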

Here are some ideas of what you can do with this.

Make smart bookmarks:  I've often tried to bookmark things, but because they use JavaScript, POST requests, and other un-bookmarkable things, a normal bookmark won't work.  You can use Watir to open up the page for you, though.
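
A "smart bookmark" is then just a tiny script that clicks its way there for you (the site and link text below are made up):

# my-bookmark.rb - open the browser and click through to the page I actually want
require 'rubygems'
require 'watir-webdriver'

b = Watir::Browser.new :firefox
b.goto 'example.com'
b.link(:text,'Members Area').click     # made-up link text
b.link(:text,'Saved Searches').click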

Provide your own API for a site:  Many sites provide an API for you to use, but you won't need one.  You can use the site directly.  Wrap this up in your own API and it'll be even easier to write your own bots for the site.
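
Sticking with Digg, a home-grown "API" can be nothing more than a class wrapping the clicks from example-2.rb (the class and method names here are my own invention):

require 'rubygems'
require 'watir-webdriver'

# A thin wrapper around the Digg actions from example-2.rb
class DiggBot
  def initialize(user, pass)
    @b = Watir::Browser.new :firefox
    @b.goto 'digg.com'
    @b.link(:text,'Login').click
    sleep 1 until @b.text.include? 'Login to Digg'
    @b.text_field(:name,'ident').set user
    @b.text_field(:name,'password').set pass
    @b.button(:text,'Login').click
    sleep 1
  end

  def digg_top_story
    @b.link(:text,'Top News').click
    @b.divs(:class,'story-item').first.link(:text,'digg').click
  end
end

bot = DiggBot.new('your username', 'your password')
bot.digg_top_story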

Automate common tasks:  Continuing with the Digg example, what if you wanted to automatically Digg any story with the word "Ruby" in the title?  Set this to loop and watch new stories, and it'll spread the Ruby love without you lifting a finger.
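
A rough sketch of that loop, reusing the markup guesses from example-2.rb (and matching "Ruby" anywhere in the story text rather than strictly the title):

# Watch Top News and digg anything mentioning Ruby.
seen = []

loop do
  b.link(:text,'Top News').click
  sleep 5   # let the page settle

  b.divs(:class,'story-item').each do |story|
    title = story.text
    next unless title.include? 'Ruby'
    next if seen.include? title
    story.link(:text,'digg').click
    seen << title
  end

  sleep 300   # check back every five minutes
end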

Mischief:  We've been dancing around this subject for the entire article.  I'll leave it up to you as to just how mischievous you want to be, but the possibilities are endless.  Though if you're up to something really mischievous, maybe you should throw Tor into the mix!
