Peeling Grapes

by Bryan Elliot

There are many reasons to want to map the archives of a website.  Most of them involve instant and offline access to cool stuff with no advertisements.

The important thing to remember here is that you want to peel the site, not rip it.  The distinction is simple: peel the website and other people can still use it while you work, and you usually don't end up giving its ISP a coronary.  Rip the website and you've cost the makers of stuff you like a good deal.  You may also have cost them ad views; when you're using all your bandwidth to tear at theirs, you may keep others out.

So, as a precaution, remember to keep the bandwidth controls in your software turned on.  I mean, you don't want your favorite public domain MP3 site going down when you suddenly pull ten gigs (a lot of money in bandwidth terms) worth of stuff in a little over a day, right?
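
If your tool has no bandwidth control, rolling one is not hard.  Here is a minimal sketch in Python (the same idea ports straight to PHP).  The Throttle class, its method names, and the byte budget are all made up for illustration; this is one simple way to cap average throughput, not the only one.

```python
import time


class Throttle:
    """Caps average throughput by sleeping whenever we get ahead of the budget."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be exercised without waiting.
        self.max_bytes_per_sec = max_bytes_per_sec
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.total = 0

    def consume(self, nbytes):
        """Record nbytes transferred; return how long we slept (0.0 if not at all)."""
        self.total += nbytes
        # Earliest moment at which this many bytes fits inside the budget.
        due = self.start + self.total / self.max_bytes_per_sec
        delay = due - self.clock()
        if delay > 0:
            self.sleep(delay)
        return max(delay, 0.0)
```

Call consume(len(chunk)) after every chunk you read from the server, and the download can never average more than the rate you picked.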

Watch Your Language

I've been criticized for loving PHP.

People tell me it's not a real language, it's for pussies, and such.  All I have to say to them is, piss off.

PHP is well designed for what it is: a brilliantly souped-up data processing language.  It's got simple interfaces for network connectivity, file access, Win32 API functionality, the wonderful PCRE libraries, and it makes quick and dirty development a joy.

If you think I'm a puss for that, then I can only say "Mee-oww, baby."

Say, for example, you're a comic connoisseur.

Megatokyo, an excellent webcomic, numbers its comics serially, from zero to whatever comic is currently on the home page.  That's simple to write code for.

The pseudocode goes something like this:
Open www.megatokyo.com, port 80, send "GET / HTTP/1.0<CRLF><CRLF>" (standard dumb browser request)

Parse out today's comic image name

Figure out how many we must get to be up-to-date from previous attempts

For last_saved+1 to current:
  Open connection
  Send HTTP header
  Check response for error
  If response = 200, save the image

See?  Easy.
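
For the curious, the steps above can be sketched in Python roughly like this (PHP would look much the same).  Big caveat: BASE_URL, the /strips/ path, the four-digit .gif filenames, and the regex are assumptions for illustration, not Megatokyo's confirmed layout, so look at the real page source before trusting any of it.

```python
import re
import urllib.error
import urllib.request

# Assumed layout, for illustration only; the real paths and markup may differ.
BASE_URL = "http://www.megatokyo.com"
STRIP_RE = re.compile(r'src="[^"]*strips/(\d+)\.(?:gif|jpg|png)"', re.IGNORECASE)


def current_strip_number(html):
    """Pull the number of today's comic out of the home page HTML."""
    match = STRIP_RE.search(html)
    if match is None:
        raise ValueError("no strip image found in page")
    return int(match.group(1))


def strips_to_fetch(last_saved, current):
    """Numbers we are still missing: last_saved+1 through current."""
    return range(last_saved + 1, current + 1)


def http_get(url):
    """Return (status, body); HTTP errors come back as a status instead of raising."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def save_to_disk(number, body):
    with open(f"{number:04d}.gif", "wb") as handle:
        handle.write(body)


def peel(last_saved, current, fetch=http_get, save=save_to_disk):
    """Fetch each missing strip, one request at a time; save only the 200s."""
    saved = []
    for number in strips_to_fetch(last_saved, current):
        status, body = fetch(f"{BASE_URL}/strips/{number:04d}.gif")
        if status == 200:
            save(number, body)
            saved.append(number)
    return saved
```

The fetch and save arguments are there so you can dry-run the loop against fake functions before pointing it at a live server.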

Why's This Grape Shaped Like a Stapler?

Well, it's not always easy.

See, Megatokyo is a bit of an exception in comic bookkeeping.  Penny Arcade, for example, works on a date and scripting system.  What method are we to use to get around this?

Quite literally a different method indeed.

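We still count through all the possible dates, but instead of firing a GET at each one, we probe with the HEAD HTTP method first.  HEAD returns only the status line and headers, so a miss costs next to nothing.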

For example, a good "idiot light" for a webserver is to Telnet to port 80, type in HEAD / HTTP/1.0, and hit Enter twice.  If you get 200, you're O.K.
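
The same idiot light can be wired up in a few lines of Python with the standard http.client module.  This is only a sketch: head_status and its connection_factory parameter are names invented here, and note that http.client speaks HTTP/1.1 rather than the HTTP/1.0 you would type into Telnet.

```python
import http.client


def head_status(host, path="/", port=80, timeout=10,
                connection_factory=http.client.HTTPConnection):
    """Send a HEAD request and return the numeric status code; no body comes back."""
    conn = connection_factory(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Something like head_status("www.example.com") == 200 then tells you the server is up and the page exists.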

So, the new pseudocode is:
Get today's date.

Store November 18, 1998 somewhere.  Since this is Penny Arcade, your ass would be an appropriate spot.

Check to see if we have already got some Penny Arcade.  

If so, get the most recent date we've downloaded, add one, and replace Nov 18, 1998 with it.

For last_date to today:
  Send HEAD request (keep connection alive; might as well with all we'll be doing here)
  If response is 200, send an equivalent GET request
  Save the image

Right.  Just so's you know, it's going to be a little different each time you do it.  I'm just trying to teach you the necessary skills for website peeling.
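
Here is roughly what the date-walking version looks like in Python.  Treat it as a sketch under assumptions: the comic_path URL scheme is invented for illustration (check how the real site names its files), and head, get, and save are stand-ins for whatever request and storage functions you write.

```python
from datetime import date, timedelta

FIRST_COMIC = date(1998, 11, 18)  # the start date given above


def dates_to_check(last_saved, today):
    """Yield each day after last_saved (or from FIRST_COMIC if None) through today."""
    day = last_saved + timedelta(days=1) if last_saved else FIRST_COMIC
    while day <= today:
        yield day
        day += timedelta(days=1)


def comic_path(day):
    # Hypothetical URL scheme, not the site's confirmed layout.
    return f"/images/{day:%Y}/{day:%Y%m%d}.gif"


def peel_by_date(last_saved, today, head, get, save):
    """HEAD every candidate date; GET and save only the ones that answer 200."""
    saved = []
    for day in dates_to_check(last_saved, today):
        path = comic_path(day)
        if head(path) == 200:
            save(day, get(path))
            saved.append(day)
    return saved
```

Most dates have no comic, so nearly every request in this loop is a cheap HEAD that comes back as a miss, and the expensive GET only happens when there is something to save.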

New Tasks, Closing Arguments

Now, sometimes you'll have to have your program selectively pick images from a webpage, choosing content, but avoiding stupid things, like adverts and buttons.  This is where PCRE matching comes in.
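
As a generic illustration, here is a small Python sketch using the re module, whose syntax is close enough to PCRE for this job.  The two patterns and the list of things to skip (ads, banners, buttons, spacers) are guesses at what junk usually looks like; every site needs its own tuning.

```python
import re

# Grab the src attribute of every <img> tag, whichever quote style it uses.
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
# Hypothetical markers of images we do not want; adjust per site.
SKIP_RE = re.compile(r'(?:/ads?/|banner|button|spacer)', re.IGNORECASE)


def content_images(html):
    """Return image URLs from the page, minus anything that looks like an advert or button."""
    return [src for src in IMG_SRC_RE.findall(html) if not SKIP_RE.search(src)]
```

In PHP the equivalent would be a preg_match_all() call with much the same patterns.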

For example, Page3.com, softcore porn that it is, is a fun site to try peeling.  Twenty-some girls, an average of 60-some pics of each girl.  And, being the manly hacker-type you are, you must have every image.  All of 'em.

So?  As said, you can make use of PCRE, or Perl Compatible Regular Expressions.  In PHP, it's built in; in C/C++, there are libraries and DLLs for you to use; and in Perl... well, they're called Perl compatible for a reason, ya?  Use whatever you prefer.

I was going to post up the code for this process, but quite frankly, I'm at work, and pulling up softcore porn, while fun to do at home, is not the smartest thing to risk having your coworkers see.  As such, I'll let you do the research and exercise yourselves.

I'll leave you with links to the relevant documentation.

www.php.net - PHP: a nice handy language for the starting programmer.

www.cs.virginia.edu/~lcc-win32 - A lovely ANSI C Compiler for Windows programming.

www.pcre.org - PCRE: the DLLs and documentation, and everything you need to know about PCRE.  You must welcome the headache.

Just a quick note on PHP: If you want to try it out, get the multibyte package.

You can't play with all the cool functions without it.  Additionally, an easy way to find stuff in the docs is to simply put your search term after the first slash of the www.php.net address.  I'm serious here.

For example, www.php.net/preg_match will get you the docs for the preg_match() function.

Just remember to keep it down to one connection at a time, please.
