A Method to Spider Sites Like Indeed.com with Teleport Pro

by "ain'tDigitalDATTruth"

When I tried to spider Indeed.com with standard Teleport Pro settings, the project failed to retrieve anything more than the index file.  I figured that Indeed.com was intentionally trying to prevent individuals like myself from spidering its website in order to mine company data.

I prefer Teleport Pro over its only open-source equivalent, HTTrack, because of its ability to analyze forms and because downloading content via threads in HTTrack is rather slow.  I often prefer open-source solutions, and this article shows how Linux frequently supports my projects even though I tend to favor Windows-centric tools for work involving data.  A notable exception is the software package RapidMiner, which combines commercial and open-source elements.

When I examined the downloaded file, I spotted the problem after some careful evaluation.  The links consisted only of GET variables, without the domain name to which they should be applied.

So, instead of requesting http://www.indeed.com/index.php&q="example" (constructed purely for illustration; this is not actually a valid URL), Teleport Pro was trying to retrieve &q="example", which makes no sense by itself.
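To make the problem concrete, here is a minimal Python sketch of how a query-only link is normally combined with the address of the page it came from.  It is only an illustration: the base URL reuses the made-up index.php address from above, the helper name resolve_links is hypothetical, and I am assuming the links take the standard "?q=..." form rather than the "&q=..." shorthand used in the example.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    # Illustrative base URL only; not a real Indeed.com endpoint.
    page = "http://www.indeed.com/index.php"
    for url in resolve_links(page, ["?q=example"]):
        print(url)
```

A spider that skips this step is left holding only the query string, which is the behavior I was seeing.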

The probable solution came to mind almost immediately: giving Teleport Pro some means of associating a domain name with its URL requests should resolve the issue.  I was familiar with this sort of technique through "port bouncing."

I had used a Windows "port bouncer" once a long time ago (which will remain nameless), but I needed something modern.  I found one designed for Linux called Barefoot.  I tried to get it to work under Cygwin, but Cygwin lacks some header files that a full Linux distribution would have.  Making it behave as it does natively on a real Linux platform would require someone to code a solution customized for Cygwin, and no such solution has made it into the standard Cygwin distribution yet.

So, I did the next best thing, since my intention was to use Teleport Pro under Windows: I ran the port bouncer in a concurrent, virtualized Linux session and used it directly from Teleport Pro.

Once everything was set up correctly, spidering Indeed.com worked like a charm.  The one snag was VMware, which confuses the concept of localhost as it would function in a non-virtualized session.  I resolved that by specifying the real IP address of the VMware-generated virtual Ethernet connection, as listed under ifconfig.
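For readers who have never seen a port bouncer, the idea can be sketched in a few lines of Python.  This is not Barefoot's code, just a minimal TCP relay under my own assumptions: the function name start_bouncer and the port number in the usage note are hypothetical, and the relay passes bytes through unchanged in both directions.

```python
import socket
import threading


def _pump(src, dst):
    """Copy bytes from src to dst until src closes, then end dst's write side."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def _handle(client, target):
    """Relay one client connection to the target in both directions."""
    try:
        upstream = socket.create_connection(target)
    except OSError:
        client.close()
        return
    back = threading.Thread(target=_pump, args=(upstream, client), daemon=True)
    back.start()
    _pump(client, upstream)
    back.join()
    client.close()
    upstream.close()


def start_bouncer(listen_addr, target):
    """Listen on listen_addr and relay every connection to target.

    Returns the listening socket; close it to stop accepting connections.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(listen_addr)
    server.listen()

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                break  # the listening socket was closed
            threading.Thread(
                target=_handle, args=(client, target), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

Inside the virtual machine, something like start_bouncer(("0.0.0.0", 8080), ("www.indeed.com", 80)) would accept connections on every interface, so the Windows host can reach the relay at the address shown by ifconfig rather than at localhost.  The call returns immediately, so a standalone script has to keep the main thread alive for as long as the relay should run.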

But not everything was solved.  Indeed.com apparently has good automated firewall rules in place, since the spidering session for my first query lasted only about five minutes before Teleport Pro's retrieval threads stalled.  Issuing a different query allowed further transfer from Indeed.com, but the same stalling problem prevented complete retrieval of the site.

Regardless, I am sure this obscure security implementation is used on other sites, and it is, by itself, no reason to refrain from spidering such a site.
