Web Scraping Scripts

by Patrick Hemmen

Hello from Germany!

I read the article "Scrape Textbooks, Save Money" by thOtnet in the Autumn 2017 issue and was impressed by the creative solution to the problem.  I had a similar issue with documentation from a training course run by a big network equipment company.  They provided a lot of documentation during the training, but you can only read it with their special software.  With this software, you can't copy anything from the document to your clipboard.  They added the ability to print pages, but only a certain number of them.  Once you hit the maximum number of pages allowed to print, all you can do is take screenshots.  I used the script from the above-mentioned article as the basis for my own script to easily copy the interesting pages.  This is the kind of article I love to read in 2600.  The other type is about the details of infrastructure in other countries (e.g., the telephone network, the Internet, or anything else) - thank you, "Telecom Informer!"

In Germany, we have a lot of public libraries run by universities or local authorities where you can borrow books or magazines for free or for a very small yearly fee of around ten euros.  It's also possible to get a book from another library if your local one doesn't have it: the other library will send the book to your local library for a small fee of two euros.  I use these libraries a lot to get the newest novels and magazines and save some money.

Some years ago, the local public libraries introduced the ability to borrow digital media like ebooks, magazines, and audio books.  Not every small local library can operate such an expensive digital media service by itself, so many of them get together and build a shared digital library.

Two main digital libraries are in use in my state of Germany (Lower Saxony): Lies-e and Onleihe (a combination of Online and Leihe - to lend).  My local library is part of the Onleihe, which they named NBib24 (Niedersächsische Bibliotheken 24 Stunden online - Lower Saxony Libraries online 24 hours).  The digital library is a service made by divibib GmbH.  Unfortunately, the whole system has a lot of bugs, and they use some kind of DRM to prevent easy sharing and to enforce the lending period.

The DRM comes from Adobe, and I have to use Adobe Digital Editions to download the digital media.  Also, the number of available copies of each item is limited.  Sometimes you have to wait days or even weeks for popular new books or magazines before you can borrow them.  Magazines can be borrowed for one day, and usually between one and five copies of a magazine are available at the same time.  To get a magazine as soon as possible, it's a good idea to borrow it quickly after it appears in the online database of the digital library.  It's a boring task to check every day or even every hour for a new issue of your favorite magazine.  For this reason, I have created a small shell script which searches the online database of the digital library for the magazine and sorts the results by newest arrival.  If a new issue is available, it sends a push notification to my smartphone and I can borrow it.

#!/bin/bash
NAME="AD"
CHECKFILE="/mnt/nbib24_ad_temp.html"
NEWFILE="/tmp/nbib24_ad_new.html"
DIFFFILE="/mnt/diff_ad.txt"

# A quoted command line doesn't survive word splitting inside a plain
# variable, so the curl call lives in a function instead.
fetch_list() {
  curl 'http://www1.onleihe.de/nbib24/frontend/simpleMediaList,0-0-0-109-0-0-0-2004-0-362651610-0.html#titlelist' \
    -s --compressed \
    -H 'Host: www1.onleihe.de' \
    -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:59.0) Gecko/20100101 Firefox/59.0' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
    -H 'Accept-Language: de,en-US;q=0.7,en;q=0.3' \
    -H 'Referer: http://www1.onleihe.de/nbib24/frontend/simpleMediaList,0-0-0-109-0-0-0-0-0-400750299-0.html' \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Connection: keep-alive' \
    -H 'Upgrade-Insecure-Requests: 1' \
    --data 'SK=2004'
}

# If $CHECKFILE is available, download the current list and check it
# against the checkfile
if [ -f "$CHECKFILE" ]; then
  fetch_list | grep '>Titel:' > "$NEWFILE"
  diff "$CHECKFILE" "$NEWFILE" > "$DIFFFILE"

  if ! diff -q "$CHECKFILE" "$NEWFILE" > /dev/null; then
    /usr/local/bin/push.sh "Nbib24 ${NAME} Match"
    cp "$NEWFILE" "$CHECKFILE"
  fi

# otherwise download the list for the first time and send a push
else
  fetch_list | grep '>Titel:' > "$CHECKFILE"
  /usr/local/bin/push.sh "Nbib24 ${NAME} Match"
fi

# Delete temp file
if [ -f "$NEWFILE" ]; then
  rm "$NEWFILE"
fi
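
The push.sh helper that sends the actual notification isn't printed here.  A minimal version using the Pushover API could look like this (the application token and user key are placeholders for your own Pushover credentials):

#!/bin/bash
# Minimal push.sh sketch - sends $1 as a Pushover notification.
# APP_TOKEN and USER_KEY are placeholders; use the values from your
# own Pushover account.
APP_TOKEN="your-application-token"
USER_KEY="your-user-key"

curl -s \
  --form-string "token=${APP_TOKEN}" \
  --form-string "user=${USER_KEY}" \
  --form-string "message=${1:-ping}" \
  https://api.pushover.net/1/messages.json > /dev/null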

Every Friday, the European Commission releases the latest product warnings as the Rapid Alert System for dangerous non-food products (ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/repository/content/pages/rapex/index_en.htm).  I was on their mailing list and looked at the pictures on the website for products I had bought.  It takes some time to scroll through the whole web page on my smartphone.  To make my life easier, I created a web scraping script which downloads the website and extracts the URLs of the pictures.  These URLs are then sent to me as an HTML email.  With this email, I can quickly check the products and, if I own one of them, look up the details of the warning.

#!/bin/bash
OVERVIEW_URL="https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=main.weeklyReports.XML"
TEMP_OUT=/tmp/temp
TEMP_MAIL=/tmp/temp_mail

# Download the overview XML
curl -s -o "$TEMP_OUT" "$OVERVIEW_URL"

# grep the URL of the newest weekly report
CURRENT_URL=$(grep -A 1 '<URL>' "$TEMP_OUT" | head -2 | tail -1)

# Download the newest report
curl -s -o "$TEMP_OUT" "$CURRENT_URL"

# grep the picture URLs and wrap each one in an <img> tag
IMG_URLS=$(grep -A 1 '<picture>' "$TEMP_OUT" | awk -F '[' '{ print $3 }' | sed 's/.\{3\}$//' | sed '/^$/d' | sed 's/\(.*\)/<img src="\1" alt="\1"\/>/')

# generate the HTML template
cat > "$TEMP_MAIL" <<EOL
<!DOCTYPE html>
<html>
<head>
<title>European Commission - Rapid Alert System - Weekly Reports</title>
</head>
<body>
$IMG_URLS
</body>
</html>
EOL

# send the email
mail -a "Content-type: text/html;" -s "European Commission - Rapid Alert System - Weekly Reports" my@email.com < "$TEMP_MAIL"

# clean up
rm "$TEMP_OUT" "$TEMP_MAIL"
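
Reading the extraction pipeline backwards shows what it expects from the report XML: each picture URL wrapped in a CDATA section on the line after the <picture> tag, roughly like this (reconstructed from the parsing commands, with a made-up example URL - not verified against the live feed):

<picture>
<![CDATA[http://example.invalid/product-photo.jpg]]>

grep -A 1 emits each <picture> line plus the line that follows it.  Splitting on '[' then makes the URL (plus the trailing ]]>) the third field, the first sed chops off those three trailing characters, the second sed drops the empty lines left over from the tag and separator lines, and the final sed wraps each bare URL in an <img> tag for the HTML email.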

A smaller web scraping script records an Internet radio show from a public station in Germany late at night.  I can then listen to it the next morning during my commute to work.

#!/bin/bash
# -O (capital o) writes the stream to a file; lowercase -o would only set wget's log file
wget -O /mnt/ndr.mp3 http://ndr-ndr2-niedersachsen.cast.addradio.de/ndr/ndr2/niedersachsen/mp3/128/stream.mp3 &
PID=$!
echo "$PID"
# how many seconds to record the stream (45 minutes)
sleep 2700
kill "$PID"

All these scripts are written in Bash and use standard UNIX tools.  They are quick and dirty, and each usually takes me 15 to 60 minutes to write.  They all run on my Raspberry Pi 2 via cron.  For push notifications, I use the great service from Pushover.  As a starting point, I often open the web page with the Firefox Developer Tools.  There, I use the selector feature to see the HTML code for a specific part of the website, and "Copy as cURL" in the network analysis tab to get the matching curl command.
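
The cron schedule itself isn't shown, but a crontab along these lines would run the scripts (the times and the /home/pi/bin path are illustrative; the script names match the code listings below):

# m  h  dom mon dow  command
0    *  *   *   *    /home/pi/bin/magazine.sh             # check hourly for a new issue
30   8  *   *   5    /home/pi/bin/rapid-alert-reports.sh  # Friday morning, after the weekly report
15   3  *   *   *    /home/pi/bin/german-radio.sh         # record the late-night show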

Code: magazine.sh

Code: rapid-alert-reports.sh

Code: german-radio.sh
