Rewriting History

by Steffen Fritz  (sfnfrz2600@gmail.com)

0x0: Web Archiving

With the growth of the World Wide Web and its increasing cultural and political influence, archiving web-published content has become an important part of preserving cultural heritage.

Public institutions like the Library of Congress (LoC) in the United States [1] or the Bibliothèque nationale de France (BnF) [2] and non-profit organizations like the Internet Archive (IA) [3] are doing a great job at this.

While the LoC and the BnF don't crawl the whole web - they curate, collect, and preserve topic-, event-, or domain-specific content - the IA takes it all, automatically.  At least it tries.  Other services like Archive.today or Webrecorder allow users to manually mirror web pages and see the results right away.

Whoever does the preserving can choose among three archiving methods: transactional archiving, database archiving, and remote harvesting.

The most common one is the last, and the idea is fairly simple: copy a web page, search its source code for URLs, copy the referenced resources, and repeat recursively until you hit a termination condition, e.g., no new web resources are found or the crawl would leave the domain.
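To make the loop concrete, here is a minimal sketch in Python.  It is not how Heritrix or HTTrack work internally.  The names harvest, LinkExtractor, and max_pages are made up for this illustration, and fetch is a hypothetical function you supply yourself that takes a URL and returns the HTML text (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the values of all href and src attributes in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(start_url, fetch, max_pages=100):
    """Crawl from start_url without leaving its host.

    fetch is a hypothetical callable: URL in, HTML text (or None) out.
    Returns a dict that maps every harvested URL to its content.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        archive[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, link)).url
            # Termination: ignore URLs already seen or on another host.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive
```

The crawl stops when the queue runs empty (no new resources on the same host) or when max_pages is reached.  A real crawler would also respect robots.txt, rate limits, and non-HTML content types.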

A program doing this is called a web crawler.  Popular tools are Heritrix [4] and HTTrack [5].

HTTrack saves files as a web server delivers them, e.g., image.jpg as image.jpg and index.html as index.html.

Heritrix creates web archives according to the WARC file format, which is the de facto standard for web archives.

0x1: WARC Format

The WARC file format defines how to store payload content, control information, and arbitrary metadata as blocks together in one file.

Control information like DNS and HTTP requests and responses makes the crawl comprehensible.  Hash sums, dates, and file sizes describe the digital objects.  Each WARC record in a WARC file begins with the line "WARC/1.0", followed by a record header that describes the type and content of the record.

The header is followed by a blank line, the content, and two newlines.
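The layout is easy to reproduce.  The following Python sketch serializes a single record under that reading of the format.  The function name build_record and its parameters are my own choice for illustration, and only a handful of header fields are written, so consult the specification before relying on it.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_record(record_type, content_type, block, target_uri=None, date=None):
    """Serialize one WARC/1.0 record as bytes.

    block is the record content as bytes; date should be a
    timezone-aware datetime and defaults to the current time.
    """
    date = date or datetime.now(timezone.utc)
    stamp = date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Digests are written as "sha1:" plus the Base32-encoded SHA-1 hash.
    digest = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", stamp),
        ("WARC-Block-Digest", f"sha1:{digest}"),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    if target_uri:
        fields.insert(1, ("WARC-Target-URI", target_uri))
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields)
    # A blank line ends the header; two CR+LF pairs end the record.
    return header.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"
```

Note that Content-Length counts the bytes of the content only, not the header and not the two trailing newlines.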

You can create a WARC file with GNU Wget version 1.14 and greater.

Just add the switch: --warc-file=outputfile

$ wget --warc-file=2600 http://2600.com
Opening WARC file '2600.warc.gz'.

--2023-12-16 22:07:02--  http://2600.com/
Resolving 2600.com (2600.com)... 166.84.5.162
Connecting to 2600.com (2600.com)|166.84.5.162|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://2600.com/ [following]

     0K                                                       100% 16.1M=0s

--2023-12-16 22:07:03--  https://2600.com/
Connecting to 2600.com (2600.com)|166.84.5.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html'

     0K .......... .......... .......... .......... .......... 92.1K
    50K .......... .......... .......... .......... ..........  173K
   100K .......... .......... .......... .......... ..........  227K
   150K .......... .......... .......... .......... ..........  131K
   200K .......... .......... .......... .......... ..........  206K
   250K .......... .......... .......... .......... ..........  205K
   300K .......... .......... .......... .......... ..........  209K
   350K .......... .......... .......... .......... ..........  255K
   400K .......... .......... .......... ......                 222K=2.5s

2023-12-16 22:07:06 (173 KB/s) - 'index.html' saved [446923]

$ ls -l 2600.warc.gz
-rw-r--r-- 1 root root 305903 Dec 16 22:07 2600.warc.gz
$ gunzip 2600.warc.gz
$ ls -l 2600.warc
-rw-r--r-- 1 root root 453565 Dec 16 22:07 2600.warc

Wget creates a 2600.warc.gz file.

Unzip (gunzip) it and open the WARC file with an editor like Vim or Emacs.

The first block in the file describes the WARC file itself.  The following blocks are related to network traffic and payload.  The header fields in each block are simple "name: value" pairs, each terminated with CR+LF.  An important field is "WARC-Target-URI".  It is identical to the source URI and therefore also determines the file name of the payload.
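If you would rather script the inspection than scroll through an editor, a few lines of Python are enough.  This is a simplified sketch: read_records and read_warc are names I made up, repeated header fields overwrite each other in the dictionary, and no error handling is done.  Python's gzip module reads the .warc.gz directly, so gunzip is optional here.

```python
import gzip


def read_records(stream):
    """Yield (version, fields, block) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate two records
        version = line.strip().decode("utf-8")  # e.g. "WARC/1.0"
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the record header
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # Content-Length tells us how many bytes of content follow.
        block = stream.read(int(fields["Content-Length"]))
        yield version, fields, block


def read_warc(path):
    """Read a .warc or .warc.gz file and return its records as a list."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return list(read_records(stream))
```

Something like `for _, fields, _ in read_warc("2600.warc.gz"): print(fields["WARC-Type"], fields.get("WARC-Target-URI"))` then gives a quick overview of what was recorded.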

Let's have a look at an example.  Some lines are omitted.  All blocks are from the same file.

We investigate three blocks:

WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2023-12-17T04:07:02Z
WARC-Record-ID: <urn:uuid:41865a03-e428-4486-b8c6-4692879cbd03>
WARC-Filename: 2600.warc.gz
WARC-Block-Digest: sha1:OYTLHM5FGHNW3P4HZVYT5WFHGSTUGDTC
Content-Length: 218

software: Wget/1.21.4 (linux-gnu)
format: WARC File Format 1.0

The above is the first block in our WARC file.

It is an info block and contains "warc-fields".  The content, i.e., the lines following the header (only two of them are shown here), has a length of 218 bytes.  The second block is a request block in which the network communication for a single request is logged.

WARC/1.0
WARC-Type: request
WARC-Target-URI: <http://2600.com/>
Content-Type: application/http;msgtype=request
WARC-Date: 2023-12-17T04:07:02Z
WARC-Record-ID: <urn:uuid:427eb9af-9ca9-412b-9054-90c8cc20fc29>
WARC-IP-Address: 166.84.5.162
WARC-Warcinfo-ID: <urn:uuid:41865a03-e428-4486-b8c6-4692879cbd03>
WARC-Block-Digest: sha1:K6YKOA2QYQMHTPX7PAFMHGJSOIT7O7FB
Content-Length: 123

The third block contains the response from the server.  After the WARC fields and the metadata, you can see the HTML payload.

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:ea9ad78c-168f-4df5-89a1-ef86277479a8>
WARC-Warcinfo-ID: <urn:uuid:41865a03-e428-4486-b8c6-4692879cbd03>
WARC-Concurrent-To: <urn:uuid:427eb9af-9ca9-412b-9054-90c8cc20fc29>
WARC-Target-URI: <http://2600.com/>
WARC-Date: 2023-12-17T04:07:02Z
WARC-IP-Address: 166.84.5.162
WARC-Block-Digest: sha1:4MH5SAHSMIWNE3P2ZMLLIJNQWFAXPTDX
WARC-Payload-Digest: sha1:3QMWS3DYDOGG2KNJUSF5PYMIJTPPQQD6
Content-Type: application/http;msgtype=response
Content-Length: 390

HTTP/1.1 302 Found
Date: Sun, 17 Dec 2023 04:07:02 GMT
Server: Apache
Location: https://2600.com/
Content-Length: 201
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://2600.com/">here</a>.</p>
</body></html>
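The digest fields in such a record can be recomputed.  The sketch below assumes, as I read the format, that WARC-Block-Digest covers the whole content (HTTP headers plus body) and that WARC-Payload-Digest covers only the HTTP body, both as Base32-encoded SHA-1.  The function names are mine, and fields is expected to be a plain dictionary of the record's header fields.

```python
import base64
import hashlib


def sha1_base32(data):
    """Return a digest in the "sha1:<Base32>" form used in the records."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def check_response(fields, block):
    """Compare recomputed values with the header fields of a response record.

    fields is a dict of header fields, block the record content as bytes.
    """
    # Assumption: the payload is everything after the HTTP header section.
    _http_header, _, payload = block.partition(b"\r\n\r\n")
    return {
        "length": int(fields["Content-Length"]) == len(block),
        "block": fields["WARC-Block-Digest"] == sha1_base32(block),
        "payload": fields["WARC-Payload-Digest"] == sha1_base32(payload),
    }
```

Keep in mind what a passing check means: the record is internally consistent, nothing more.  Anyone who writes a record can compute matching digests for whatever content they like.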

For a full description, read the specification.  The ISO draft is available at the BnF and is quite readable. [6]

At the time of this writing, the International Internet Preservation Consortium (IIPC) is working on version 1.1 of the specification - pretty transparently on GitHub, by the way. [7]

What to do with the WARC file?

Replay it.  There are a few tools to render the archived content.  One is the (Open) Wayback Machine you may know from the Internet Archive.  Another one is Pywb, which I prefer for local testing because it is pretty easy to set up and much lighter. [8, 9]

Whatever you use, you set up storage for the WARC files, and the tool of your choice serves the content to be rendered by a browser.

Suppose we are using the Wayback Machine on localhost and the above example with WARC-Target-URI 2600.com.

You would open the URL:

http://localhost:8080/web/20231217040702/http://2600.com
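The 14-digit number in that URL is simply the WARC-Date of the record with the punctuation removed.  A small Python sketch of the mapping, where replay_url is a made-up helper and the /web/ prefix is taken from the example above:

```python
from datetime import datetime


def replay_url(base, warc_date, target_uri):
    """Build a Wayback-style replay URL from a WARC-Date and a target URI."""
    moment = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ")
    timestamp = moment.strftime("%Y%m%d%H%M%S")
    # Strip the angle brackets that Wget puts around WARC-Target-URI.
    return f"{base.rstrip('/')}/{timestamp}/{target_uri.strip('<>')}"


print(replay_url("http://localhost:8080/web",
                 "2023-12-17T04:07:02Z",
                 "<http://2600.com/>"))
# prints http://localhost:8080/web/20231217040702/http://2600.com/
```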

And you'd see what the website 2600.com looked like in December 2023.  Do you see where this is going?

Let's assume we could create WARC files with arbitrary content.  And let us assume further that we could manage to inject such a file into a trusted archive and share a link with Alice and Bob.  Both might be tricked into believing a website looked like something it never did.  Let's call it "post defacing."

0x2: Create a WARC File and Make Bob Trust It

Of course, you could create a WARC file with a text editor.

But computing the hash sums, content lengths, etc. by hand might be a little bit annoying.  You could also set up an environment to crawl a fake site.  I decided to write a Python script to create minimal, valid WARC files. [10]

You call the script:

$ python html2warc $URL $SOURCE $TARGET_FILE

$URL is the root value for the WARC-Target-URI field, $SOURCE must be a directory with the desired content, and $TARGET_FILE is the name of the WARC file.
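To illustrate what "root value" means here, the following Python sketch maps the files in a source directory to the URIs they could be recorded under.  This is not the code of html2warc.  The function target_uris and the rule that index.html stands for the directory URI are assumptions made for this example.

```python
import os
from urllib.parse import quote, urljoin


def target_uris(root_url, source_dir):
    """Map each file below source_dir to a WARC-Target-URI under root_url."""
    root = root_url if root_url.endswith("/") else root_url + "/"
    mapping = {}
    for folder, _subdirs, files in os.walk(source_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            relative = os.path.relpath(path, source_dir).replace(os.sep, "/")
            # Assumption: index.html represents the directory URI itself.
            if relative == "index.html":
                relative = ""
            elif relative.endswith("/index.html"):
                relative = relative[: -len("index.html")]
            mapping[path] = urljoin(root, quote(relative))
    return mapping
```

With http://2600.com as the root, a file index.html in the source directory would map to http://2600.com/ and img/logo.png to http://2600.com/img/logo.png.  Each file's bytes would then become the payload of a response record carrying that URI.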

A proof of concept WARC file can be downloaded from GitHub. [10]

You can upload that file to Webrecorder and watch the result.  Fascinating, isn't it?

Well, Webrecorder isn't an archive and the service explicitly states that.  But are Alice and Bob aware of that?  Checking the trustworthiness of sources isn't a standard procedure in online communication.  Sadly.

To upload the file to Archive.org and trick Bob, things are a little bit more complicated.  You can upload a WARC file with an ordinary user account into a collection.  But then it is stored as the mediatype "texts" and can only be downloaded again as a WARC file.  If you want to change the web memory for a specific site, you have to convince a member of the Archive Team to copy your WARC into their collection and change the media type from "texts" to "web".

Obviously, it is possible to steal the archive login from a member and do it yourself.  No doubt, some Mallorys are trying to do this.

Remember:  It is not about defacing a web site.  It is about changing the political, cultural, and social memory.

0x3: Impact and Responsibility

Putting false documents into trusted archives is not a new threat.

In 2005, the British National Archives detected faked documents which claimed that Heinrich Himmler was murdered in custody.  And in 1967, Gérard de Sède wrote in his book Le trésor maudit de Rennes-le-Château that a guy named Pierre Plantard was a descendant of Dagobert II and therefore the one and only King of France.

De Sède referred to documents found in the National Archives in Paris.  Placed there by, you guessed it, Pierre Plantard.  You can read up on this very interesting case by searching for the "Plantard Dossiers."  I am pretty sure that faked documents have rewritten history before and will do so again in the future.  Web archives are just another playground.  But an important one.

Who's responsible?

Surely, archives have to check their objects and they are responsible for the data they provide - be it books, birth certificates, or web archives.

But in my humble opinion, users also have to check their sources and should not automatically trust something because of its outer packaging.

Remember that Trojan Horse?

Notes

  1. LoC: Web Archiving
  2. BnF: Digital Legal Deposit
  3. Internet Archive
  4. Heritrix
  5. HTTrack
  6. The WARC File Format (ISO 28500) - Information, Maintenance, Drafts
  7. WARC Specifications
  8. OpenWayback
  9. Webrecorder Pywb 2.7
  10. html2warc  Creates WARC files from local web resources.