Format De-Shifting

by Peter C. Gravelle  (peter.c.gravelle+2600@gmail.com)

Have you ever clicked on a link expecting a PDF, even seeing .pdf in the location bar, but instead of your friendly PDF viewer, you see a vaguely familiar interface, but with the "Print" and "Download" buttons removed?

Then it's likely that PDF.js [1] is involved.  But worry not, we can get you that file anyway.

Background

My wife is an apprentice plumber and, as an apprentice, has to take courses to learn her trade.

This particular course was on the New York City Construction Code Plumbing Code [2].  Construction codes are made available by local jurisdictions for many reasons, including inspections, educating tradespeople, and a general commitment to transparency in government.  Previous versions of the code were available in the PDF format.

However, for the 2014 edition, the New York City Department of Buildings decided to use a new piece of HTML5 tech: PDF.js.

PDF.js is a PDF viewer from the Mozilla Foundation, written in JavaScript and rendered with HTML5.

This means that any device that supports modern web standards and runs JavaScript can view PDF files.  This is a boon for several reasons.  The biggest is that mobile browsers with limited plugin support can view PDF files without mangling their formatting much.  Desktop browsers benefit too: many PDF viewing plugins are slow to load and resource intensive (Adobe Acrobat, for one).

Finally, PDF.js allows the content provider to (ineffectually, it turns out) block saving and printing the PDF in question.

Construction folks, as a class, are fairly technologically conservative, and do not appreciate change.

In this particular case, my wife's instructor wanted to turn to sections of the code in class, but could not, as they didn't have an offline copy.  Network access can be iffy on construction sites, so it's good to be able to keep a local copy in that case as well.

The instructor challenged anyone in the class to get PDF copies for him.  My wife took up the challenge, but did not want to simply "Print to PDF," as this would kill the anchor links.  She reached out to me, and we began our investigation.

What Tipped Me Off

A few clues made it seem like this was possible.

The first thing was the URL itself, which included a reference to a filename ending in .pdf [3], as well as several references to: /pdf_viewer/
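That first clue can be checked mechanically.  What follows is a minimal sketch, not anything the site or PDF.js provides: pdfNameFromViewerUrl is a hypothetical helper, and the "http://" scheme on the example URL is an assumption, since the link in the list at the end is given without one.  It only reads the "file" query parameter; it does not tell you where on the server that file actually lives.

```javascript
// Hypothetical helper: pull the PDF filename out of a viewer-style URL
// by reading its "file" query parameter.
function pdfNameFromViewerUrl(viewerUrl) {
  const url = new URL(viewerUrl);
  const file = url.searchParams.get("file"); // null when there is no "file" parameter
  if (file === null || !/\.pdf$/i.test(file)) {
    return null;
  }
  return file;
}

// Example URL built from link 3; the "http://" prefix is an assumption.
const exampleViewerUrl =
  "http://www.nyc.gov/html/dob/apps/pdf_viewer/viewer.html" +
  "?file=2014CC_PC_Chapter1_Administration.pdf&section=conscode_2014";

console.log(pdfNameFromViewerUrl(exampleViewerUrl));
```

A URL with no "file" parameter, or one that does not end in .pdf, comes back as null, which is a hint that some other viewer is in play.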

Second, when I inspected the HTML code itself, I found each paragraph in <div> tags with very precise data-canvas-width attributes - out to over ten decimal places.  No human would ever write that!

So I took a look at the various <script> inclusions and eventually found a reference to PDF.js and the Mozilla Foundation.  A little quality time with a search engine and the framework's documentation, and I stumbled my way into three possible methods of downloading the original PDF file.

Method 1: Ask Where It Got the File

My first method was the most direct.

The documentation for PDF.js made it clear that most of the magic that happened took place in the PDFView object.

So I opened up the JavaScript console in Chrome (or Firefox's Web Console) and looked at the various children of the PDFView object.

A quick glance gave me: PDFView.url

Copy that URL out and put it into a new tab, and down comes the file!
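If you would rather not eyeball the object, the same lookup can be wrapped in a few lines.  This is a sketch under assumptions: getViewerPdfUrl is a hypothetical helper, it assumes the page exposes a global PDFView with a string url property as described above, and it assumes the value may be relative to the page, so it resolves it against the page's address.  The example.com address and the stand-in object are made up so the snippet runs outside a browser.

```javascript
// Hypothetical helper: return the absolute URL the viewer loaded,
// or null if the given scope has no PDFView object with a url.
function getViewerPdfUrl(scope, pageHref) {
  const viewer = scope.PDFView;
  if (!viewer || typeof viewer.url !== "string") {
    return null;
  }
  // Resolve a possibly relative url against the address of the viewer page.
  return new URL(viewer.url, pageHref).href;
}

// Stand-in for a real page, so the sketch runs anywhere.
const fakePage = { PDFView: { url: "2014CC_PC_Chapter1_Administration.pdf" } };
console.log(getViewerPdfUrl(fakePage, "http://example.com/pdf_viewer/viewer.html"));
```

In the browser console, the equivalent call would be getViewerPdfUrl(window, location.href).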

Method 2: Watch It Get the File

Since the PDF viewer runs in JavaScript on your browser, the PDF is being sent directly to you.  Wouldn't it be handy if you could catch it in flight?

Well, these browser consoles also have a lovely tab called "Network."  Select this tab and run the Network tool, and you can watch the files in flight.

In the case of Chrome, you need to reload the page.  In Firefox, you click the button to start the process.  In Chrome, the PDF is immediately visible and you can click it to download it.  With Firefox, I had to click on the XHR tab to grab the PDF link.
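For those who prefer typing to clicking, there is a console-only variation on the same idea.  To be plain, this is not what I did above: it swaps the Network tab for the browser's Resource Timing entries, which record the address of each resource the page has fetched.  findPdfRequests is a hypothetical helper, and the sample entries are invented so the sketch runs outside a browser.

```javascript
// Hypothetical helper: given resource-timing-style entries (objects with a
// "name" property holding the request URL), keep the ones that look like PDFs.
function findPdfRequests(entries) {
  return entries
    .map((entry) => entry.name)
    // Match ".pdf" at the end of the URL or just before a query string or fragment.
    .filter((name) => /\.pdf(\?|#|$)/i.test(name));
}

// Invented sample data standing in for a real page's entries.
const sampleEntries = [
  { name: "http://example.com/pdf_viewer/viewer.js" },
  { name: "http://example.com/docs/chapter1.pdf" },
  { name: "http://example.com/pdf_viewer/viewer.css" },
];

console.log(findPdfRequests(sampleEntries));
```

In the browser console, you would pass it performance.getEntriesByType("resource") instead of the sample array.  The browser keeps only a limited number of these entries, so the Network tab remains the more reliable view.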

Method 3: Just Ask It for the File Directly

This third method is an excellent demonstration of the value of looking at all options first.

Another child of PDFView is the function: download()

Guess what happens when you run PDFView.download() in your browser console?  Yep, the PDF file is immediately added to your download manager and into your downloads folder.
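Looking at all the options first is easy to do from the console.  The sketch below sorts an object's own members into functions and plain values, so something like download() stands out right next to url.  listMembers is a hypothetical helper, and the stand-in object is invented: it borrows only the two member names mentioned in this article and is in no way a copy of the real PDFView.

```javascript
// Hypothetical helper: split an object's own enumerable members into
// callable functions and everything else.
function listMembers(obj) {
  const members = { functions: [], values: [] };
  for (const key of Object.keys(obj)) {
    if (typeof obj[key] === "function") {
      members.functions.push(key);
    } else {
      members.values.push(key);
    }
  }
  return members;
}

// Invented stand-in so the sketch runs outside a browser.
const fakeViewer = {
  url: "chapter1.pdf",
  download: function () {
    return "download requested";
  },
};

console.log(listMembers(fakeViewer));
```

In the browser console, listMembers(PDFView) would give the same kind of summary for the real object, assuming the page exposes it under that name.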

Conclusion

PDF.js is an excellent tool for a lot of things, including making PDFs more palatable on mobile devices.

But sometimes you want the original format.

And things like the construction code of your city are yours by right to read in whatever format you want.

If you put all the smarts in the browser, then the browser has ultimate control!

Acknowledgments

In all things, I'm grateful to my wife, who puts up with my dicking around with tech until far too late at night.  Thanks to the Mozilla Foundation for making PDF.js, a great tool with excellent documentation and a lovely backdoor of sorts.  Best to faboo, TecknicalTom, and the Neg9 crew, and all those around the world yearning to be free.

Links

  1. PDF.js: mozilla.github.io/pdf.js
  2. New York City 2014 Construction Code - Plumbing Code: www.nyc.gov/html/dob/html/codes_and_reference_materials/2014_cons_codes_table_of_contents.shtml#plumb
  3. NYC CC PC Chapter 1: Administration: www.nyc.gov/html/dob/apps/pdf_viewer/viewer.html?file=2014CC_PC_Chapter1_Administration.pdf&section=conscode_2014