Abusing Metadata

by ChrisJohnRiley

What is Metadata?

Metadata (the prefix comes from the Greek word "meta," used here in the sense of "about") is a rich source of information stored within the structure of a file when it's saved.

This information can include details about the author of the document, date of creation, path information, and which application was used for creating the file.  It can contain a host of potentially useful information to the average bad guy or generally curious type.

How Can You See the Metadata?

Windows:  There are a number of ways to view the metadata contained within files.  Under Windows the easiest way to view simple metadata is to right-click on the file you're interested in and select properties.  It seems simple, and it is.  Although, that said, this won't work for every type of file and won't give you all the information you might want.  With specific file types you'll see a "Summary" tab that will include some basic details.  This information will vary depending on the file type.  Some file types will provide nothing more than a time/date stamp and others will want to tell you their life story, so to speak.

Using Microsoft Office documents as an example, you should see some basic statistical information about the document (number of words, etc.).  Underneath this you'll find the creation date and last edited date, as well as, hopefully, some information about the author and the name of the company the software is registered to.  In some versions of Microsoft Office, you'll also be able to see the exact version of software used to create/edit the file.

Looking at PDF files will also provide a wealth of information.

When you look at the properties of a PDF file, you'll likely see a "PDF" tab that contains specific information about the creation of the PDF document.  As with Microsoft Office, this tab should contain the version of software used to create the file, as well as the usual creation information.  Information about the Author is optional here, and isn't usually automatically entered (unlike Microsoft Office, which will populate this field from your user settings).  If you want to go deeper into metadata, you can pick up a number of third-party tools that will extract the information from documents for you.  Tools like Metaviewer and MetadataAssistant can gather all this information into a single location.

Linux:  Under Linux, the options for extracting metadata are a lot more flexible than they are under Windows.  After all, isn't everything more flexible under Linux?  ;)  At a basic level, you can use the strings command to examine one or more files for human readable strings contained within the file.  This will give you a long output, most of which isn't going to be particularly useful for you.  However, hidden somewhere in this list of strings you'll usually find the same information that I alluded to above.

The power of Linux, however, is that you can take this output and search it for specific strings.  For example:
$ cat file.pdf | strings | grep -i adobe
Will search file.pdf for any strings matching the word "adobe".

This should output a number of strings and hopefully the version of software used to create the file.  You can fine tune this simple search function to look at multiple files, or search for other strings very easily.
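
As a rough sketch of that fine-tuning, the function below runs the same strings/grep pipeline over several files and prefixes each hit with the file it came from.  The name search_strings and its argument order are made up for illustration; it assumes only that strings and grep are installed.

```shell
# search_strings PATTERN FILE...
# Hypothetical helper: case-insensitive search of the printable strings
# in each file, printing "filename: matching string" for every hit.
search_strings() {
    pattern=$1
    shift
    for f in "$@"; do
        # strings pulls out runs of printable characters, grep filters
        # them, and the read loop tags each hit with its source file.
        strings "$f" | grep -i -e "$pattern" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

Something like "search_strings adobe *.pdf" would then cover a whole directory in one go.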

As with Windows, you can also install a number of third-party tools to make metadata searching easier.

Running a quick search on your distribution's software list should pop up two or three options.  Personally, I'd start by looking at the extract tool, as this should offer what you need from a command line and should be easy to find in your package manager.  Command syntax couldn't be easier: extract <filename>

You can use the -p option to set a specific metadata field that you want to see.  For example:
$ extract -p creator test.doc
Will output just the creator data associated with the test.doc file.

What About Image Files?

Good question, I'm glad you asked.

Image files can, and usually do, provide information that can be very informative.  Unlike the document types we covered above, you'll probably need to install a specific application to get at the really interesting data stored in image files.  You can get basic text output from the extract tool.  However, if you search for "EXIF" on Google, you'll come across a number of command line and GUI applications that will do a little more for you.  Personally, I use the ExifTool application written by Phil Harvey (sometimes with the ExifTool GUI, if I'm feeling really lazy).

If you're on Linux, you can get the libimage-exiftool-perl module direct from your repository.

For Windows users, you can get an installer from Phil Harvey's website (see links below).  Image files include a number of EXIF tags that contain a wealth of information about the type and model of camera used to take the picture, as well as thumbnail information and even GPS data, if the camera is fitted with a GPS receiver (like the iPhone, for example).

The thumbnail information can be useful depending on the way the picture has been edited.  If a thumbnail image isn't re-created after editing, then the thumbnail will represent the original picture and not the edited, cropped, or touched-up final version.  Using the ExifTool you can easily export this data by typing:
$ exiftool -b -ThumbnailImage image.jpg > image_thumb.jpg
This has been used more than once with embarrassing results.

Search for "Cat Schwartz exif" or "Meredith Salenger exif" for more information (not safe for work).  There are many more possibilities here, so your best bet is to check out the ExifTool documentation.

What Else Can You See?

It's common in business to work in teams when creating specific types of documents.

Collaboration is a big thing for companies like Microsoft, especially when it comes to the marketing team needing to make changes to public documents.  The back-and-forth goes on until the final document is completed.  Within Word, the information for each revision of the document is stored unless it's specifically stripped from the document.  Using various methods, it's possible to view information on the revisions that took place within the document (if they've not been cleaned prior to publishing).

A prime example of this is the research done by Michal Zalewski back in 2004.  He wrote an article about data stored within Microsoft Office documents.  The article can still be read on his website, along with a (now) outdated tool called the Revisionist (therev.tgz) that extracts the revision information.

I'll not rehash the contents of the story here, as we're all more than capable of clicking a few links.  However, suffice it to say, it was a little embarrassing for Microsoft to have the revision history of their publicly available documents, including writers' notes and changes, exposed on the Internet.  Microsoft quickly got the message and began cleaning metadata from the files it uploaded.  Other companies, though, don't seem to have gotten that clue just yet.

I regularly find metadata in files when performing penetration tests.

This information can be extracted and used to our advantage.  In order to see this information, you can open up the files in Word and select to review all revisions through the collaboration options.  This can obviously get a little long-winded if you're searching an entire website's worth of data.  The Revisionist tool was designed to do this automatically on entire directories.  However, time has moved on since the tool was first made, and running it on more recent documents results in errors.  This just means that we need to break out the trusty Linux toolbox and take a look.

Using Office 2007 as an example, we can take the DOCX file and expand it using unzip (DOCX is a container and not just a document, after all).  Once expanded, you'll find the collaboration information in the "./word/document.xml" file.

You can see additions and deletions based on their XML tags.

For example, deleted entries are surrounded by <w:delText> tags.

You can easily find these in Linux using:
$ sed -n -e 's/.*<w:delText>\(.*\)<\/w:delText>.*/\1/p' document.xml > deleted.out
Comments can similarly be found by looking at the ./word/comments.xml file.
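
One caveat with the sed one-liner: Word normally writes document.xml as a single long line, and the opening tag often carries attributes (xml:space="preserve", for example), so a line-oriented match may return only one deletion or none at all.  The sketch below is one way around that, using grep -o to pull out every match separately.  The function names are made up for illustration, -o is a GNU/BSD grep option rather than strict POSIX, and the patterns assume the text between the tags contains no "<" character.

```shell
# docx_deleted_text: read word/document.xml on stdin and print each
# deleted run of text on its own line.
docx_deleted_text() {
    # -o prints every match separately, even when the XML is one long
    # line; [^>]* allows for attributes such as xml:space="preserve".
    grep -o '<w:delText[^>]*>[^<]*</w:delText>' |
        sed -e 's/<[^>]*>//g'
}

# docx_revision_authors: list the distinct w:author values that Word
# attaches to tracked insertions, deletions and comments.
docx_revision_authors() {
    grep -o 'w:author="[^"]*"' |
        sed -e 's/^w:author="//' -e 's/"$//' |
        sort -u
}
```

Both functions read standard input, so "docx_deleted_text < document.xml" works on an expanded file, and "unzip -p report.docx word/document.xml | docx_revision_authors" works without expanding anything to disk (report.docx being a placeholder name).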

Why is all this Useful?

Why should you care about this information?

Well, there are a number of reasons.  Obviously, for one, we all value our privacy and nobody likes to think that a document we've written will contain possibly sensitive information about us.  Taking it from another point of view, however, as a penetration tester, metadata is a treasure trove of useful information.

Simply finding a few PDF and Word documents on a website could give me enough information to launch a focused, client-side attack.  I'll run you through the process, step-by-step.

After gathering some files from the target company (possibly using a Google search such as "site:target.com filetype:pdf"), I can run the PDF files through strings/extract and isolate the information that I want.

Not all files are going to contain useful data, so it's best to check multiple files from various sources (website, emailed press releases, etc.).
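
To make checking a batch of files less tedious, something like the following sketch can tally what a pile of PDFs has to say.  It assumes the files keep their document information in the clear as /Producer, /Creator, /Author and /CreationDate entries, which is common but far from universal (compressed or encrypted PDFs won't show them this way), and the function name is made up.

```shell
# pdf_info_summary FILE...
# Hypothetical helper: count how often each distinct /Producer, /Creator,
# /Author or /CreationDate value appears across the given PDF files.
pdf_info_summary() {
    for f in "$@"; do
        strings "$f"
    done |
        # The pattern allows one level of nested parentheses inside the
        # value, as in a product name followed by "(Windows)".
        grep -o -E '/(Producer|Creator|Author|CreationDate) ?\(([^()]|\([^()]*\))*\)' |
        sort | uniq -c | sort -rn
}
```

The most common values float to the top, which makes it easier to tell the software a company actually uses from a one-off.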

Following our example to the next stage, I can see from the metadata I've extracted that the company is using Adobe Acrobat Professional 8.1.2 for Windows (this is listed in the metadata as the product used to create the PDF files).  I also find the full names of several authors who wrote documents for the website.

The final piece of information is the document creation date.

From the creation date I can see that they wrote the documents last month, so the information I've extracted from the metadata is relatively current.  After all, no point in using outdated data.  Armed with the name of the author and the content of the documents, I call the company reception (probably late evening or lunchtime, in the hope that my target is away).  Using the information I've gathered, I simply ask for the email address of the target, so that I can forward him a new revision of the document for consideration.  Simple request, nothing too heavy.  Maybe you can even skip this step if you can determine the email address based on other information gathered from the Internet.  Google hacking is your friend here.

Now it's time to write him an email.

Taking one of the PDF files I examined earlier, I edit it to insert a client-side exploit.  Adobe Acrobat 8.1.2 has a known flaw that can be exploited using malformed PDF files.  I won't go into how to achieve this here, as that's not the point of this article.  From here, it's a simple case of writing a believable email that is convincing enough to get him to open my version of the PDF.  As you're targeting a specific individual or group, this shouldn't be too hard to achieve.  With this done, it's time to sit back and wait for the exploit to run.  What happens from here is up to you.  Without the valuable metadata, this attack would have been a lot harder to achieve.  There would have been no specific target information and no idea which client-side exploit could work.  Of course, metadata didn't make this user vulnerable.  He was always vulnerable.  It just made things easier for us to exploit.

Can I Remove Metadata?

If you want to remove the metadata stored in your documents, there are a number of options.

Microsoft has released an add-in for Office XP/2003, as well as building a feature into Office 2007 to clean metadata from files.  Both of these options will strip specific metadata from Microsoft Office files as you save them.  Adobe has also begun incorporating metadata removal into their latest versions.  There are also a number of third-party tools on offer, like iScrub and 3BView, that do the same job.

If you're looking to ensure that all of your files are metadata free, then the third-party offerings are probably where you'll find the best options.  The Microsoft solutions, although handy, do little to protect you from all those documents you have saved on your servers, desktops and, no doubt, online.

Plus, there will always be the odd user who forgets to clean the metadata before saving.  For bulk cleaning, you'll have to look beyond the desktop plugins, but there are enterprise solutions out there.

Resources

Michal Zalewski (Revisionist) - lcamtuf.coredump.cx

Larry Pesce (Metadata the Silent Killer) - www.sans.org/reading_room

Phil Harvey (ExifTool) - exiftool.org
