Hacks, Leaks, and Revelations: The Art of Analyzing Hacked and Leaked Data

by Micah Lee

Greetings, hackers!

Back in 2012, when I was working at EFF as a staff technologist, I got an anonymous, PGP-encrypted email asking if I could teach journalists how to use end-to-end encryption.  I like encryption, and sometimes journalists are cool, so I went ahead and did it.

A few months later, I discovered that I had been talking with Edward Snowden while he was leaking TOP SECRET NSA documents.  I spent the next few years analyzing and reporting on the Snowden Archive for The Intercept, helping publish over 2,000 secret documents from that dataset.

We brought the issues of privacy and government surveillance to the forefront of public consciousness, leading to the widespread adoption of privacy-protecting technologies.  (Today, I'm The Intercept's director of information security.)

Huge hacked and leaked datasets like the Snowden Archive used to be rare, but today they're incredibly common.  Pretty much every day, new data gets dumped online for anyone curious enough to look at it!

Sometimes datasets come from politically motivated hacks, like the million emails hacked from Russia's puppet government in Donetsk, one of the territories Russia illegally annexed from Ukraine in 2022.

Other times people simply leave their digital doors wide open, like when the American College of Pediatricians - which the SPLC calls a "fringe anti-LGBTQ hate group" - left a Google Drive folder with 20 GB of documents open to anyone who found the link to it.

And sometimes datasets are completely public, like the million videos uploaded to the so-called "far-right" social network Parler, where Trump supporters filmed themselves storming the Capitol on January 6, 2021 to subvert democracy.

The problem is, few people have the technical skills they need to dig into these datasets and extract their secrets, so most of this data never gets looked at, and the secrets it contains - evidence of corruption, misconduct, crimes - stay hidden forever.  The few data journalists who do this sort of work today don't have time to handle the never-ending flood of leaked data, so we're forced to simply ignore most of the datasets we hear about.

There aren't nearly enough of us.  But I'm hoping to change that.  Will you join us?

I've spent the last two years writing a book to teach journalists, researchers, activists, hackers, and anyone else who wants to learn the technologies and coding skills required to investigate hacked and leaked data.  My book, Hacks, Leaks, and Revelations: The Art of Analyzing Hacked and Leaked Data, was published in January and it's available now.  Check it out at hacksandleaks.com.

My goal is to give anyone who's curious and motivated the skills they need to download and analyze their own datasets, extract the revelations they contain, and transform previously unintelligible information into ground-breaking reports.

I've worked hard to make my book as accessible as possible: I don't assume any prior knowledge.  Analyzing datasets requires that you do things that some people find intimidating, like typing commands into terminal windows and writing Python code, but I hold your hand the entire time, walking you through each step from the very beginning in a way that anyone can follow.

Along with lessons on programming and technical tools, I've incorporated many anecdotes and first-hand tips from the trenches of investigative journalism.  If you follow along with the book, in a series of hands-on projects, you'll work with real datasets, including those from police departments, fascist groups, militias, a Russian ransomware gang, and social networks.  Throughout, you'll engage head-on with the Dumpster fire that is 21st century current events: the rise of neo-fascism and the rejection of objective reality, the extreme partisan divide, and an Internet overflowing with misinformation.

All you need to get started is a computer running Windows, macOS, or Linux, a hard disk with about 1 TB of disk space available to store some datasets, an Internet connection, and the willingness to learn new skills.

Want to join our ranks and use your skills to make a positive impact on the world?  Here's what you'll learn from Hacks, Leaks, and Revelations:

Part I: Sources and Datasets

Part I discusses issues you should resolve before you start analyzing datasets: how to protect your sources, how to keep your datasets and your research secure, and how to acquire datasets safely.

You'll learn about things like safely communicating with sources using Signal and Tor, encrypting data, and verifying that datasets are authentic.

As an example, I describe how I confirmed that internal chat logs that a WikiLeaks whistleblower leaked to me were legit.  You'll also learn about downloading datasets from DDoSecrets using BitTorrent.  You'll then download a copy of BlueLeaks, a collection of 270 GB of data hacked from hundreds of U.S. law enforcement websites in the summer of 2020 during the Black Lives Matter protests.  As you'll see, it's full of evidence of police misconduct.

Part II: Tools of the Trade

In Part II, you'll practice using the command line interface to quickly assess leaked datasets and to use tools that don't have graphical interfaces, developing skills you'll apply extensively throughout the rest of the book.

You'll also learn how to set up servers in the cloud to remotely analyze leaked datasets, using a hack of the Oath Keepers email as an example - this is the so-called "far-right" militia that participated in a seditious conspiracy to keep Trump in power after he lost the 2020 election.

And you'll use Docker to set up your own Aleph server, investigative journalism software that can index large datasets, find connections for you, and search the data for keywords.

And finally, there's a chapter called "Reading Other People's Email" where you'll get hands-on experience working with email dumps, including emails from the Nauru Police Force (Nauru hosts abuse-ridden off-shore detention centers for Australia, full of refugees and asylum seekers) and the conservative (and notoriously homophobic) think-tank The Heritage Foundation.


Part III: Python Programming

In Part III, you'll get a crash course in writing Python code, focusing on the skills required to analyze the hacked and leaked datasets covered in future chapters.

This is a Python course for complete beginners, but I think experienced programmers will benefit from parts of it too.  You'll put your coding theory into practice by writing several Python scripts to help you investigate BlueLeaks and explore leaked chat logs from the Russian ransomware gang Conti.

Part IV: Structured Data

In Part IV, you'll learn to work with some of the most common file formats in hacked and leaked datasets.

You'll dig deep into CSV files (and spreadsheets in general) while investigating BlueLeaks.  You'll also learn about the JSON file format using the Parler dataset - you'll write code to scour through over a million pieces of video metadata (much of it with GPS coordinates) to track down the videos that were filmed on January 6, 2021 in Washington, D.C.  A lot of these videos were used as evidence in Trump's second impeachment inquiry.
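Here is a minimal sketch of that kind of filtering in Python.  The field names (id, created, latitude, longitude), the timestamp format, and the bounding box around Washington, D.C. are all assumptions made for illustration; they are not the actual layout of the Parler metadata or code from the book.

```python
import json
from datetime import date, datetime

# Rough bounding box around Washington, D.C. (illustrative values, an assumption)
DC_BOX = {"min_lat": 38.79, "max_lat": 39.00, "min_lon": -77.12, "max_lon": -76.90}
TARGET_DAY = date(2021, 1, 6)


def filmed_in_dc_on_target_day(record):
    """Return True if a metadata record matches both the date and the area."""
    try:
        # Hypothetical field names and timestamp format
        created = datetime.strptime(record["created"], "%Y-%m-%d %H:%M:%S")
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        # Records with missing or malformed fields are skipped
        return False
    return (
        created.date() == TARGET_DAY
        and DC_BOX["min_lat"] <= lat <= DC_BOX["max_lat"]
        and DC_BOX["min_lon"] <= lon <= DC_BOX["max_lon"]
    )


def find_matches(json_text):
    """Parse a JSON array of metadata records and return the matching ids."""
    records = json.loads(json_text)
    return [r.get("id") for r in records if filmed_in_dc_on_target_day(r)]


# Made-up sample records, not real data
sample = json.dumps([
    {"id": "a1", "created": "2021-01-06 14:15:00", "latitude": 38.8899, "longitude": -77.0091},
    {"id": "b2", "created": "2021-01-06 14:15:00", "latitude": 40.7128, "longitude": -74.0060},
    {"id": "c3", "created": "2021-01-06 14:15:00"},
    {"id": "d4", "created": "2021-01-05 09:00:00", "latitude": 38.8899, "longitude": -77.0091},
])

print(find_matches(sample))
```

A bounding box is a crude filter; a real investigation would likely need to handle time zones and records whose GPS fields are stored in a different format.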

  
You'll also learn how to extract revelations from SQL databases by working with the Epik Fail dataset.  Epik is a Christian nationalist company that provides domain name and web hosting services to the so-called "far-right," including sites known for hosting the manifestos of mass shooters.

Anonymous hacked them in 2021.  You'll be able to use this data to bypass Epik's WHOIS privacy service and find the real ownership information behind extremist websites like oathkeepers.org and 8chan.co.
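To give a flavor of what looking up ownership information in a SQL database involves, here is a small sketch using Python's built-in sqlite3 module.  The table name, column names, and sample rows are hypothetical placeholders, not the schema or contents of the Epik Fail dataset, and the real data may well live in a different database engine.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'domains' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE domains (name TEXT, registrant_name TEXT, registrant_email TEXT)"
    )
    # Placeholder rows for illustration only
    conn.executemany(
        "INSERT INTO domains VALUES (?, ?, ?)",
        [
            ("example.com", "Alice Example", "alice@example.org"),
            ("example.net", "Bob Example", "bob@example.org"),
        ],
    )
    return conn


def lookup_owner(conn, domain):
    """Return (registrant_name, registrant_email) for a domain, or None."""
    # A parameterized query keeps the domain value out of the SQL string itself
    return conn.execute(
        "SELECT registrant_name, registrant_email FROM domains WHERE name = ?",
        (domain,),
    ).fetchone()


conn = build_demo_db()
print(lookup_owner(conn, "example.com"))
```

The same pattern of one table, one WHERE clause, and one parameter carries over to larger databases, though real datasets usually spread ownership details across several tables that need to be joined.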

Part V: Case Studies

Part V covers two in-depth case studies from my own career, describing how I conducted major investigations using the skills you've learned so far.  In both, I explain my investigative process: how I obtained my datasets, how I analyzed them, what Python code I wrote to aid this analysis, what revelations I discovered, and what social impact my journalism had.

One of the case studies goes over my investigation into America's Frontline Doctors, a Trump-aligned anti-vax group that, along with a network of shady telehealth companies, swindled tens of millions of dollars out of vaccine skeptics during the pandemic by selling them fake COVID-19 cures like ivermectin and hydroxychloroquine.

My report led to a congressional investigation.

The other describes massive datasets of leaked so-called "neo-Nazi" chat logs, and my role in developing a public investigation tool for such datasets called DiscordLeaks.  This tool aided in a successful lawsuit against the organizers of the Unite the Right rally in Charlottesville in 2017, resulting in a settlement of over $25 million in damages against the leaders of the American fascist movement.

Everyone should have access to the information in this book, no matter their income or what part of the world they live in.  So, to remove barriers to access, I've also released Hacks, Leaks, and Revelations under a Creative Commons license (CC BY-NC-SA 4.0).

In other words, I'm giving it away for free!

You can start reading it right now on the book's website at hacksandleaks.com.  If you can afford it, please consider supporting my work by buying a copy.  The physical book is a lot nicer to read than in a web browser anyway.

And if you see me at HOPE, I'll sign it for you!
