A lot has been written about data leaks and the information they contain, but few books tell you how to analyze leaked data yourself. That is the subject of the recently published Hacks, Leaks and Revelations.
This is a unique, interesting and informative book by Micah Lee, the director of information security for The Intercept, who has written numerous stories about leaked data over the years, including a dozen articles on some of the contents of the Snowden NSA files. What sets it apart is that Lee teaches you the skills and techniques he used to investigate these datasets, so readers can follow along and do their own analysis with this data and others, such as emails from the far-right group Oath Keepers. There are also materials leaked from the Heritage Foundation and chat logs from the Russian ransomware group Conti. This is a book for budding data journalists, as well as for infosec specialists who are trying to harden their data infrastructure and prevent future leaks from happening.
Many of these databases can be found on DDoSecrets, the organization that arose from the ashes of WikiLeaks and where Lee is an adviser.
Lee’s book is also unique in that he starts off with ways that readers can protect their own privacy and that of potential data sources, as well as ways to verify that the data is authentic, something that even many experienced journalists might want to brush up on. “Because so much of source protection is beyond your control, it’s important to focus on the handful of things that aren’t.” This includes, for example, deleting records of interviews, cloud-based data and local browsing history. “You don’t want to end up being a pawn in someone else’s information warfare,” he cautions. He spends time explaining what not to publish and how to redact the data, drawing on his own experience with some very sensitive sources.
One of the interesting facts that I never spent much time thinking about before reading Lee’s book is that while it is illegal to break into a website and steal data, it is perfectly legal for anyone to make a copy of that data once it has been made public and conduct their own investigation.
Another reason to read Lee’s book is that there is so much practical how-to information, explained in such simple step-by-step terms that even computer neophytes can quickly put it into practice. Each chapter has a series of exercises, split out by operating system, with directions. A good part of the book dives into the command line interface of Windows, Mac and Linux, and how to harness the power of these built-in tools.
Along the way you’ll learn Python scripting to automate the various analytical tasks, and use some of the custom tools that he and his colleagues have made freely available. Automation and the resulting data visualization are both key, because the alternative is a very tedious line-by-line examination of the data. To make things real, he walks through searching the BlueLeaks data (a collection of data from various law enforcement websites that document misconduct) for “antifa.” The book also covers other tools, such as the encrypted messaging app Signal and BitTorrent, and offers advice on using disk encryption tools and password managers. Lee explains how they work and how he used them in his own data explorations.
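To give a sense of what automating a keyword search like the “antifa” one might look like, here is a minimal sketch in Python. It is not Lee’s code or one of his tools. The function name search_files, the folder name in the usage comment and the list of file extensions are my own assumptions for illustration, and a real dataset would likely need more careful handling of file formats.

```python
import os


def search_files(root, keyword, extensions=(".csv", ".txt", ".html")):
    """Yield (path, line_number, line) for every line that contains keyword.

    The match is case-insensitive. Only files whose names end in one of
    the given extensions are scanned (an assumed, illustrative list).
    """
    needle = keyword.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going if a file has bad bytes
            with open(path, encoding="utf-8", errors="replace") as f:
                for line_number, line in enumerate(f, start=1):
                    if needle in line.lower():
                        yield path, line_number, line.strip()


# Hypothetical usage, assuming the dataset was unpacked into ./BlueLeaks:
# for path, line_number, line in search_files("BlueLeaks", "antifa"):
#     print(f"{path}:{line_number}: {line[:120]}")
```

Even a short script like this turns hours of opening files by hand into a single list of hits that you can then review one by one.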
One chapter goes into details about how to read other people’s email, which is a popular activity with stolen data.
The book ends with a series of case studies taken from his own reporting, showing how he conducted his investigations, what code he wrote and what he discovered. The cases include leaks from neo-Nazi chat logs, the anti-vax misinformation group America’s Frontline Doctors and videos leaked from the social media site Parler that were used during one of Trump’s impeachment trials. Do you detect a common thread here? These case studies show how hard data analysis is, but they also walk you through Lee’s processes and tools to illustrate its power.
Lee’s book is really the syllabus for a graduate-level course in data journalism, and should be a handy reference for beginners and more experienced readers. If you are a software developer, most of his advice and examples will be familiar. But if you are an ordinary computer user, you can quickly gain a lot of knowledge and see how one tool works with another to build an investigation. As Lee says, “I hope you’ll use your skills to discover and publish secret revelations, and to make a positive impact on the world while you’re at it.”