What We’ve Learned About Sharing Our Data Analysis


Jeremy Singer-Vine at Source: “Last Friday morning, Jessica Garrison, Ken Bensinger, and I published a BuzzFeed News investigation highlighting the ease with which American employers have exploited and abused a particular type of foreign worker—those on seasonal H–2 visas. The article drew on seven months’ worth of reporting, scores of interviews, hundreds of documents—and two large datasets maintained by the Department of Labor.

That same morning, we published the corresponding data, methodologies, and analytic code on GitHub. This isn’t the first time we’ve open-sourced our data and analysis; far from it. But the H–2 project represents our most ambitious effort yet. In this post, I’ll describe our current thinking on “reproducible data analyses,” and how the H–2 project reflects those thoughts.

What Is “Reproducible Data Analysis”?

It’s helpful to break down a couple of slightly oversimplified definitions. Let’s call “open-sourcing” the act of publishing the raw code behind a software project. And let’s call “reproducible data analysis” the act of open-sourcing the code and data required to reproduce a set of calculations.

Journalism has seen a mini-boom of reproducible data analysis in the past year or two. (It’s far froma novel concept, of course.) FiveThirtyEight publishes data and re-runnable computer code for many of their stories. You can download the brains and brawn behind Leo, the New York Times’ statistical model for forecasting the outcome of the 2014 midterm Senate elections. And if you want to re-runBarron’s magazine’s analysis of SEC Rule 605 reports, you can do that, too. The list goes on.

….

Why Reproducible Data Analysis?

At BuzzFeed News, our main motivation is simple: transparency. If an article includes our own calculations (and are beyond a grade-schooler’s pen-and-paper calculations), then you should be able to see—and potentially criticize—how we did it…..

There are reasons, of course, not to publish a fully-reproducible analysis. The most obvious and defensible reason: Your data includes Social Security numbers, state secrets, or other sensitive information. Sometimes, you’ll be able to scrub these bits from your data. Other times, you won’t. (Adetailed methodology is a good alternative.)

How To Publish Reproducible Data Analysis?

At BuzzFeed News, we’re still figuring out the best way to skin this cat. Other news organizations might be arrive at entirely opposite conclusions. That said, here are some tips, based on our experience:

Describe the main data sources, and how you got them. Art appraisers and data-driven reporters agree: Provenance matters. Who collected the data? What universe of things does it quantify? How did you get it?.… (More)”