Can big databases be kept both anonymous and useful?


The Economist: “….The anonymisation of a data record typically means the removal from it of personally identifiable information. Names, obviously. But also phone numbers, addresses and various intimate details like dates of birth. Such a record is then deemed safe for release to researchers, and even to the public, to make of it what they will. Many people volunteer information, for example to medical trials, on the understanding that this will happen.

But the ability to compare databases threatens to make a mockery of such protections. Participants in genomics projects, promised anonymity in exchange for their DNA, have been identified by simple comparison with electoral rolls and other publicly available information. The health records of a governor of Massachusetts were plucked from a database, again supposedly anonymous, of state-employee hospital visits using the same trick. Reporters sifting through a public database of web searches were able to correlate them in order to track down one, rather embarrassed, woman who had been idly searching for single men. And so on.
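
As a rough illustration of the "simple comparison" described above, the sketch below joins a release stripped of names to a public list on the fields the two share. Everything in it is an assumption made for illustration: the records are invented, and the choice of postcode, birth year and sex as the shared fields is hypothetical rather than taken from any of the cases mentioned.

```python
# Made-up records for illustration only; no real people or data sets.
anonymised_visits = [
    {"postcode": "00001", "birth_year": 1950, "sex": "M", "diagnosis": "condition A"},
    {"postcode": "00002", "birth_year": 1972, "sex": "F", "diagnosis": "condition B"},
]
public_roll = [
    {"name": "Person One", "postcode": "00001", "birth_year": 1950, "sex": "M"},
    {"name": "Person Two", "postcode": "00003", "birth_year": 1981, "sex": "F"},
]

# Hypothetical choice of fields that appear in both data sets.
SHARED_FIELDS = ("postcode", "birth_year", "sex")


def link_records(anonymised, public, keys=SHARED_FIELDS):
    """Return (name, record) pairs where exactly one public entry shares all keys."""
    # Index the public list by the combination of shared field values.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # A unique candidate means the "anonymous" record points at one person.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], record))
    return matches


matches = link_records(anonymised_visits, public_roll)
```

With these invented records, the first visit is the only one whose combination of fields appears in the public list, so it is the only one linked to a name. The more precise the shared fields, the more often a combination is unique.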

Each of these headline-generating stories creates a demand for more controls. But that, in turn, deals a blow to the idea of open data—that the electronic “data exhaust” people exhale more or less every time they do anything in the modern world is actually useful stuff which, were it freely available for analysis, might make that world a better place.

Of cake, and eating it

Modern cars, for example, record in their computers much about how, when and where the vehicle has been used. Comparing the records of many vehicles, says Viktor Mayer-Schönberger of the Oxford Internet Institute, could provide a solid basis for, say, spotting dangerous stretches of road. Similarly, an opening of health records, particularly in a country like Britain, which has a national health service, and cross-fertilising them with other personal data, might help reveal the multifarious causes of diseases like Alzheimer’s.

This is a true dilemma. People want both perfect privacy and all the benefits of openness. But they cannot have both. The stripping of a few details as the only means of assuring anonymity, in a world choked with data exhaust, cannot work. Poorly anonymised data are only part of the problem. What may be worse is that there is no standard for anonymisation. Every American state, for example, has its own prescription for what constitutes an adequate standard.

Worse still, devising a comprehensive standard may be impossible. Paul Ohm of Georgetown University, in Washington, DC, thinks that this is partly because the availability of new data constantly shifts the goalposts. “If we could pick an industry standard today, it would be obsolete in short order,” he says. Some data, such as those about medical conditions, are more sensitive than others. Some data sets provide great precision in time or place, others merely a year or a postcode. Each set presents its own dangers and requirements.

Fortunately, there are a few easy fixes. Thanks in part to the headlines, many now agree that public release of anonymised data is a bad move. Data could instead be released piecemeal, or kept in-house and accessible by researchers through a question-and-answer mechanism. Or some users could be granted access to raw data, but only in strictly controlled conditions.

All these approaches, though, are anathema to the open-data movement, because they limit the scope of studies. “If we’re making it so hard to share that only a few have access,” says Tim Althoff, a data scientist at Stanford University, “that has profound implications for science, for people being able to replicate and advance your work.”

Purely legal approaches might mitigate that. Data might come with what have been called “downstream contractual obligations”, outlining what can be done with a given data set and holding any onward recipients to the same standards. One perhaps draconian idea, suggested by Daniel Barth-Jones, an epidemiologist at Columbia University, in New York, is to make it illegal even to attempt re-identification….(More).”

One way traffic: The open data initiative project and the need for an effective demand side initiative in Ghana


Paper by Frank L. K. Ohemeng and Kwaku Ofosu-Adarkwa in the Government Information Quarterly: “In recent years the necessity for governments to develop new public values of openness and transparency, and thereby increase their citizenries’ sense of inclusiveness, and their trust in and confidence about their governments, has risen to the point of urgency. The decline of trust in governments, especially in developing countries, has been unprecedented and continuous. A new paradigm that signifies a shift to citizen-driven initiatives over and above state- and market-centric ones calls for innovative thinking that requires openness in government. The need for this new synergy notwithstanding, Open Government cannot be considered truly open unless it also enhances citizen participation and engagement. The Ghana Open Data Initiative (GODI) project strives to create an open data community that will enable government (supply side) and civil society in general (demand side) to exchange data and information. We argue that the GODI is too narrowly focused on the supply side of the project, and suggest that it should generate an even platform to improve interaction between government and citizens to ensure a balance in knowledge sharing with and among all constituencies….(More)”

What factors influence transparency in US local government?


Grichawat Lowatcharin and Charles Menifield at LSE Impact Blog: “The Internet has opened a new arena for interaction between governments and citizens, as it not only provides more efficient and cooperative ways of interacting, but also more efficient service delivery, and more efficient transaction activities. …But to what extent does increased Internet access lead to higher levels of government transparency? …While we found Internet access to be a significant predictor of Internet-enabled transparency in our simplest model, this finding did not hold true in our most extensive model. This does not negate the fact that the variable is an important factor in assessing transparency levels and Internet access. …. Our data shows that total land area, population density, percentage of minority, education attainment, and the council-manager form of government are statistically significant predictors of Internet-enabled transparency. These findings both confirm and negate the findings of previous researchers. For example, while the effect of education on transparency appears to be the most consistent finding in previous research, we also noted that the rural/urban (population density) dichotomy and the education variable are important factors in assessing transparency levels. Hence, as governments create strategic plans that include growth models, they should not only consider the budgetary ramifications of growth, but also the fact that educated residents want more web based interaction with government. This finding was reinforced by a recent Census Bureau report indicating that some of the cities and counties in Florida and California had population increases greater than ten thousand persons per month during the period 2013-2014.

This article is based on the paper ‘Determinants of Internet-enabled Transparency at the Local Level: A Study of Midwestern County Web Sites’, in State and Local Government Review. (More)”

Making data open for everyone


Kathryn L.S. Pettit and Jonathan Schwabish at UrbanWire: “Over the past few years, there have been some exciting developments in open source tools and programming languages, business intelligence tools, big data, open data, and data visualization. These trends, and others, are changing the way we interact with and consume information and data. And that change is driving more organizations and governments to consider better ways to provide their data to more people.

The World Bank, for example, has a concerted effort underway to open its data in better and more visual ways. Google’s Public Data Explorer brings together large datasets from around the world into a single interface. For-profit providers like OpenGov and Socrata are helping local, state, and federal governments open their data (both internally and externally) in newer platforms.

We are firm believers in open data. (There are, of course, limitations to open data because of privacy or security, but that’s a discussion for another time). But open data is not simply about putting more data on the Internet. It’s not just about posting files and telling people where to find them. To allow and encourage more people to use and interact with data, that data needs to be useful and readable not only by researchers, but also by the dad in northern Virginia or the student in rural Indiana who wants to know more about their public libraries.

Open data should be easy to access, analyze, and visualize

Many are working hard to provide more data in better ways, but we have a long way to go. Take, for example, the Congressional Budget Office (full disclosure, one of us used to work at CBO). Twice a year, CBO releases its Budget and Economic Outlook, which provides the 10-year budget projections for the federal government. Say you want to analyze 10-year budget projections for the Pell Grant program. You’d need to select “Get Data” and click on “Baseline Projections for Education” and then choose “Pell Grant Programs.” This brings you to a PDF report, where you can copy the data table you’re looking for into a format you can actually use (say, Excel). You would need to repeat the exercise to find projections for the 21 other programs for which the CBO provides data.

In another case, the Bureau of Labor Statistics has tried to provide users with query tools that avoid the use of PDFs, but still require extra steps to process. You can get the unemployment rate data through their Java Applet (which doesn’t work on all browsers, by the way), select the various series you want, and click “Get Data.” On the subsequent screen, you are given some basic formatting options, but the default display shows all of your data series as separate Excel files. You can then copy and paste or download each one and then piece them together.

Taking a step closer to the ideal of open data, the Institute of Museum and Library Services (IMLS) followed President Obama’s May 2013 executive order to make their data open in a machine-readable format. That’s great, but it only goes so far. The IMLS platform, for example, allows you to explore information about your own public library. But the data are labeled with variable names such as BRANLIB and BKMOB that are not intuitive or clear. Users then have to find the data dictionary to understand what data fields mean, how they’re defined, and how to use them.
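
One way to picture the gap between coded variable names and readable ones is a small lookup that applies a data dictionary to each record, as sketched below. This is only a sketch under stated assumptions: the two labels are our reading of the codes named above and should be checked against the publisher's own data dictionary, and `LIBNAME_X`, the `humanize` helper and the sample values are all hypothetical.

```python
# Assumed meanings of the coded names; verify against the official data dictionary.
DATA_DICTIONARY = {
    "BRANLIB": "Number of branch libraries",
    "BKMOB": "Number of bookmobiles",
}


def humanize(record, dictionary=DATA_DICTIONARY):
    """Return a copy of the record with coded field names replaced by labels.

    Field names that are not in the dictionary are kept unchanged.
    """
    return {dictionary.get(name, name): value for name, value in record.items()}


# Made-up record; LIBNAME_X is a hypothetical field with no dictionary entry.
raw = {"BRANLIB": 3, "BKMOB": 1, "LIBNAME_X": "Example Library"}
readable = humanize(raw)
```

Publishing the dictionary alongside the data, or applying it before release, spares each user the step of hunting for it.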

These efforts to provide more data represent real progress, but often fail to be useful to the average person. They move from publishing data that are not readable (buried in PDFs or systems that allow the user to see only one record at a time) to data that are machine-readable (libraries of raw data files or APIs, from which data can be extracted using computer code). We now need to move from a world in which data are simply machine-readable to one in which data are human-readable….(More)”

Push, Pull, and Spill: A Transdisciplinary Case Study in Municipal Open Government


New paper by Jan Whittington et al: “Cities hold considerable information, including details about the daily lives of residents and employees, maps of critical infrastructure, and records of the officials’ internal deliberations. Cities are beginning to realize that this data has economic and other value: If done wisely, the responsible release of city information can also release greater efficiency and innovation in the public and private sector. New services are cropping up that leverage open city data to great effect.

Meanwhile, activist groups and individual residents are placing increasing pressure on state and local government to be more transparent and accountable, even as others sound an alarm over the privacy issues that inevitably attend greater data promiscuity. This takes the form of political pressure to release more information, as well as increased requests for information under the many public records acts across the country.

The result of these forces is that cities are beginning to open their data as never before. It turns out there is surprisingly little research to date into the important and growing area of municipal open data. This article is among the first sustained, cross-disciplinary assessments of an open municipal government system. We are a team of researchers in law, computer science, information science, and urban studies. We have worked hand-in-hand with the City of Seattle, Washington for the better part of a year to understand its current procedures from each disciplinary perspective. Based on this empirical work, we generate a set of recommendations to help the city manage risk latent in opening its data….(More)”

What We’ve Learned About Sharing Our Data Analysis


Jeremy Singer-Vine at Source: “Last Friday morning, Jessica Garrison, Ken Bensinger, and I published a BuzzFeed News investigation highlighting the ease with which American employers have exploited and abused a particular type of foreign worker—those on seasonal H–2 visas. The article drew on seven months’ worth of reporting, scores of interviews, hundreds of documents—and two large datasets maintained by the Department of Labor.

That same morning, we published the corresponding data, methodologies, and analytic code on GitHub. This isn’t the first time we’ve open-sourced our data and analysis; far from it. But the H–2 project represents our most ambitious effort yet. In this post, I’ll describe our current thinking on “reproducible data analyses,” and how the H–2 project reflects those thoughts.

What Is “Reproducible Data Analysis”?

It’s helpful to break down a couple of slightly oversimplified definitions. Let’s call “open-sourcing” the act of publishing the raw code behind a software project. And let’s call “reproducible data analysis” the act of open-sourcing the code and data required to reproduce a set of calculations.

Journalism has seen a mini-boom of reproducible data analysis in the past year or two. (It’s far from a novel concept, of course.) FiveThirtyEight publishes data and re-runnable computer code for many of their stories. You can download the brains and brawn behind Leo, the New York Times’ statistical model for forecasting the outcome of the 2014 midterm Senate elections. And if you want to re-run Barron’s magazine’s analysis of SEC Rule 605 reports, you can do that, too. The list goes on.

….

Why Reproducible Data Analysis?

At BuzzFeed News, our main motivation is simple: transparency. If an article includes our own calculations (and they are beyond a grade-schooler’s pen-and-paper calculations), then you should be able to see—and potentially criticize—how we did it…..

There are reasons, of course, not to publish a fully-reproducible analysis. The most obvious and defensible reason: Your data includes Social Security numbers, state secrets, or other sensitive information. Sometimes, you’ll be able to scrub these bits from your data. Other times, you won’t. (A detailed methodology is a good alternative.)

How To Publish Reproducible Data Analysis?

At BuzzFeed News, we’re still figuring out the best way to skin this cat. Other news organizations might arrive at entirely opposite conclusions. That said, here are some tips, based on our experience:

Describe the main data sources, and how you got them. Art appraisers and data-driven reporters agree: Provenance matters. Who collected the data? What universe of things does it quantify? How did you get it?.… (More)”

The New Science of Sentencing


Anna Maria Barry-Jester et al at the Marshall Project: “Criminal sentencing has long been based on the present crime and, sometimes, the defendant’s past criminal record. In Pennsylvania, judges could soon consider a new dimension: the future.

Pennsylvania is on the verge of becoming one of the first states in the country to base criminal sentences not only on what crimes people have been convicted of, but also on whether they are deemed likely to commit additional crimes. As early as next year, judges there could receive statistically derived tools known as risk assessments to help them decide how much prison time — if any — to assign.

Risk assessments have existed in various forms for a century, but over the past two decades, they have spread through the American justice system, driven by advances in social science. The tools try to predict recidivism — repeat offending or breaking the rules of probation or parole — using statistical probabilities based on factors such as age, employment history and prior criminal record. They are now used at some stage of the criminal justice process in nearly every state. Many court systems use the tools to guide decisions about which prisoners to release on parole, for example, and risk assessments are becoming increasingly popular as a way to help set bail for inmates awaiting trial.

But Pennsylvania is about to take a step most states have until now resisted for adult defendants: using risk assessment in sentencing itself. A state commission is putting the finishing touches on a plan that, if implemented as expected, could allow some offenders considered low risk to get shorter prison sentences than they would otherwise or avoid incarceration entirely. Those deemed high risk could spend more time behind bars.

Pennsylvania, which already uses risk assessment in other phases of its criminal justice system, is considering the approach in sentencing because it is struggling with an unwieldy and expensive corrections system. Pennsylvania has roughly 50,000 people in state custody, 2,000 more than it has permanent beds for. Thousands more are in local jails, and hundreds of thousands are on probation or parole. The state spends $2 billion a year on its corrections system — more than 7 percent of the total state budget, up from less than 2 percent 30 years ago. Yet recidivism rates remain high: 1 in 3 inmates is arrested again or reincarcerated within a year of being released.

States across the country are facing similar problems — Pennsylvania’s incarceration rate is almost exactly the national average — and many policymakers see risk assessment as an attractive solution. Moreover, the approach has bipartisan appeal: Among some conservatives, risk assessment appeals to the desire to spend tax dollars on locking up only those criminals who are truly dangerous to society. And some liberals hope a data-driven justice system will be less punitive overall and correct for the personal, often subconscious biases of police, judges and probation officers. In theory, using risk assessment tools could lead to both less incarceration and less crime.

There are more than 60 risk assessment tools in use across the U.S., and they vary widely. But in their simplest form, they are questionnaires — typically filled out by a jail staff member, probation officer or psychologist — that assign points to offenders based on anything from demographic factors to family background to criminal history. The resulting scores are based on statistical probabilities derived from previous offenders’ behavior. A low score designates an offender as “low risk” and could result in lower bail, less prison time or less restrictive probation or parole terms; a high score can lead to tougher sentences or tighter monitoring.
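
In their simplest form, then, these tools amount to a weighted checklist. The sketch below shows that mechanism only. The factors echo the kinds named above (age, employment, prior record), but the specific questions, point values and cut-off are hypothetical and do not correspond to any real instrument.

```python
# Hypothetical questionnaire items and point values; not any real tool.
POINTS = {
    "age_under_25": 2,
    "unemployed": 1,
    "three_or_more_prior_convictions": 3,
}

# Hypothetical cut-off separating the two labels.
HIGH_RISK_THRESHOLD = 4


def risk_score(answers, points=POINTS):
    """Sum the points for every questionnaire item answered True."""
    return sum(points[item] for item, present in answers.items() if present)


def risk_label(score, threshold=HIGH_RISK_THRESHOLD):
    """Map a numeric score onto a coarse label using a fixed cut-off."""
    return "high risk" if score >= threshold else "low risk"


# Made-up answers for one questionnaire.
answers = {
    "age_under_25": True,
    "unemployed": False,
    "three_or_more_prior_convictions": True,
}
score = risk_score(answers)
label = risk_label(score)
```

The sketch makes one design point visible: the label depends entirely on which items are asked, how they are weighted and where the cut-off sits, all of which are choices made by whoever builds the tool.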

The risk assessment trend is controversial. Critics have raised numerous questions: Is it fair to make decisions in an individual case based on what similar offenders have done in the past? Is it acceptable to use characteristics that might be associated with race or socioeconomic status, such as the criminal record of a person’s parents? And even if states can resolve such philosophical questions, there are also practical ones: What to do about unreliable data? Which of the many available tools — some of them licensed by for-profit companies — should policymakers choose?…(More)”

The Data Divide: What We Want and What We Can Get


Craig Adelman and Erin Austin at Living Cities (Read Blog 1): There is no shortage of data. At every level–federal, state, county, city and even within our own organizations–we are collecting and trying to make use of data. Data is a catch-all term that suggests universal access and easy use. The problem? In reality, data is often expensive, difficult to access, created for a single purpose, quickly changing and difficult to weave together. To aid and inform future data-dependent research initiatives, we’ve outlined the common barriers that community development faces when working with data and identified three ways to overcome them.

Common barriers include:

  • Data often comes at a hefty price. …
  • Data can come with restrictions and regulations. …
  • Data is built for a specific purpose, meaning information isn’t always in the same place. …
  • Data can actually be too big. ….
  • Data gaps exist. …
  • Data can be too old. ….

As you can tell, there can be many complications when it comes to working with data, but there is still great value to using and having it. We’ve found a few ways to overcome these barriers when scoping a research project:

1) Prepare to have to move to “Plan B” when trying to get answers that aren’t readily available in the data. It is incredibly important to be able to react to unexpected data conditions and to use proxy datasets when necessary in order to efficiently answer the core research question.

2) Building a data budget for your work is also advisable, as you shouldn’t anticipate that public entities or private firms will give you free data (nor that community development partners will be able to share datasets used for previous studies).

3) Identifying partners—including local governments, brokers, and community development or CDFI partners—is crucial to collecting the information you’ll need….(More)

Confronting the Internet’s Dark Side: Moral and Social Responsibility on the Free Highway


New book by Raphael Cohen-Almagor: “Terrorism, cyberbullying, child pornography, hate speech, cybercrime: along with unprecedented advancements in productivity and engagement, the Internet has ushered in a space for violent, hateful, and antisocial behavior. How do we, as individuals and as a society, protect against dangerous expressions online? Confronting the Internet’s Dark Side is the first book on social responsibility on the Internet. It aims to strike a balance between the free speech principle and the responsibilities of the individual, corporation, state, and the international community. This book brings a global perspective to the analysis of some of the most troubling uses of the Internet. It urges net users, ISPs, and liberal democracies to weigh freedom and security, finding the golden mean between unlimited license and moral responsibility. This judgment is necessary to uphold the very liberal democratic values that gave rise to the Internet and that are threatened by an unbridled use of technology. (More)”

Quantifying Crowd Size with Mobile Phone and Twitter Data


“Being able to infer the number of people in a specific area is of extreme importance for the avoidance of crowd disasters and to facilitate emergency evacuations. Here, using a football stadium and an airport as case studies, we present evidence of a strong relationship between the number of people in restricted areas and activity recorded by mobile phone providers and the online service Twitter. Our findings suggest that data generated through our interactions with mobile phone networks and the Internet may allow us to gain valuable measurements of the current state of society….(More)”