Does Open Data Need Journalism?


Paper by Jonathan Stoneman at the Reuters Institute for the Study of Journalism: “The Open Data movement really came into being when President Obama issued his first policy paper, on his first day in office in January 2009. The US government opened up thousands of datasets to scrutiny by the public, by journalists, by policy-makers. Coders and developers were also invited to make the data useful to people and businesses in all manner of ways. Other governments across the globe followed suit, opening up data to their populations.

Opening data in this way has not resulted in genuine openness, save in a few isolated cases. In the USA and a few European countries, developers have created apps and websites which draw on Open Data, but these are not reaching a mass audience.

At the same time, journalists are not seen by government as the end users of these data. Data releases, even in the best cases, are uneven and slow, and do not meet the needs of journalists. Although thousands of journalists have been learning and adopting the new skills of data journalism, they have tended to work with data obtained through Freedom of Information (FOI) legislation.

Stories which have resulted from data journalists’ efforts have rarely been front-page news; in many cases data-driven stories have ended up as lesser stories on inside pages, or as infographics, which relatively few people look at.

In this context, therefore, Open Data remains outside the mainstream of journalism, and out of the consciousness of the electorate, begging the question, “what are Open Data for?”, or as one developer put it – “if Open Data is the answer, what was the question?” Openness is seen as a badge of honour – scores of national governments have signed pledges to make data open, often repeating the same kind of idealistic official language as the previous announcement of a conversion to openness. But these acts are “top down”, and soon run out of momentum, becoming simply openness for its own sake. Looking at specific examples, the United States is the nearest to a success story: there is a rich ecosystem – made up of government departments, interest groups and NGOs, the media, civil society – which allows data driven projects the space to grow and the airtime to make an impact. (It probably helped that the media in the US were facing an existential challenge urgent enough to force them to embrace new, inexpensive, ways of carrying out investigative reporting).

Elsewhere data are making less impact on journalism. In the UK the new openness is being exploited by a small minority. Where data are published on the data.gov.uk website, they are frequently out of date, incomplete, or of limited news value, so where data do drive stories, these tend to be data released under FOI legislation, and the resulting stories take the form of statistics and/or infographics.

In developing countries where Open Data Portals have been launched with a fanfare – such as Kenya, and more recently Burkina Faso – there has been little uptake by coders, journalists, or citizens; the number of fresh datasets being published drops to a trickle, and the small, apparently randomly selected datasets that do appear are soon well out of date as inertia sets in.

The British Conservative Party, pledging greater openness in its 2010 manifesto, foresaw armies of “Armchair Auditors” who would comb through the data and present the government with ideas for greater efficiency in the use of public funds. Almost needless to say, these armies have never materialised: in countries like Britain, thousands of datasets are being published but probably go unread and unscrutinised by anybody. At the same time, the journalists who want to make use of data are getting what they need through FOI, or even by gathering data themselves. Open Data is thus being bypassed, and could become an irrelevance. Yet the media could be vital agents in the quest for the release of meaningful, relevant, timely data.

Governments seem in no hurry to expand the “comfort zone” from which they release the data which show their policies at their most effective, keeping to themselves data which paint a gloomier picture. Journalists seem likely to remain in their comfort zone, where they make use of FOI and traditional sources of information. For their part, journalists should push for better data and use it more, working in collaboration with open data activists. They need to change the habits of a lifetime and discuss their sources: revealing the source and quality of data used in a story would in itself be as much a part of the advocacy as of the actual reporting.

If Open Data are to be part of a new system of democratic accountability, they need to be more than a gesture of openness. Nor should Open Data remain largely the preserve of companies using them for commercial purposes. Governments should improve the quality and relevance of published data, making them genuinely useful for journalists and citizens alike….(More)”

Peer review in 2015: A global view


A white paper by Taylor & Francis: “Within the academic community, peer review is widely recognized as being at the heart of scholarly research. However, faith in peer review’s integrity is of ongoing and increasing concern to many. It is imperative that publishers (and academic editors) of peer-reviewed scholarly research learn from each other, working together to improve practices in areas such as ethical issues, training, and data transparency….Key findings:

  • Authors, editors and reviewers all agreed that the most important motivation to publish in peer reviewed journals is making a contribution to the field and sharing research with others.
  • Playing a part in the academic process and improving papers are the most important motivations for reviewers. Similarly, 90% of SAS study respondents said that playing a role in the academic community was a motivation to review.
  • Most researchers, across the humanities and social sciences (HSS) and science, technology and medicine (STM), rate the benefit of the peer review process towards improving their article as 8 or above out of 10. This was found to be the most important aspect of peer review in both the ideal and the real world, echoing the earlier large-scale peer review studies.
  • In an ideal world, there is agreement that peer review should detect plagiarism (with mean ratings of 7.1 for HSS and 7.5 for STM out of 10), but agreement that peer review is currently achieving this in the real world is only 5.7 HSS / 6.3 STM out of 10.
  • Researchers thought there was a low prevalence of gender bias but a higher prevalence of regional and seniority bias – and suggested that double-blind peer review is most capable of preventing reviewer discrimination based on an author’s identity.
  • Most researchers wait between one and six months for an article they’ve written to undergo peer review, yet authors (not reviewers/editors) think up to two months is reasonable.
  • HSS authors say they are kept less well informed than STM authors about the progress of their article through peer review….(More)”

Government as a Platform: a historical and architectural analysis


Paper by Bendik Bygstad and Francis D’Silva: “A national administration is dependent on its archives and registers for many purposes, such as tax collection, enforcement of law, economic governance, and welfare services. Today, these services are based on large digital infrastructures, which grow organically in volume and scope. Building on a critical realist approach, we investigate a particularly successful infrastructure in Norway called Altinn, and ask: what are the evolutionary mechanisms for a successful “government as a platform”? We frame our study with two perspectives: a historical institutional perspective that traces the roots of Altinn back to the Middle Ages, and an architectural perspective that allows for a more detailed analysis of the consequences of digitalization and the role of platforms. We offer two insights from our study: we identify three evolutionary mechanisms of national registers, and we discuss a future scenario of government platforms as “digital commons”…(More)”

Politics and the New Machine


Jill Lepore in the New Yorker on “What the turn from polls to data science means for democracy”: “…The modern public-opinion poll has been around since the Great Depression, when the response rate—the number of people who take a survey as a percentage of those who were asked—was more than ninety per cent. The participation rate—the number of people who take a survey as a percentage of the population—is far lower. Election pollsters sample only a minuscule portion of the electorate, not uncommonly something on the order of a couple of thousand people out of the more than two hundred million Americans who are eligible to vote. The promise of this work is that the sample is exquisitely representative. But the lower the response rate, the harder and more expensive it becomes to realize that promise, which requires both calling many more people and trying to correct for “non-response bias” by giving greater weight to the answers of people from demographic groups that are less likely to respond. Pollster.com’s Mark Blumenthal has recalled how, in the nineteen-eighties, when the response rate at the firm where he was working had fallen to about sixty per cent, people in his office said, “What will happen when it’s only twenty? We won’t be able to be in business!” A typical response rate is now in the single digits.
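
As a rough illustration of the weighting correction Lepore describes, the sketch below re-weights a toy sample so that under-responding groups count for more. The groups, response counts, and support figures are invented, and real pollsters’ adjustments are considerably more elaborate.

```python
# Toy illustration of non-response weighting: respondents from demographic
# groups that are less likely to answer get proportionally more weight.
# All groups and numbers are invented for illustration only.

population_share = {"young": 0.30, "middle": 0.40, "older": 0.30}
respondents      = {"young": 60,   "middle": 180,  "older": 260}   # who actually answered
support          = {"young": 0.40, "middle": 0.50, "older": 0.65}  # share backing a candidate

total = sum(respondents.values())

# Unweighted estimate: the over-responding older group dominates the sample.
unweighted = sum(respondents[g] * support[g] for g in respondents) / total

# Weight each group so the sample matches the population's demographic mix.
weights = {g: population_share[g] / (respondents[g] / total) for g in respondents}
weighted = (sum(respondents[g] * weights[g] * support[g] for g in respondents)
            / sum(respondents[g] * weights[g] for g in respondents))

print(f"unweighted estimate: {unweighted:.3f}, weighted estimate: {weighted:.3f}")
```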

Meanwhile, polls are wielding greater influence over American elections than ever….

Still, data science can’t solve the biggest problem with polling, because that problem is neither methodological nor technological. It’s political. Pollsters rose to prominence by claiming that measuring public opinion is good for democracy. But what if it’s bad?

A “poll” used to mean the top of your head. Ophelia says of Polonius, “His beard as white as snow: All flaxen was his poll.” When voting involved assembling (all in favor of Smith stand here, all in favor of Jones over there), counting votes required counting heads; that is, counting polls. Eventually, a “poll” came to mean the count itself. By the nineteenth century, to vote was to go “to the polls,” where, more and more, voting was done on paper. Ballots were often printed in newspapers: you’d cut one out and bring it with you. With the turn to the secret ballot, beginning in the eighteen-eighties, the government began supplying the ballots, but newspapers kept printing them; they’d use them to conduct their own polls, called “straw polls.” Before the election, you’d cut out your ballot and mail it to the newspaper, which would make a prediction. Political parties conducted straw polls, too. That’s one of the ways the political machine worked….

Ever since Gallup, two things have been called polls: surveys of opinions and forecasts of election results. (Plenty of other surveys, of course, don’t measure opinions but instead concern status and behavior: Do you own a house? Have you seen a doctor in the past month?) It’s not a bad idea to reserve the term “polls” for the kind meant to produce election forecasts. When Gallup started out, he was skeptical about using a survey to forecast an election: “Such a test is by no means perfect, because a preelection survey must not only measure public opinion in respect to candidates but must also predict just what groups of people will actually take the trouble to cast their ballots.” Also, he didn’t think that predicting elections constituted a public good: “While such forecasts provide an interesting and legitimate activity, they probably serve no great social purpose.” Then why do it? Gallup conducted polls only to prove the accuracy of his surveys, there being no other way to demonstrate it. The polls themselves, he thought, were pointless…

If public-opinion polling is the child of a strained marriage between the press and the academy, data science is the child of a rocky marriage between the academy and Silicon Valley. The term “data science” was coined in 1960, one year after the Democratic National Committee hired Simulmatics Corporation, a company founded by Ithiel de Sola Pool, a political scientist from M.I.T., to provide strategic analysis in advance of the upcoming Presidential election. Pool and his team collected punch cards from pollsters who had archived more than sixty polls from the elections of 1952, 1954, 1956, 1958, and 1960, representing more than a hundred thousand interviews, and fed them into a UNIVAC. They then sorted voters into four hundred and eighty possible types (for example, “Eastern, metropolitan, lower-income, white, Catholic, female Democrat”) and sorted issues into fifty-two clusters (for example, foreign aid). Simulmatics’ first task, completed just before the Democratic National Convention, was a study of “the Negro vote in the North.” Its report, which is thought to have influenced the civil-rights paragraphs added to the Party’s platform, concluded that between 1954 and 1956 “a small but significant shift to the Republicans occurred among Northern Negroes, which cost the Democrats about 1 per cent of the total votes in 8 key states.” After the nominating convention, the D.N.C. commissioned Simulmatics to prepare three more reports, including one that involved running simulations about different ways in which Kennedy might discuss his Catholicism….

Data science may well turn out to be as flawed as public-opinion polling. But a stage in the development of any new tool is to imagine that you’ve perfected it, in order to ponder its consequences. I asked Hilton to suppose that there existed a flawless tool for measuring public opinion, accurately and instantly, a tool available to voters and politicians alike. Imagine that you’re a member of Congress, I said, and you’re about to head into the House to vote on an act—let’s call it the Smeadwell-Nutley Act. As you do, you use an app called iThePublic to learn the opinions of your constituents. You oppose Smeadwell-Nutley; your constituents are seventy-nine per cent in favor of it. Your constituents will instantly know how you’ve voted, and many have set up an account with Crowdpac to make automatic campaign donations. If you vote against the proposed legislation, your constituents will stop giving money to your reëlection campaign. If, contrary to your convictions but in line with your iThePublic, you vote for Smeadwell-Nutley, would that be democracy? …(More)”

Push, Pull, and Spill: A Transdisciplinary Case Study in Municipal Open Government


Paper by Jan Whittington et al: “Cities hold considerable information, including details about the daily lives of residents and employees, maps of critical infrastructure, and records of the officials’ internal deliberations. Cities are beginning to realize that this data has economic and other value: If done wisely, the responsible release of city information can also release greater efficiency and innovation in the public and private sector. New services are cropping up that leverage open city data to great effect.

Meanwhile, activist groups and individual residents are placing increasing pressure on state and local government to be more transparent and accountable, even as others sound an alarm over the privacy issues that inevitably attend greater data promiscuity. This takes the form of political pressure to release more information, as well as increased requests for information under the many public records acts across the country.

The result of these forces is that cities are beginning to open their data as never before. It turns out there is surprisingly little research to date into the important and growing area of municipal open data. This article is among the first sustained, cross-disciplinary assessments of an open municipal government system. We are a team of researchers in law, computer science, information science, and urban studies. We have worked hand-in-hand with the City of Seattle, Washington for the better part of a year to understand its current procedures from each disciplinary perspective. Based on this empirical work, we generate a set of recommendations to help the city manage risk latent in opening its data….(More)”

A multi-source dataset of urban life in the city of Milan and the Province of Trentino


Paper by Gianni Barlacchi et al in Scientific Data/Nature: “The study of socio-technical systems has been revolutionized by the unprecedented amount of digital records that are constantly being produced by human activities such as accessing Internet services, using mobile devices, and consuming energy and knowledge. In this paper, we describe the richest open multi-source dataset ever released on two geographical areas. The dataset is composed of telecommunications, weather, news, social networks and electricity data from the city of Milan and the Province of Trentino. The unique multi-source composition of the dataset makes it an ideal testbed for methodologies and approaches aimed at tackling a wide range of problems including energy consumption, mobility planning, tourist and migrant flows, urban structures and interactions, event detection, urban well-being and many others….(More)”

Distinguishing ‘Crowded’ Organizations from Groups and Communities: Is Three a Crowd?


Paper by Gianluigi Viscusi and Christopher L. Tucci: “In conventional wisdom on crowdsourcing, the number of people defines the crowd, and maximizing this number is often assumed to be the goal of any crowdsourcing exercise. However, we propose that there are structural characteristics of the crowd that might be more important than the sheer number of participants. These characteristics include (1) growth rate and its attractiveness to the members, (2) the equality among members, (3) the density within provisional boundaries, (4) the goal orientation of the crowd, and (5) the “seriality” of the interactions between members of the crowd. We then propose a typology that may allow managers to position their companies’ initiatives among four strategic types: crowd crystals, online communities, closed crowd, and open crowd-driven innovation. We show that incumbent companies may prefer a closed and controlled access to the crowd, limiting the potential for gaining results and insights from fully open crowd-driven innovation initiatives. Consequently, we argue that the effects of open crowds on industries and organizations are still to be explored, possibly via the mechanisms of entrepreneurs exploiting open crowds as new entrants, but also for the configuration of industries such as finance, pharmaceuticals, or even the public sector, where the value created usually comes from interpretation issues and exploratory problem solving…(More).”

When Lobbyists Write Legislation, This Data Mining Tool Traces The Paper Trail


FastCoExist: “Most kids learn the grade school civics lesson about how a bill becomes a law. What those lessons usually neglect to show is how legislation today is often birthed on a lobbyist’s desk.

But even for expert researchers, journalists, and government transparency groups, tracing a bill’s lineage isn’t easy—especially at the state level. Last year alone, there were 70,000 state bills introduced in 50 states. It would take one person five weeks just to read them all. Groups that do track state legislation usually focus narrowly on a single topic, such as abortion, or perhaps a single lobby group.

Computers can do much better. A prototype tool, presented in September at Bloomberg’s Data for Good Exchange 2015 conference, mines the Sunlight Foundation’s database of more than 500,000 bills and 200,000 resolutions for the 50 states from 2007 to 2015. It also compares them to 1,500 pieces of “model legislation” written by a few lobbying groups that made their work available, such as the conservative group ALEC (American Legislative Exchange Council) and the liberal group the State Innovation Exchange (formerly called ALICE).
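
The article doesn’t say how the prototype matches bills to model legislation, but the core task can be sketched with a generic text-similarity approach. The TF-IDF/cosine method and the bill snippets below are illustrative assumptions, not the team’s actual pipeline or data.

```python
# Sketch: flag state bills whose text closely resembles lobbying groups'
# model legislation, using TF-IDF cosine similarity. An illustrative guess
# at the kind of approach such a tool might take, not the prototype's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

model_bills = {   # invented stand-ins for model legislation texts
    "MODEL-A": "An act relating to limits on municipal broadband networks",
    "MODEL-B": "An act establishing paid sick leave requirements for employers",
}
state_bills = {   # invented stand-ins for introduced state bills
    "IL-HB-0001": "An act concerning limits on municipal broadband networks",
    "NY-S-0002": "An act to establish paid sick leave for all employees",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(list(model_bills.values()) + list(state_bills.values()))

n_models = len(model_bills)
scores = cosine_similarity(tfidf[n_models:], tfidf[:n_models])  # state bills vs. models

for bill_id, row in zip(state_bills, scores):
    best = row.argmax()
    print(f"{bill_id}: closest model {list(model_bills)[best]} (similarity {row[best]:.2f})")
```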

The results are interesting. In one example of the program in use, the team—all from the Data Science for Social Good fellowship program in Chicago—created a graphic (above) that presents the relative influence of ALEC and ALICE in different states. The thickness of each line in the graphic correlates to the percentage of bills introduced in each state that are modeled on either group’s legislation. So a relatively liberal state like New York is mostly ALICE bills, while a “swing” state like Illinois has a lot from both groups….

Along with researchers from the University of Chicago, Wikimedia Foundation, Microsoft Research, and Northwestern University, Walsh is also co-author of another paper presented at the Bloomberg conference that shows how data science can increase government transparency.

Walsh and these co-authors developed software that automatically identifies earmarks in U.S. Congressional bills, showing how representatives are benefiting their own states with pork barrel projects. They verified that it works by comparing it to the results of a massive effort from the U.S. Office of Management and Budget to analyze earmarks for a few limited years. Their results, extended back to 1995 in a public database, showed that there may be many more earmarks than anyone thought.

“Governments are making more data available. It’s something like a needle in a haystack problem, trying to extract all that information out,” says Walsh. “Both of these projects are really about shining light to these dark places where we don’t know what’s going on.”

The state legislation tracker data is available for download here, and the team is working on an expanded system that automatically downloads new state legislation so it can stay up to date…(More)”

How open company data was used to uncover the powerful elite benefiting from Myanmar’s multi-billion dollar jade industry


OpenCorporates: “Today, we’re pleased to release a white paper on how OpenCorporates data was used to uncover the powerful elite benefiting from Myanmar’s multi-billion dollar jade industry, in a ground-breaking report from Global Witness. This investigation is an important case study in how open company data and identifiers are a critical tool for uncovering corruption and the links between companies and the real people benefitting from them.

This white paper shows not only how critical it was that OpenCorporates held this information (much of which was removed from the official register during the investigation), but also that the fact that it was machine-readable data, available via an API (data service), and programmatically combinable with other data was essential to discovering the hidden connections between the key actors and the jade industry. Global Witness was able to analyse this data with the help of Open Knowledge.
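
As an illustration of what “programmatically combinable” means in practice, the sketch below queries OpenCorporates for companies by name over HTTP. The endpoint path, API version, parameters, and response fields are assumptions that may differ from the current API, and the search term is a placeholder rather than a finding from the investigation.

```python
# Sketch of programmatic access to company data of the kind the white paper
# points to. The endpoint path, version, parameters, and response fields are
# assumptions and may not match the current OpenCorporates API; a real
# integration would follow the published API docs and use an API token.
import requests

BASE = "https://api.opencorporates.com/v0.4"  # assumed API version


def search_companies(query, jurisdiction=None, api_token=None):
    """Query the (assumed) company search endpoint and return result dicts."""
    params = {"q": query}
    if jurisdiction:
        params["jurisdiction_code"] = jurisdiction
    if api_token:
        params["api_token"] = api_token
    resp = requests.get(f"{BASE}/companies/search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", {}).get("companies", [])


if __name__ == "__main__":
    # Hypothetical query; results depend entirely on the live register data.
    for item in search_companies("gems", jurisdiction="mm"):
        company = item.get("company", {})
        print(company.get("name"), "-", company.get("company_number"))
```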

In this white paper, we make recommendations about the collection and publishing of statutory company information as open data to facilitate the creation of a hostile environment for corruption by providing a rigorous framework for public scrutiny and due diligence.

You can find the white paper here or read it on Medium.”

Can Mobile Phone Surveys Identify People’s Development Priorities?


Ben Leo and Robert Morello at the Center for Global Development: “Mobile phone surveys are fast, flexible, and cheap. But, can they be used to engage citizens on how billions of dollars in donor and government resources are spent? Over the last decade, donor governments and multilateral organizations have repeatedly committed to support local priorities and programs. Yet, how are they supposed to identify these priorities on a timely, regular basis? Consistent discussions with the local government are clearly essential, but so is feeding ordinary people’s views into those discussions. However, traditional tools, such as household surveys or consultative roundtables, present a range of challenges for high-frequency citizen engagement. That’s where mobile phone surveys could come in, enabled by the exponential rise in mobile coverage throughout the developing world.

Despite this potential, there have been only a handful of studies into whether mobile surveys are a reliable and representative tool across a broad range of developing-country contexts. Moreover, there have been almost none that specifically look at collecting information about people’s development priorities. Along with Tiago Peixoto, Steve Davenport, and Jonathan Mellon, who focus on promoting citizen engagement and open government practices at the World Bank, we sought to address this policy research gap. Through a study focused on four low-income countries (Afghanistan, Ethiopia, Mozambique, and Zimbabwe), we rigorously tested the feasibility of interactive voice response (IVR) surveys for gauging citizens’ development priorities.

Specifically, we wanted to know whether respondents’ answers are sensitive to a range of different factors, such as (i) the specified executing actor (national government or external partners); (ii) time horizons; or (iii) question formats. In other words, can we be sufficiently confident that surveys about people’s priorities can be applied more generally to a range of development actors and across a range of country contexts?
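
The working paper sets out the actual design; as a bare-bones illustration of what such a sensitivity test involves, the sketch below compares the distribution of stated priorities across two hypothetical question framings with a chi-square test. The priority categories and counts are invented.

```python
# Toy sensitivity check: do stated priorities shift when the question names a
# different executing actor (national government vs. external partners)?
# Categories and counts are invented; the real analysis is in the paper.
from scipy.stats import chi2_contingency

priorities = ["jobs", "health", "education", "infrastructure"]

# Number of respondents choosing each priority under two hypothetical framings.
framing_government = [310, 240, 190, 160]
framing_partners = [295, 250, 200, 155]

for name, g, p in zip(priorities, framing_government, framing_partners):
    print(f"{name:15s} government-framed: {g:4d}  partner-framed: {p:4d}")

chi2, p_value, dof, expected = chi2_contingency([framing_government, framing_partners])
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f} (df = {dof})")

# A large p-value is consistent with priorities not depending on the named
# actor; a small one would point to framing effects worth investigating.
```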

Several of these potential sensitivity concerns were raised in response to an earlier CGD working paper, which found that US foreign aid is only modestly aligned with Africans’ and Latin Americans’ most pressing concerns. This analysis relied upon Afrobarometer and Latinobarometro survey data (see explanatory note below). For instance, some argued that people’s priorities for their own government might be far less relevant for donor organizations. Put differently, the World Bank or USAID shouldn’t prioritize job creation in Nigeria simply because ordinary Nigerians cite it as a pressing government priority. Our hypothesis was that development priorities would likely transcend all development actors, and possibly different timeframes and question formats as well. But, we first needed to test these assumptions.

So, what did we find? We’ve included some of the key highlights below. For a more detailed description of the study and the underlying analysis, please see our new working paper. Along with our World Bank colleagues, we also published an accompanying paper that considers a range of survey method issues, including survey representativeness….(More)”