CrowdFlower Launches Open Data Project


Anthony Ha at TechCrunch: “Crowdsourcing company CrowdFlower allows businesses to tap into a distributed workforce of 5 million contributors for basic tasks like sentiment analysis. Today it’s releasing some of that data to the public through its new Data for Everyone initiative…. The hope is to turn CrowdFlower into a central repository where open data can be found by researchers and entrepreneurs. (Factual was another startup trying to become a hub for open data, though in recent years, it’s become more focused on gathering location data to power mobile ads.)…

As for the data that’s available now, …There’s a lot of Twitter sentiment analysis covering attitudes towards brands and products, yogurt (?), and climate change. Among the more recent data sets, I was particularly taken with the gender breakdown of who’s been on the cover of Time magazine and, yes, the analysis of who thought the dress (you know the one) was gold and white versus blue and black…. (More)”
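As a hedged illustration of what can be done with a release like the Twitter sentiment sets, the sketch below tallies sentiment labels from a small CSV in roughly the shape such a data set might take; the column names ("text", "sentiment") and labels are assumptions for illustration, not the actual Data for Everyone schema.

```python
import csv
import io
from collections import Counter

# Toy CSV mimicking a crowdsourced sentiment data set; column names
# and labels are assumed, not the real schema.
sample = """text,sentiment
"Love this yogurt",positive
"Climate change is real",neutral
"Worst brand ever",negative
"Great product",positive
"""

def sentiment_breakdown(csv_text):
    """Tally how many rows carry each sentiment label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["sentiment"] for row in reader)

print(sentiment_breakdown(sample))  # counts per label
```

The same two-line tally scales to a full downloaded file by swapping `io.StringIO(csv_text)` for an `open(...)` call.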

Crowdsourcing America’s cybersecurity is an idea so crazy it might just work


At the Washington Post: “One idea that’s starting to bubble up from Silicon Valley is the concept of crowdsourcing cybersecurity. As Silicon Valley venture capitalist Robert R. Ackerman, Jr. has pointed out, due to “the interconnectedness of our society in cyberspace,” cyber networks are best viewed as an asset that we all have a shared responsibility to protect. Push on that concept hard enough and you can see how many of the core ideas from Silicon Valley – crowdsourcing, open source software, social networking, and the creative commons – can all be applied to cybersecurity.

Silicon Valley venture capitalists are already starting to fund companies that describe themselves as crowdsourcing cybersecurity. For example, take Synack, a “crowd security intelligence” company that received $7.5 million in funding from Kleiner Perkins (one of Silicon Valley’s heavyweight venture capital firms), Allegis Ventures, and Google Ventures in 2014. Synack’s two founders are ex-NSA employees, and they are using that experience to inform an entirely new type of business model. Synack recruits and vets a global network of “white hat hackers,” and then offers their services to companies worried about their cyber networks. For a fee, these hackers are able to find and repair security vulnerabilities.

So how would crowdsourced national cybersecurity work in practice?

For one, there would be free and transparent sharing of computer code used to detect cyber threats between the government and private sector. In December, the U.S. Army Research Lab added a bit of open-source code, a “network forensic analysis framework” known as Dshell, to the mega-popular code sharing site GitHub. Already, there have been 100 downloads and more than 2,000 unique visitors. The goal, says William Glodek of the U.S. Army Research Laboratory, is for this shared code to “help facilitate the transition of knowledge and understanding to our partners in academia and industry who face the same problems.”

This open sourcing of cyber defense would be enhanced with a scaled-up program of recruiting “white hat hackers” to become officially part of the government’s cybersecurity efforts. Popular annual events such as the DEF CON hacking conference could be used to recruit talented cyber sleuths to work alongside the government.

There have already been examples of communities where people facing a common cyber threat gather together to share intelligence. Perhaps the best-known example is the Conficker Working Group, a security coalition that was formed in late 2008 to share intelligence about malicious Conficker malware. Another example is the Financial Services Information Sharing and Analysis Center, which was created by presidential mandate in 1998 to share intelligence about cyber threats to the nation’s financial system.

Of course, there are some drawbacks to this crowdsourcing idea. For one, such a collaborative approach to cybersecurity might open the door to government cyber defenses being infiltrated by the enemy. Ackerman makes the point that you never really know who’s contributing to any community. Even on a site such as GitHub, it’s theoretically possible that an ISIS hacker or someone like Edward Snowden could download the code, reverse engineer it, and then use it to insert “Trojan horses” aimed at military targets into the code….  (More)

If Data Sharing is the Answer, What is the Question?


Christine L. Borgman at ERCIM News: “Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and reproducing research, which would lead to greater accountability, transparency, and less fraud. Arguments against data sharing rarely are expressed in public fora, so popular is the idea. Much of the scholarship on data practices attempts to understand the socio-technical barriers to sharing, with goals to design infrastructures, policies, and cultural interventions that will overcome these barriers.

However, data sharing and reuse are common practice in only a few fields. Astronomy and genomics in the sciences, survey research in the social sciences, and archaeology in the humanities are the typical exemplars, which remain the exceptions rather than the rule. The lack of success of data sharing policies, despite accelerating enforcement over the last decade, indicates the need not just for a much deeper understanding of the roles of data in contemporary science but also for developing new models of scientific practice. Science progressed for centuries without data sharing policies. Why is data sharing deemed so important to scientific progress now? How might scientific practice be different if these policies had been in place several generations ago?

Enthusiasm for “big data” and for data sharing is obscuring the complexity of data in scholarship and the challenges for stewardship. Data practices are local, varying from field to field, individual to individual, and country to country. Studying data is a means to observe how rapidly the landscape of scholarly work in the sciences, social sciences, and the humanities is changing. Inside the black box of data is a plethora of research, technology, and policy issues. Data are best understood as representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship. Rarely do they stand alone, separable from software, protocols, lab and field conditions, and other context. The lack of agreement on what constitutes data underlies the difficulties in sharing, releasing, or reusing research data.

Concerns for data sharing and open access raise broader questions about what data to keep, what to share, when, how, and with whom. Open data is sometimes viewed simply as releasing data without payment of fees. In research contexts, open data may pose complex issues of licensing, ownership, responsibility, standards, interoperability, and legal harmonization. To scholars, data can be assets, liabilities, or both. Data have utilitarian value as evidence, but they also serve social and symbolic purposes for control, barter, credit, and prestige. Incentives for scientific advancement often run counter to those for sharing data.

….

Rather than assume that data sharing is almost always a “good thing” and that doing so will promote the progress of science, more critical questions should be asked: What are the data? What is the utility of sharing or releasing data, and to whom? Who invests the resources in releasing those data and in making them useful to others? When, how, why, and how often are those data reused? Who benefits from what kinds of data transfer, when, and how? What resources must potential re-users invest in discovering, interpreting, processing, and analyzing data to make them reusable? Which data are most important to release, when, by what criteria, to whom, and why? What investments must be made in knowledge infrastructures, including people, institutions, technologies, and repositories, to sustain access to data that are released? Who will make those investments, and for whose benefit?

Only when these questions are addressed by scientists, scholars, data professionals, librarians, archivists, funding agencies, repositories, publishers, policy makers, and other stakeholders in research will satisfactory answers arise to the problems of data sharing…(More)”.

Breaking Public Administrations’ Data Silos. The Case of Open-DAI, and a Comparison between Open Data Platforms.


Paper by Raimondo Iemma, Federico Morando, and Michele Osella: “Open reuse of public data and tools can turn the government into a powerful ‘platform’ that also involves external innovators. However, the typical information system of a public agency is not open by design. Several public administrations have started adopting technical solutions to overcome this issue, typically in the form of middleware layers operating as ‘buses’ between data centres and the outside world. Open-DAI is an open source platform designed to expose data as services, directly pulling from legacy databases of the data holder. The platform is the result of an ongoing project funded under the EU ICT PSP call 2011. We present the rationale and features of Open-DAI, also through a comparison with three other open data platforms: the Socrata Open Data portal, CKAN, and ENGAGE….(More)”

US government and private sector developing ‘precrime’ system to anticipate cyber-attacks


Martin Anderson at The Stack: “The USA’s Office of the Director of National Intelligence (ODNI) is soliciting the involvement of the private and academic sectors in developing a new ‘precrime’ computer system capable of predicting cyber-incursions before they happen, based on the processing of ‘massive data streams from diverse data sets’ – including social media and possibly deanonymised Bitcoin transactions….
At its core, the predictive technologies to be developed in association with the private sector and academia over 3-5 years are charged with the mission ‘to invest in high-risk/high-payoff research that has the potential to provide the U.S. with an overwhelming intelligence advantage over our future adversaries’.
The R&D program is intended to generate completely automated, human-free prediction systems for four categories of event: unauthorised access, Denial of Service (DoS), malicious code, and scans and probes seeking access to systems.
The CAUSE project is an unclassified program, and participating companies and organisations will not be granted access to NSA intercepts. The scope of the project, in any case, seems focused on the analysis of publicly available Big Data, including web searches, social media exchanges and trawling ungovernable avalanches of information in which clues to future maleficent actions are believed to be discernible.
Program manager Robert Rahmer says: “It is anticipated that teams will be multidisciplinary and might include computer scientists, data scientists, social and behavioral scientists, mathematicians, statisticians, content extraction experts, information theorists, and cyber-security subject matter experts having applied experience with cyber capabilities.”
Battelle, one of the organisations interested in participating in CAUSE, proposes employing Hadoop and Apache Spark as an approach to the data mountain, and includes in its preliminary proposal an intent to ‘de-anonymize Bitcoin sale/purchase activity to capture communication exchanges more accurately within threat-actor forums…’.
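Battelle’s proposal gives no technical detail beyond naming Hadoop and Spark, but one widely used first step in Bitcoin de-anonymization is the ‘common input ownership’ heuristic: addresses that co-sign inputs of the same transaction are assumed to belong to one owner. The sketch below is illustrative only, with toy transactions and plain Python standing in for a distributed pipeline.

```python
# Sketch of the common-input-ownership heuristic for clustering Bitcoin
# addresses. Transactions are toy data: each is just a list of input addresses.

class DisjointSet:
    """Union-find structure for merging address clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_addresses(transactions):
    """Group input addresses that ever appear together in one transaction."""
    ds = DisjointSet()
    for inputs in transactions:
        for addr in inputs:
            ds.find(addr)                # register every input address
        for addr in inputs[1:]:
            ds.union(inputs[0], addr)    # co-spent inputs share an owner
    clusters = {}
    for addr in ds.parent:
        clusters.setdefault(ds.find(addr), set()).add(addr)
    return list(clusters.values())

txs = [["addr1", "addr2"], ["addr2", "addr3"], ["addr4"]]
print(cluster_addresses(txs))  # addr1-3 merge into one cluster; addr4 stands alone
```

Because the clusters chain transitively (addr2 links addr1 to addr3), one address linked to a forum identity can unmask an entire cluster, which is presumably what makes the technique attractive to CAUSE participants.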
Identifying and categorising quality signal in the ‘white noise’ of Big Data is a central plank in CAUSE, and IARPA maintains several offices to deal with different aspects of it. Its pointedly named ‘Office for Anticipating Surprise’ frames the CAUSE project best, since it initiated it. The OAS is occupied with ‘Detecting and forecasting the emergence of new technical capabilities’, ‘Early warning of social and economic crises, disease outbreaks, insider threats, and cyber attacks’ and ‘Probabilistic forecasts of major geopolitical trends and rare events’.
Another concerned department is the Office of Incisive Analysis, which is attempting to break down the ‘data static’ problem into manageable mission stages:
1) Large data volumes and varieties – “Providing powerful new sources of information from massive, noisy data that currently overwhelm analysts”
2) Social-Cultural and Linguistic Factors – “Analyzing language and speech to produce insights into groups and organizations.”
3) Improving Analytic Processes – “Dramatic enhancements to the analytic process at the individual and group level.”
The Office of Smart Collection develops ‘new sensor and transmission technologies’, with the pursuit of ‘Innovative approaches to gain access to denied environments’ as part of its core mission, while the Office of Safe and Secure Operations concerns itself with ‘Revolutionary advances in science and engineering to solve problems intractable with today’s computers’.
The CAUSE program, which attracted 150 developers, organisations, academics and private companies to the initial event, will announce specific figures about funding later in the year, and practice ‘predictions’ from participants will begin in the summer, in an accelerating and stage-managed program over five years….(More)”

The Power of Heuristics


ideas42: “People are presented with many choices throughout their day, from what to have for lunch to where to go on vacation to how much money to save for emergencies. In many situations, this ability to choose enhances our lives. However, having too many choices can sometimes feel like a burden, especially if the choices are complex or the decisions we’re making are important. In these instances, we often make poor decisions, or sometimes even fail to choose at all. This can create real problems, for example when people fail to save enough for retirement or don’t make the right choices when it comes to staying healthy.
So why is it that so much effort has been spent trying to improve decision-making by giving people even more information about the choices available – often complicating the choice even further?
In a new ideas42 paper, co-founder Antoinette Schoar of MIT’s Sloan School of Management and ideas42’s Saugato Datta argue that this approach of providing more information to help individuals make better decisions is flawed, “since it does not take into account the psychological or behavioral barriers that prevent people from making better decisions.” The solution, they propose, is using effective rules of thumb, or ‘heuristics’, to “enable people to make ‘reasonably good’ decisions without needing to understand all the complex nuances of the situation.” The paper explores the effectiveness of heuristics as a tool to simplify information during decision-making and help people follow through on their intentions. The authors offer powerful examples of effective heuristics-based methods in three domains: financial education, agriculture, and medicine….(More)”
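To make the shape of the idea concrete, here is a minimal sketch of two financial rules of thumb. These particular rules (save a fixed fraction of pay; keep a few months’ expenses in reserve) are common folk heuristics chosen for illustration, not heuristics taken from the ideas42 paper.

```python
# Illustrative rules of thumb: each takes one or two inputs and returns a
# "reasonably good" answer, with no model of the full decision problem.

def monthly_savings_heuristic(take_home_pay, rate=0.10):
    """Rule of thumb: save a fixed fraction of take-home pay each month."""
    return take_home_pay * rate

def emergency_fund_target(monthly_expenses, months=3):
    """Rule of thumb: hold a buffer of a few months' expenses."""
    return monthly_expenses * months

print(monthly_savings_heuristic(3000))  # 300.0
print(emergency_fund_target(2000))      # 6000
```

A full optimization would need income paths, interest rates, and risk preferences; the heuristic trades that precision for something a person can actually remember and act on, which is exactly the paper’s point.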

Netpolitik: What the Emergence of Networks Means for Diplomacy and Statecraft


Charlie Firestone and Leshuo Dong at the Aspen Journal of Ideas: “…The network is emerging as a dominant form of organization for our age of complexity. This is supported by technological and economic trends. Furthermore, enemies are networks, players are networks, even governments are becoming networks. It makes sense to understand network principles and apply them for use in the world of diplomacy. Accordingly, governments, organizations and individuals should heed these recommendations:

  • Understand and apply two-way communications and network principles to all forms of diplomacy with the aim of earning the sympathy, empathy and where applicable, the loyalty of future generations. This is a mindset shift for governments, diplomats and citizens around the world.
  • This means engaging the world’s populations to communicate with each other. That will entail physical connections to the global common medium, an ability to have what you send be received by others in the form you send it, end to end, and literacy in the communications methods of the day. The world’s population should have a meaningful right to connect.
  • Of course, if there is to be a global communications network, it needs to be safe, so governments remain in the role of protector of the environment needed for users to trust in their networks. States have a role to protect against cyberwar, cybercrimes, and loss of a person’s identity, i.e., security and privacy online. But these protections cannot be a screen for illegitimate governmental controls over or unwarranted surveillance of its citizens. Nor can governments be expected to shoulder that burden alone. Everyone will need to practice a basic level of Net hygiene and literacy as an element of their digital citizenship.

As networks proliferate, principles of netpolitik will emerge. Governments, businesses, non-governmental organizations, and every citizen would be well advised to be thinking in these terms in the years ahead….(More).”

Reclaiming Accountability: Transparency, Executive Power, and the U.S. Constitution


New book by Heidi Kitrosser: “Americans tend to believe in government that is transparent and accountable. Those who govern us work for us, and therefore they must also answer to us. But how do we reconcile calls for greater accountability with the competing need for secrecy, especially in matters of national security? Those two imperatives are usually taken to be antithetical, but Heidi Kitrosser argues convincingly that this is not the case—and that our concern ought to lie not with secrecy, but with the sort of unchecked secrecy that can result from “presidentialism,” or constitutional arguments for broad executive control of information.
In Reclaiming Accountability, Kitrosser traces presidentialism from its start as part of a decades-old legal movement through its appearance during the Bush and Obama administrations, demonstrating its effects on secrecy throughout. Taking readers through the key presidentialist arguments—including “supremacy” and “unitary executive theory”—she explains how these arguments misread the Constitution in a way that is profoundly at odds with democratic principles. Kitrosser’s own reading offers a powerful corrective, showing how the Constitution provides myriad tools, including the power of Congress and the courts to enforce checks on presidential power, through which we could reclaim government accountability….(More)”

Open data could turn Europe’s digital desert into a digital rainforest


Joanna Roberts interviews Dirk Helbing, Professor of Computational Social Science at ETH Zurich, at Horizon: “…‘If we want to be competitive, Europe needs to find its own way. How can we differentiate ourselves and make things better? I believe Europe should not engage in the locked data strategy that we see in all these huge IT giants. Instead, Europe should engage in open data, open innovation, and value-sensitive design, particularly approaches that support informational self-determination. So everyone can use this data, generate new kinds of data, and build applications on top. This is going to create ever more possibilities for everyone else, so in a sense that will turn a digital desert into a digital rainforest full of opportunities for everyone, with a rich information ecosystem.’…
The Internet of Things is the next big emerging information communication technology. It’s based on sensors. In smartphones there are about 15 sensors; for light, for noise, for location, for all sorts of things. You could also buy additional external sensors for humidity, for chemical substances and almost anything that comes to your mind. So basically this allows us to measure the environment and all the features of our physical, biological, economic, social and technological environment.
‘Imagine if there was one company in the world controlling all the sensors and collecting all the information. I think that might potentially be a dystopian surveillance nightmare, because you couldn’t take a single step or speak a single word without it being recorded. Therefore, if we want the Internet of Things to be consistent with a stable democracy then I believe we need to run it as a citizen web, which means to create and manage the planetary nervous system together. The citizens themselves would buy the sensors and activate them or not, would decide themselves what sensor data they would share with whom and for what purpose, so informational self-determination would be at the heart, and everyone would be in control of their own data.’….
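Helbing’s ‘citizen web’ is an architectural idea rather than a specification, but one hedged way to picture informational self-determination is a per-sensor sharing policy held by the data owner. Every name and data structure below is invented for illustration.

```python
# Hypothetical sketch: each citizen's policy says which sensor streams may be
# shared with which recipients; anything not explicitly allowed stays private.

def share_readings(readings, policy, recipient):
    """Return only the sensor readings the owner's policy allows this recipient to see."""
    allowed = {sensor for sensor, recipients in policy.items() if recipient in recipients}
    return {sensor: value for sensor, value in readings.items() if sensor in allowed}

readings = {"location": (47.4, 8.5), "noise_db": 42, "humidity": 0.55}
policy = {
    "noise_db": {"city_noise_map"},                  # noise goes to the city project
    "humidity": {"weather_coop", "city_noise_map"},  # humidity to two recipients
    # "location" appears in no policy entry, so it is never shared
}

print(share_readings(readings, policy, "city_noise_map"))  # noise_db and humidity only
print(share_readings(readings, policy, "ad_network"))      # {} (nothing authorized)
```

The design choice worth noting is the default: a sensor absent from the policy is withheld, which is what keeps control with the citizen rather than the platform.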
A lot of exciting things will become possible. We would have a real-time picture of the world and we could use this data to be more aware of what the implications of our decisions and actions are. We could avoid mistakes and discover opportunities we would otherwise have missed. We will also be able to measure what’s going on in our society and economy and why. In this way, we will eventually identify the hidden forces that determine the success or failure of a company, of our economy or even our society….(More)”

Action-Packed Signs Could Mean Fewer Pedestrian Accidents


Marielle Mondon at Next City: “Action-packed road signs could mean less unfortunate action for pedestrians. More than a year after New York and San Francisco implemented Vision Zero campaigns to increase pedestrian safety, new research shows that warning signs depicting greater movement — think running stick figures, not walking ones — cause fewer pedestrian accidents.
“A sign that evokes more perceived movement increases the observer’s perception of risk, which in turn brings about earlier attention and earlier stopping,” said Ryan Elder, co-author of the new Journal of Consumer Research report. “If you want to grab attention, you need signs that are more dynamic.”

The real U.S. pedestrian sign on the left represents what almost seems to be a casual stroll, while the example on the far right amps up the speed of the walkers.

The study argues that drivers react faster to signs showing greater movement because the threat of a last-minute accident seems more real — and often, a quicker reaction, even by a few seconds, can make a major difference….
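The excerpt doesn’t reproduce the report’s figures, so as a hedged back-of-the-envelope check (speeds and times chosen for illustration, not taken from the study): the distance a car covers during the ‘saved’ reaction time is simply speed multiplied by that time, which shows why even a fraction of a second matters.

```python
# Back-of-the-envelope: stopping margin gained by an earlier reaction.

MPH_TO_MPS = 0.44704  # metres per second per mile-per-hour

def distance_saved(speed_mps, reaction_gain_s):
    """Extra stopping margin, in metres, from reacting sooner."""
    return speed_mps * reaction_gain_s

speed = 30 * MPH_TO_MPS  # ~13.4 m/s at an urban 30 mph
print(round(distance_saved(speed, 0.5), 1))  # metres gained by reacting 0.5 s earlier
```

At 30 mph, a half-second head start buys roughly a car length and a half of extra stopping distance, easily the difference between a near-miss and a collision.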
Another important point in a world where pedestrians can play games with walk signals: Elder’s suggestions seem more noteworthy than whimsical — and not necessarily a contribution to urban cutesification that annoys some city-dwellers….(More)”