What we can learn from the failure of Google Flu Trends


David Lazer and Ryan Kennedy at Wired: “….The issue of using big data for the common good is far more general than Google—which deserves credit, after all, for offering the occasional peek at their data. These records exist because of a compact between individual consumers and the corporation. The legalese of that compact is typically obscure (how many people carefully read terms and conditions?), but the essential bargain is that the individual gets some service, and the corporation gets some data.

What is left out of that bargain is the public interest. Corporations and consumers are part of a broader society, and many of these big data archives offer insights that could benefit us all. As Eric Schmidt, CEO of Google, has said, “We must remember that technology remains a tool of humanity.” How can we, and corporate giants, then use these big data archives as a tool to serve humanity?

Google’s sequel to GFT, done right, could serve as a model for collaboration around big data for the public good. Google is making flu-related search data available to the CDC as well as select research groups. A key question going forward will be whether Google works with these groups to improve the methodology underlying GFT. Future versions should, for example, continually update the fit of the data to flu prevalence—otherwise, the value of the data stream will rapidly decay.

This is just an example, however, of the general challenge of how to build models of collaboration amongst industry, government, academics, and general do-gooders to use big data archives to produce insights for the public good. This came to the fore with the struggle (and delay) in finding a way to appropriately share mobile phone data in West Africa during the Ebola epidemic (mobile phone data are likely the best tool for understanding human—and thus Ebola—movement). Companies need to develop efforts to share data for the public good in a fashion that respects individual privacy.

There is not going to be a single solution to this issue, but for starters, we are pushing for a “big data” repository in Boston to allow holders of sensitive big data to share those collections with researchers while keeping them totally secure. The UN has its Global Pulse initiative, setting up collaborative data repositories around the world. Flowminder, based in Sweden, is a nonprofit dedicated to gathering mobile phone data that could help in response to disasters. But these are still small, incipient, and fragile efforts.

The question going forward now is how to build on and strengthen these efforts, while still guarding the privacy of individuals and the proprietary interests of the holders of big data….(More)”

Researchers wrestle with a privacy problem


Erika Check Hayden at Nature: “The data contained in tax returns, health and welfare records could be a gold mine for scientists — but only if they can protect people’s identities….In 2011, six US economists tackled a question at the heart of education policy: how much does great teaching help children in the long run?

They started with the records of more than 11,500 Tennessee schoolchildren who, as part of an experiment in the 1980s, had been randomly assigned to high- and average-quality teachers between the ages of five and eight. Then they gauged the children’s earnings as adults from federal tax returns filed in the 2000s. The analysis showed that the benefits of a good early education last for decades: each year of better teaching in childhood boosted an individual’s annual earnings by some 3.5% on average. Other data showed the same individuals besting their peers on measures such as university attendance, retirement savings, marriage rates and home ownership.

The economists’ work was widely hailed in education-policy circles, and US President Barack Obama cited it in his 2012 State of the Union address when he called for more investment in teacher training.

But for many social scientists, the most impressive thing was that the authors had been able to examine US federal tax returns: a closely guarded data set that was then available to researchers only with tight restrictions. This has made the study an emblem for both the challenges and the enormous potential power of ‘administrative data’ — information collected during routine provision of services, including tax returns, records of welfare benefits, data on visits to doctors and hospitals, and criminal records. Unlike Internet searches, social-media posts and the rest of the digital trails that people establish in their daily lives, administrative data cover entire populations with minimal self-selection effects: in the US census, for example, everyone sampled is required by law to respond and tell the truth.

This puts administrative data sets at the frontier of social science, says John Friedman, an economist at Brown University in Providence, Rhode Island, and one of the lead authors of the education study. “They allow researchers to not just get at old questions in a new way,” he says, “but to come at problems that were completely impossible before.”….

But there is also concern that the rush to use these data could pose new threats to citizens’ privacy. “The types of protections that we’re used to thinking about have been based on the twin pillars of anonymity and informed consent, and neither of those hold in this new world,” says Julia Lane, an economist at New York University. In 2013, for instance, researchers showed that they could uncover the identities of supposedly anonymous participants in a genetic study simply by cross-referencing their data with publicly available genealogical information.
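
The cross-referencing risk Lane describes is easy to illustrate. Below is a minimal sketch of a linkage attack in Python (pandas); the records, column names and quasi-identifiers (ZIP code, birth year, sex) are invented for illustration and are not taken from the genetic study mentioned above.

```python
import pandas as pd

# Hypothetical "anonymized" research extract: names removed, but
# quasi-identifiers (ZIP code, birth year, sex) retained.
study = pd.DataFrame({
    "zip": ["02139", "02139", "73301"],
    "birth_year": [1962, 1985, 1962],
    "sex": ["F", "M", "F"],
    "diagnosis": ["condition_a", "condition_b", "condition_c"],
})

# Hypothetical public dataset (a voter roll, genealogy site, etc.)
# carrying names alongside the same quasi-identifiers.
public = pd.DataFrame({
    "name": ["Alice Example", "Bob Example"],
    "zip": ["02139", "02139"],
    "birth_year": [1962, 1985],
    "sex": ["F", "M"],
})

# Joining on the shared quasi-identifiers re-attaches names to records
# that were released as "anonymous".
reidentified = study.merge(public, on=["zip", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```

In practice a handful of coarse attributes is often enough to single out individuals, which is why, as Lane notes, anonymity on its own no longer holds.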

Many people are looking for ways to address these concerns without inhibiting research. Suggested solutions include policy measures, such as an international code of conduct for data privacy, and technical methods that allow the use of the data while protecting privacy. Crucially, notes Lane, although preserving privacy sometimes complicates researchers’ lives, it is necessary to uphold the public trust that makes the work possible.

“Difficulty in access is a feature, not a bug,” she says. “It should be hard to get access to data, but it’s very important that such access be made possible.” Many nations collect administrative data on a massive scale, but only a few, notably in northern Europe, have so far made it easy for researchers to use those data.

In Denmark, for instance, every newborn child is assigned a unique identification number that tracks his or her lifelong interactions with the country’s free health-care system and almost every other government service. In 2002, researchers used data gathered through this identification system to retrospectively analyse the vaccination and health status of almost every child born in the country from 1991 to 1998 — 537,000 in all. At the time, it was the largest study ever to disprove the now-debunked link between measles vaccination and autism.

Other countries have begun to catch up. In 2012, for instance, Britain launched the unified UK Data Service to facilitate research access to data from the country’s census and other surveys. A year later, the service added a new Administrative Data Research Network, which has centres in England, Scotland, Northern Ireland and Wales to provide secure environments for researchers to access anonymized administrative data.

In the United States, the Census Bureau has been expanding its network of Research Data Centers, which currently includes 19 sites around the country at which researchers with the appropriate permissions can access confidential data from the bureau itself, as well as from other agencies. “We’re trying to explore all the available ways that we can expand access to these rich data sets,” says Ron Jarmin, the bureau’s assistant director for research and methodology.

In January, a group of federal agencies, foundations and universities created the Institute for Research on Innovation and Science at the University of Michigan in Ann Arbor to combine university and government data and measure the impact of research spending on economic outcomes. And in July, the US House of Representatives passed a bipartisan bill to study whether the federal government should provide a central clearing house of statistical administrative data.

Yet vast swathes of administrative data are still inaccessible, says George Alter, director of the Inter-university Consortium for Political and Social Research based at the University of Michigan, which serves as a data repository for approximately 760 institutions. “Health systems, social-welfare systems, financial transactions, business records — those things are just not available in most cases because of privacy concerns,” says Alter. “This is a big drag on research.”…

Many researchers argue, however, that there are legitimate scientific uses for such data. Jarmin says that the Census Bureau is exploring the use of data from credit-card companies to monitor economic activity. And researchers funded by the US National Science Foundation are studying how to use public Twitter posts to keep track of trends in phenomena such as unemployment.


….Computer scientists and cryptographers are experimenting with technological solutions. One, called differential privacy, adds a small amount of distortion to a data set, so that querying the data gives a roughly accurate result without revealing the identity of the individuals involved. The US Census Bureau uses this approach for its OnTheMap project, which tracks workers’ daily commutes.

….In any case, although synthetic data potentially solve the privacy problem, there are some research applications that cannot tolerate any noise in the data. A good example is the work showing the effect of neighbourhood on earning potential, which was carried out by Raj Chetty, an economist at Harvard University in Cambridge, Massachusetts. Chetty needed to track specific individuals to show that the areas in which children live their early lives correlate with their ability to earn more or less than their parents. In subsequent studies, Chetty and his colleagues showed that moving children from resource-poor to resource-rich neighbourhoods can boost their earnings in adulthood, proving a causal link.
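
To make the differential-privacy idea above concrete, here is a minimal sketch of the Laplace mechanism that typically underpins it: a counting query is answered with noise scaled to the query’s sensitivity, so aggregates stay roughly accurate while any one person’s contribution is masked. The data and the epsilon value are illustrative assumptions, and this is not the Census Bureau’s actual OnTheMap implementation.

```python
import numpy as np

def private_count(values, predicate, epsilon=0.5, rng=None):
    """Answer "how many records satisfy predicate?" with Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy for that query.
    """
    if rng is None:
        rng = np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative data: one-way commute distances (km) for a small workforce.
commutes = [3.2, 11.5, 7.8, 25.0, 4.1, 16.3, 9.9]

# Roughly accurate for aggregate questions (true answer is 3)...
print(private_count(commutes, lambda d: d > 10))

# ...but the same noise swamps very small subgroups (true answer is 1),
# which is the tension for studies that cannot tolerate noisy data.
print(private_count(commutes, lambda d: d > 24))
```

The second query shows why analyses that hinge on exact individual-level records, like the neighbourhood work described above, cannot simply accept this kind of noise.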

Secure multiparty computation is a technique that attempts to address this issue by allowing multiple data holders to analyse parts of the total data set, without revealing the underlying data to each other. Only the results of the analyses are shared….(More)”
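
As a rough intuition for how this can work, the sketch below uses additive secret sharing, one building block of secure multiparty computation: each data holder splits its private value into random shares, each party sums only the shares it receives, and combining those partial sums reveals the aggregate but nothing else. It is a toy single-sum protocol under simplifying assumptions (honest parties, secure channels), not a production MPC framework.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Three hypothetical data holders, each with a private total
# (say, patient counts) they are unwilling to reveal directly.
private_values = {"holder_a": 120, "holder_b": 45, "holder_c": 310}
n = len(private_values)

# Each holder sends one share of its value to every party.
distributed = [share(v, n) for v in private_values.values()]

# Each party sums the shares it holds; a single share is a uniformly
# random number, so no party learns another holder's input.
partial_sums = [sum(party_shares) % PRIME for party_shares in zip(*distributed)]

# Only the combined result is revealed.
total = sum(partial_sums) % PRIME
print(total)  # 475, computed without exposing 120, 45 or 310
```

Real systems extend this idea to arbitrary computations and to parties that may misbehave, but the principle is the same: shares travel, raw data does not.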

Ethical, Safe, and Effective Digital Data Use in Civil Society


Blog by Lucy Bernholz, Rob Reich, Emma Saunders-Hastings, and Emma Leeds Armstrong: “How do we use digital data ethically, safely, and effectively in civil society? We have developed three early principles for consideration:

  • Default to person-centered consent.
  • Prioritize privacy and minimum viable data collection.
  • Plan from the beginning to open (share) your work.

This post provides a synthesis from a one day workshop that informed these principles. It concludes with links to draft guidelines you can use to inform partnerships between data consultants/volunteers and nonprofit organizations….(More)

These three values — consent, minimum viable data collection, and open sharing — comprise a basic framework for ethical, safe, and effective use of digital data by civil society organizations. They should be integrated into partnerships with data intermediaries and, perhaps, into general data practices in civil society.

We developed two tools to guide conversations between data volunteers and/or consultants and nonprofits. These are downloadable below. Please use them, share them, improve them, and share them again….

  1. Checklist for NGOs and external data consultants
  2. Guidelines for NGOs and external data consultants (More)”

Research on digital identity ecosystems


Francesca Bria et al. at NESTA/D-CENT: “This report presents a concrete analysis of the latest evolution of the identity ecosystem in the big data context, focusing on the economic and social value of data and identity within the current digital economy. This report also outlines economic, policy, and technical alternatives to develop an identity ecosystem and management of data for the common good that respects citizens’ rights, privacy and data protection.

Key findings

  • This study presents a review of the concept of identity and a map of the key players in the identity industry (such as data brokers and data aggregators), including empirical case studies of identity management in key sectors.
    ….
  • The “datafication” of individuals’ social lives, thoughts and moves is a valuable commodity and constitutes the backbone of the “identity market”, within which “data brokers” (collectors, purchasers or sellers) play different key roles in creating the market by offering various services such as fraud detection, customer relations, predictive analytics, marketing and advertising.
  • Economic, political and technical alternatives for identity to preserve trust, privacy and data ownership in today’s big data environments are formulated. The report looks into access to data, economic strategies to manage data as commons, consent and licensing, tools to control data, and terms of services. It also looks into policy strategies such as privacy and data protection by design and trust and ethical frameworks. Finally, it assesses technical implementations looking at identity and anonymity, cryptographic tools; security; decentralisation and blockchains. It also analyses the future steps needed in order to move into the suggested technical strategies….(More)”

Data Collaboratives: Sharing Public Data in Private Hands for Social Good


Beth Simone Noveck (The GovLab) in Forbes: “Sensor-rich consumer electronics such as mobile phones, wearable devices, commercial cameras and even cars are collecting zettabytes of data about the environment and about us. According to one McKinsey study, the volume of data is growing at fifty percent a year. No one needs convincing that these private storehouses of information represent a goldmine for business, but these data can do double duty as rich social assets—if they are shared wisely.

Think about a couple of recent examples: Sharing data held by businesses and corporations (i.e. public data in private hands) can help to improve policy interventions. California planners make water allocation decisions based upon expertise, data and analytical tools from public and private sources, including Intel, the Earth Research Institute at the University of California at Santa Barbara, and the World Food Center at the University of California at Davis.

In Europe, several phone companies have made anonymized datasets available, making it possible for researchers to track calling and commuting patterns and gain better insight into social problems from unemployment to mental health. In the United States, LinkedIn is providing free data about demand for IT jobs in different markets which, when combined with open data from the Department of Labor, helps communities target efforts around training….

Despite the promise of data sharing, these kinds of data collaboratives remain relatively new. There is a need to accelerate their use by giving companies strong tax incentives for sharing data for public good. There’s a need for more study to identify models for data sharing in ways that respect personal privacy and security and enable companies to do well by doing good. My colleagues at The GovLab together with UN Global Pulse and the University of Leiden, for example, published this initial analysis of terms and conditions used when exchanging data as part of a prize-backed challenge. We also need philanthropy to start putting money into “meta research;” it’s not going to be enough to just open up databases: we need to know if the data is good.

After years of growing disenchantment with closed-door institutions, the push for greater use of data in governing can be seen as both a response and as a mirror to the Big Data revolution in business. Although more than 1,000,000 government datasets about everything from air quality to farmers markets are openly available online in downloadable formats, much of the data about environmental, biometric, epidemiological, and physical conditions rest in private hands. Governing better requires a new empiricism for developing solutions together. That will depend on access to these private, not just public data….(More)”

(US) Administration Announces New “Smart Cities” Initiative to Help Communities Tackle Local Challenges and Improve City Services


Factsheet from the White House: “Today, the Administration is announcing a new “Smart Cities” Initiative that will invest over $160 million in federal research and leverage more than 25 new technology collaborations to help local communities tackle key challenges such as reducing traffic congestion, fighting crime, fostering economic growth, managing the effects of a changing climate, and improving the delivery of city services. The new initiative is part of this Administration’s overall commitment to target federal resources to meet local needs and support community-led solutions.

Over the past six years, the Administration has pursued a place-based approach to working with communities as they tackle a wide range of challenges, from investing in infrastructure and filling open technology jobs to bolstering community policing. Advances in science and technology have the potential to accelerate these efforts. An emerging community of civic leaders, data scientists, technologists, and companies are joining forces to build “Smart Cities” – communities that are building an infrastructure to continuously improve the collection, aggregation, and use of data to improve the life of their residents – by harnessing the growing data revolution, low-cost sensors, and research collaborations, and doing so securely to protect safety and privacy.

As part of the initiative, the Administration is announcing:

  • More than $35 million in new grants and over $10 million in proposed investments to build a research infrastructure for Smart Cities by the National Science Foundation and National Institute of Standards and Technology.
  • Nearly $70 million in new spending and over $45 million in proposed investments to unlock new solutions in safety, energy, climate preparedness, transportation, health and more, by the Department of Homeland Security, Department of Transportation, Department of Energy, Department of Commerce, and the Environmental Protection Agency.
  • More than 20 cities participating in major new multi-city collaborations that will help city leaders effectively collaborate with universities and industry.

Today, the Administration is also hosting a White House Smart Cities Forum, coinciding with Smart Cities Week hosted by the Smart Cities Council, to highlight new steps and brainstorm additional ways that science and technology can support municipal efforts.

The Administration’s Smart Cities Initiative will begin with a focus on key strategies:

  • Creating test beds for “Internet of Things” applications and developing new multi-sector collaborative models: Technological advancements and the diminishing cost of IT infrastructure have created the potential for an “Internet of Things,” a ubiquitous network of connected devices, smart sensors, and big data analytics. The United States has the opportunity to be a global leader in this field, and cities represent strong potential test beds for development and deployment of Internet of Things applications. Successfully deploying these and other new approaches often depends on new regional collaborations among a diverse array of public and private actors, including industry, academia, and various public entities.
  • Collaborating with the civic tech movement and forging intercity collaborations: There is a growing community of individuals, entrepreneurs, and nonprofits interested in harnessing IT to tackle local problems and work directly with city governments. These efforts can help cities leverage their data to develop new capabilities. Collaborations across communities are likewise indispensable for replicating what works in new places.
  • Leveraging existing Federal activity: From research on sensor networks and cybersecurity to investments in broadband infrastructure and intelligent transportation systems, the Federal government has an existing portfolio of activities that can provide a strong foundation for a Smart Cities effort.
  • Pursuing international collaboration: Fifty-four percent of the world’s population live in urban areas. Continued population growth and urbanization will add 2.5 billion people to the world’s urban population by 2050. The associated climate and resource challenges demand innovative approaches. Products and services associated with this market present a significant export opportunity for the U.S., since almost 90 percent of this increase will occur in Africa and Asia.

Complementing this effort, the President’s Council of Advisors on Science and Technology is examining how a variety of technologies can enhance the future of cities and the quality of life for urban residents. The Networking and Information Technology Research and Development (NITRD) Program is also announcing the release of a new framework to help coordinate Federal agency investments and outside collaborations that will guide foundational research and accelerate the transition into scalable and replicable Smart City approaches. Finally, the Administration’s growing work in this area is reflected in the Science and Technology Priorities Memo, issued by the Office of Management and Budget and Office of Science and Technology Policy in preparation for the President’s 2017 budget proposal, which includes a focus on cyber-physical systems and Smart Cities….(More)”

The impact of Open Data


GovLab/Omidyar Network: “…share insights gained from our current collaboration with Omidyar Network on a series of open data case studies. These case studies – 19, in total – are designed to provide a detailed examination of the various ways open data is being used around the world, across geographies and sectors, and to draw some over-arching lessons. The case studies are built from extensive research, including in-depth interviews with key participants in the various open data projects under study….

Ways in which open data impacts lives

Broadly, we have identified four main ways in which open data is transforming economic, social, cultural and political life, and hence improving people’s lives.

  • First, open data is improving government, primarily by helping tackle corruption, improving transparency, and enhancing public services and resource allocation.
  • Open data is also empowering citizens to take control of their lives and demand change; this dimension of impact is mediated by more informed decision making and new forms of social mobilization, both facilitated by new ways of communicating and accessing information.
  • Open data is also creating new opportunities for citizens and groups, by stimulating innovation and promoting economic growth and development.
  • Finally, open data is playing an increasingly important role in solving big public problems, primarily by allowing citizens and policymakers to engage in new forms of data-driven assessment and data-driven engagement.


Enabling Conditions

While these are the four main ways in which open data is driving change, we have seen wide variability in the amount and nature of impact across our case studies. Put simply, some projects are more successful than others; or some projects might be more successful in a particular dimension of impact, and less successful in others.

As part of our research, we have therefore tried to identify some enabling conditions that maximize the positive impact of open data projects. These four stand out:

  • Open data projects are most successful when they are built not from the efforts of single organizations or government agencies, but when they emerge from partnerships across sectors (and even borders). The role of intermediaries (e.g., the media and civil society groups) and of “data collaboratives” is particularly important.
  • Several of the projects we have seen have emerged on the back of what we might think of as an open data public infrastructure – i.e., the technical backend and organizational processes necessary to enable the regular release of potentially impactful data to the public.
  • Clear open data policies, including well-defined performance metrics, are also essential; policymakers and political leaders have an important role in creating an enabling (yet flexible) legal environment that includes mechanisms for project assessments and accountability, as well as providing the type of high-level political buy-in that can empower practitioners to work with open data.
  • We have also seen that the most successful open data projects tend to be those that target a well-defined problem or issue. In other words, projects with maximum impact often meet a genuine citizen need.


Challenges

Impact is also determined by the obstacles and challenges that a project confronts. Some regions and some projects face a greater number of hurdles. These also vary, but we have found four challenges that appear most often in our case studies:

  • Projects in countries or regions with low capacity or “readiness” (indicated, for instance, by low Internet penetration rates or hostile political environments) typically fare less well.
  • Projects that are unresponsive to feedback and user needs are less likely to succeed than those that are flexible and able to adapt to what their users want.
  • Open data often exists in tension with privacy and security risks; often, the impact of a project is limited or harmed when it fails to take these risks into account and mitigate them.
  • Although open data projects are often “hackable” and cheap to get off the ground, the most successful do require investments – of time and money – after their launch; inadequate resource allocation is one of the most common reasons for a project to fail.

These lists of impacts, enabling factors and challenges are, of course, preliminary. We continue to refine our research and will include a final set of findings along with our final report….(More)

On the Farm: Startups Put Data in Farmers’ Hands


Jacob Bunge at the Wall Street Journal: “Farmers and entrepreneurs are starting to compete with agribusiness giants over the newest commodity being harvested on U.S. farms—one measured in bytes, not bushels.

Startups including Farmobile LLC, Granular Inc. and Grower Information Services Cooperative are developing computer systems that will enable farmers to capture data streaming from their tractors and combines, store it in digital silos and market it to agriculture companies or futures traders. Such platforms could allow farmers to reap larger profits from a technology revolution sweeping the U.S. Farm Belt and give them more control over the information generated on their fields.

The efforts in some cases would challenge a wave of data-analysis tools from big agricultural companies such as Monsanto Co., DuPont Co., Deere & Co. and Cargill Inc. Those systems harness modern planters, combines and other machinery outfitted with sensors to track planting, spraying and harvesting, then crunch that data to provide farm-management guidance that these firms say can help farmers curb costs and grow larger crops. The companies say farmers own their data, and it won’t be sold to third parties.

Some farmers and entrepreneurs say crop producers can get the most from their data by compiling and analyzing it themselves—for instance, to determine the best time to apply fertilizer to their soil and how much. Then, farmers could profit further by selling data to seed, pesticide and equipment makers seeking a glimpse into how and when farmers use machinery and crop supplies.

The new ventures come as farmers weigh the potential benefits of sharing their data with large agricultural firms against privacy concerns and fears that agribusinesses could leverage farm-level information to charge higher rates for seeds, pesticides and other supplies.

“We need to get farmers involved in this because it’s their information,” said Dewey Hukill, board president of Grower Information Services Cooperative, or GISC, a farmer-owned cooperative that is building a platform to collect its members’ data. The cooperative has signed up about 1,500 members across 37 states….

Companies developing markets for farm data say it’s not their intention to displace big seed and machinery suppliers but to give farmers a platform that would enable them to manage their own information. Storing and selling their own data wouldn’t necessarily bar a farmer from sharing information with a seed company to get a planting recommendation, they say….(More)”


A data revolution is underway. Will NGOs miss the boat?


Opinion by Sophia Ayele at Oxfam: “The data revolution has arrived. ….The UN has even launched a Data Revolution Group (to ensure that the revolution penetrates into international development). The Group’s 2014 report suggests that harnessing the power of newly available data could ultimately lead to, “more empowered people, better policies, better decisions and greater participation and accountability, leading to better outcomes for people and the planet.”

But where do NGOs fit in?

NGOs are generating dozens (if not hundreds) of datasets every year. Over the last two decades, NGOs have been collecting increasing amounts of research and evaluation data, largely driven by donor demands for more rigorous evaluations of programs. The quality and efficiency of data collection has also been enhanced by mobile data collection. However, a quick scan of UK development NGOs reveals that few, if any, are sharing the data that they collect. This means that NGOs are generating dozens (if not hundreds) of datasets every year that aren’t being fully exploited and analysed. Working on tight budgets, with limited capacity, it’s not surprising that NGOs often shy away from sharing data without a clear mandate.

But change is in the air. Several donors have begun requiring NGOs to publicise data and others appear to be moving in that direction. Last year, USAID launched its Open Data Policy which requires that grantees “submit any dataset created or collected with USAID funding…” Not only does USAID stipulate this requirement, it also hosts this data on its Development Data Library (DDL) and provides guidance on anonymisation to depositors. Similarly, Gates Foundation’s 2015 Open Access Policy stipulates that, “Data underlying published research results will be accessible and open immediately.” However, they are allowing a two-year transition period…..Here at Oxfam, we have been exploring ways to begin sharing research and evaluation data. We aren’t being required to do this – yet – but, we realise that the data that we collect is a public good with the potential to improve lives through more effective development programmes and to raise the voices of those with whom we work. Moreover, organizations like Oxfam can play a crucial role in highlighting issues facing women and other marginalized communities that aren’t always captured in national statistics. Sharing data is also good practice and would increase our transparency and accountability as an organization.

However, Oxfam also bears a huge responsibility to protect the rights of the communities that we work with. This involves ensuring informed consent when gathering data, so that communities are fully aware that their data may be shared, and de-identifying data to a level where individuals and households cannot be easily identified.
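
One common way to make “cannot be easily identified” operational is a k-anonymity check: after removing direct identifiers and coarsening the remaining quasi-identifiers, confirm that every combination of quasi-identifiers is shared by at least k respondents before a file is released. The sketch below is a generic illustration with made-up column names and a made-up threshold, not Oxfam’s actual de-identification procedure.

```python
import pandas as pd

def smallest_group(df, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values."""
    return df.groupby(quasi_identifiers, observed=True).size().min()

# Hypothetical survey extract after direct identifiers (names, phone
# numbers, GPS coordinates) have already been removed.
survey = pd.DataFrame({
    "district": ["North", "North", "South", "South", "South", "North"],
    "age": [34, 37, 51, 49, 23, 62],
    "food_insecure": [True, False, True, True, False, True],
})

# Coarsen exact ages into bands so fewer respondents are unique.
survey["age_band"] = pd.cut(survey["age"], bins=[0, 30, 50, 120],
                            labels=["<30", "30-49", "50+"])

quasi = ["district", "age_band"]
k = smallest_group(survey, quasi)
print(f"smallest group size: {k}")

# Rule of thumb used here (an assumption, not a universal standard):
# only share the file once every combination covers at least 5 people.
if k < 5:
    print("not safe to share yet: coarsen further or suppress rare rows")
```

The right threshold depends on the sensitivity of the data and the population it describes, and de-identification still needs to be paired with the informed-consent step described above.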

As Oxfam has outlined in our recently adopted Responsible Data Policy, “Using data responsibly is not just an issue of technical security and encryption but also of safeguarding the rights of people to be counted and heard, ensuring their dignity, respect and privacy, enabling them to make an informed decision and protecting their right to not be put at risk… (More)”

Anonymization and Risk


Paper by Ira Rubinstein and Woodrow Hartzog: “Perfect anonymization of data sets has failed. But the process of protecting data subjects in shared information remains integral to privacy practice and policy. While the deidentification debate has been vigorous and productive, there is no clear direction for policy. As a result, the law has been slow to adapt a holistic approach to protecting data subjects when data sets are released to others. Currently, the law is focused on whether an individual can be identified within a given set. We argue that the better locus of data release policy is on the process of minimizing the risk of reidentification and sensitive attribute disclosure. Process-based data release policy, which resembles the law of data security, will help us move past the limitations of focusing on whether data sets have been “anonymized.” It draws upon different tactics to protect the privacy of data subjects, including accurate deidentification rhetoric, contracts prohibiting reidentification and sensitive attribute disclosure, data enclaves, and query-based strategies to match required protections with the level of risk. By focusing on process, data release policy can better balance privacy and utility where nearly all data exchanges carry some risk….(More)”