There aren’t any rules on how social scientists use private data. Here’s why we need them.


 at SSRC: “The politics of social science access to data are shifting rapidly in the United States as in other developed countries. It used to be that states were the most important source of data on their citizens, economy, and society. States needed to collect and aggregate large amounts of information for their own purposes. They gathered this directly—e.g., through censuses of individuals and firms—and also constructed relevant indicators. Sometimes state agencies helped to fund social science projects in data gathering, such as the National Science Foundation’s funding of the American National Election Survey over decades. While scholars such as James Scott and John Brewer disagreed about the benefits of state data gathering, they recognized the state’s primary role.

In this world, the politics of access to data were often the politics of engaging with the state. Sometimes the state was reluctant to provide information, either for ethical reasons (e.g. the privacy of its citizens) or self-interest. However, democratic states did typically provide access to standard statistical series and the like, and where they did not, scholars could bring pressure to bear on them. This led to well-understood rules about the common availability of standard data for many research questions and built the foundations for standard academic practices. It was relatively easy for scholars to criticize each other’s work when they were drawing on common sources. This had costs—scholars tended to ask the kinds of questions that readily available data allowed them to ask—but also significant benefits. In particular, it made research more easily reproducible.

We are now moving to a very different world. On the one hand, open data initiatives in government are making more data available than in the past (albeit often without much in the way of background resources or documentation). On the other, for many research purposes, large firms such as Google or Facebook (or even Apple) have much better data than the government. The new universe of private data is reshaping social science research in some ways that are still poorly understood. Here are some of the issues that we need to think about:…(More)”

Bridging data gaps for policymaking: crowdsourcing and big data for development


 for the DevPolicyBlog: “…By far the biggest innovation in data collection is the ability to access and analyse (in a meaningful way) user-generated data. This is data that is generated from forums, blogs, and social networking sites, where users purposefully contribute information and content in a public way, but also from everyday activities that inadvertently or passively provide data to those that are able to collect it.

User-generated data can help identify user views and behaviour to inform policy in a timely way rather than just relying on traditional data collection techniques (census, household surveys, stakeholder forums, focus groups, etc.), which are often cumbersome, very costly, untimely, and in many cases require some form of approval or support by government.

It might seem at first that user-generated data has limited usefulness in a development context due to the importance of the internet in generating this data combined with limited internet availability in many places. However, U-Report is one example of being able to access user-generated data independent of the internet.

U-Report was initiated by UNICEF Uganda in 2011 and is a free SMS-based platform where Ugandans are able to register as “U-Reporters” and on a weekly basis give their views on topical issues (mostly related to health, education, and access to social services) or participate in opinion polls. As an example, Figure 1 shows the result from a U-Report poll on whether polio vaccinators came to U-Reporter houses to immunise all children under 5 in Uganda, broken down by districts. Presently, there are more than 300,000 U-Reporters in Uganda and more than one million U-Reporters across 24 countries that now have U-Report. As an indication of its potential impact on policymaking, UNICEF claims that every Member of Parliament in Uganda is signed up to receive U-Report statistics.

Figure 1: U-Report Uganda poll results


U-Report and other platforms such as Ushahidi (which supports, for example, I PAID A BRIBE, Watertracker, election monitoring, and crowdmapping) facilitate crowdsourcing of data where users contribute data for a specific purpose. In contrast, “big data” is a broader concept because the purpose of using the data is generally independent of the reasons why the data was generated in the first place.

Big data for development is a new phrase that we will probably hear a lot more (see here [pdf] and here). The United Nations Global Pulse, for example, supports a number of innovation labs which work on projects that aim to discover new ways in which data can help better decision-making. Many forms of “big data” are unstructured (free-form and text-based rather than table- or spreadsheet-based) and so a number of analytical techniques are required to make sense of the data before it can be used.

Measures of Twitter activity, for example, can be a real-time indicator of food price crises in Indonesia [pdf] (see Figure 2 below which shows the relationship between food-related tweet volume and food inflation: note that the large volume of tweets in the grey highlighted area is associated with policy debate on cutting the fuel subsidy rate) or provide a better understanding of the drivers of immunisation awareness. In these examples, researchers “text-mine” Twitter feeds by extracting tweets related to topics of interest and categorising text based on measures of sentiment (positive, negative, anger, joy, confusion, etc.) to better understand opinions and how they relate to the topic of interest. For example, Figure 3 shows the sentiment of tweets related to vaccination in Kenya over time and the dates of important vaccination-related events.
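
As a rough illustration of the text-mining approach described above, the following Python sketch keeps only tweets that mention a topic keyword and tags each one with a sentiment label taken from a small word list. The keyword set, the lexicon, the function names, and the sample tweets are all hypothetical. The projects cited here do not describe their tooling in this excerpt, and a real study would likely rely on much richer language models and data.

```python
from collections import Counter

# Hypothetical topic keywords and sentiment lexicon, for illustration only.
TOPIC_KEYWORDS = {"vaccine", "vaccination", "immunisation"}
SENTIMENT_LEXICON = {
    "safe": "positive",
    "protects": "positive",
    "grateful": "joy",
    "scared": "negative",
    "dangerous": "negative",
    "furious": "anger",
}


def tokenize(text):
    """Lowercase a tweet and split it into words, stripping basic punctuation."""
    return [word.strip(".,!?#@\"'").lower() for word in text.split()]


def is_on_topic(text):
    """Keep a tweet only if it mentions at least one topic keyword."""
    return any(token in TOPIC_KEYWORDS for token in tokenize(text))


def classify_sentiment(text):
    """Return the most frequent sentiment label among lexicon hits, or 'neutral'."""
    hits = Counter(
        SENTIMENT_LEXICON[token] for token in tokenize(text) if token in SENTIMENT_LEXICON
    )
    return hits.most_common(1)[0][0] if hits else "neutral"


def sentiment_by_month(tweets):
    """Count sentiment labels per month for on-topic tweets.

    `tweets` is an iterable of (month, text) pairs, e.g. ("2013-05", "...").
    """
    counts = {}
    for month, text in tweets:
        if is_on_topic(text):
            counts.setdefault(month, Counter())[classify_sentiment(text)] += 1
    return counts


# Invented sample data; not drawn from any real Twitter feed.
sample_tweets = [
    ("2013-05", "The vaccine is safe and protects children"),
    ("2013-05", "I am scared the vaccination is dangerous"),
    ("2013-05", "Traffic is terrible today"),
    ("2013-06", "Grateful for the immunisation campaign!"),
]
monthly = sentiment_by_month(sample_tweets)
for month in sorted(monthly):
    print(month, dict(monthly[month]))
```

Monthly counts of this kind are what a chart like Figure 3 would plot over time, alongside the dates of relevant events.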

Figure 2: Plot of monthly food-related tweet volume and official food price statistics


Figure 3: Sentiment of vaccine-related tweets in Kenya


Another big data example is the use of mobile phone usage to monitor the movement of populations in Senegal in 2013. The data can help to identify changes in the mobility patterns of vulnerable population groups and thereby provide an early warning system to inform humanitarian response effort.

The development of mobile banking too offers the potential for the generation of a staggering amount of data relevant for development research and informing policy decisions. However, it also highlights the public good nature of data collected by public and private sector institutions and the reliance that researchers have on them to access the data. Building trust and a reputation for being able to manage privacy and commercial issues will be a major challenge for researchers in this regard….(More)”

Priorities for the National Privacy Research Strategy


James Kurose and Keith Marzullo at the White House: “Vast improvements in computing and communications are creating new opportunities for improving life and health, eliminating barriers to education and employment, and enabling advances in many sectors of the economy. The promise of these new applications frequently comes from their ability to create, collect, process, and archive information on a massive scale.

However, the rapid increase in the quantity of personal information that is being collected and retained, combined with our increased ability to analyze and combine it with other information, is creating concerns about privacy. When information about people and their activities can be collected, analyzed, and repurposed in so many ways, it can create new opportunities for crime, discrimination, inadvertent disclosure, embarrassment, and harassment.

This Administration has been a strong champion of initiatives to improve the state of privacy, such as the “Consumer Privacy Bill of Rights” proposal and the creation of the Federal Privacy Council. Similarly, the White House report Big Data: Seizing Opportunities, Preserving Values highlights the need for large-scale privacy research, stating: “We should dramatically increase investment for research and development in privacy-enhancing technologies, encouraging cross-cutting research that involves not only computer science and mathematics, but also social science, communications and legal disciplines.”

Today, we are pleased to release the National Privacy Research Strategy. Research agencies across government participated in the development of the strategy, reviewing existing Federal research activities in privacy-enhancing technologies, soliciting inputs from the private sector, and identifying priorities for privacy research funded by the Federal Government. The National Privacy Research Strategy calls for research along a continuum of challenges, from how people understand privacy in different situations and how their privacy needs can be formally specified, to how these needs can be addressed, to how to mitigate and remediate the effects when privacy expectations are violated. This strategy proposes the following priorities for privacy research:

  • Foster a multidisciplinary approach to privacy research and solutions;
  • Understand and measure privacy desires and impacts;
  • Develop system design methods that incorporate privacy desires, requirements, and controls;
  • Increase transparency of data collection, sharing, use, and retention;
  • Assure that information flows and use are consistent with privacy rules;
  • Develop approaches for remediation and recovery; and
  • Reduce privacy risks of analytical algorithms.

With this strategy, our goal is to produce knowledge and technology that will enable individuals, commercial entities, and the Federal Government to benefit from technological advancements and data use while proactively identifying and mitigating privacy risks. Following the release of this strategy, we are also launching a Federal Privacy R&D Interagency Working Group, which will lead the coordination of the Federal Government’s privacy research efforts. Among the group’s first public activities will be to host a workshop to discuss the strategic plan and explore directions of follow-on research. It is our hope that this strategy will also inspire parallel efforts in the private sector….(More)”

Reforms to improve U.S. government accountability


Alexander B. Howard and Patrice McDermott in Science: “Five decades after the United States first enacted the Freedom of Information Act (FOIA), Congress has voted to make the first major reforms to the statute since 2007. President Lyndon Johnson signed the first FOIA on 4 July 1966, enshrining in law the public’s right to access to information from executive branch government agencies. Scientists and others around the world can use the FOIA to learn what the U.S. government has done in its policies and practices. Proposed reforms should be a net benefit to public understanding of the scientific process and knowledge, by increasing the access of scientists to archival materials and reducing the likelihood of science and scientists being suppressed by official secrecy or bureaucracy.

Although the FOIA has been important for accountability, reform is sorely needed. An analysis of the 15 federal government agencies that received the most FOIA requests found poor to abysmal compliance rates (1, 2). In 2016, the Associated Press found that the Obama Administration had set a new record for unfulfilled FOIA requests (3). Although that has to be considered in the context of a rise in request volume without commensurate increases in resources to address them, researchers have found that most agencies simply ignore routine requests for travel schedules (4). An audit of 165 federal government agencies found that only 40% complied with the E-FOIA Act of 1996; just 67 of them had online libraries that were regularly updated with a substantial number of documents released under FOIA (5).

In the face of growing concerns about compliance, FOIA reform was one of the few recent instances of bicameral bipartisanship in Congress, with the House and Senate each passing bills this spring with broad support. Now that Congress has moved to send the Senate bill on to the president to sign into law, implementation of specific provisions will bear close scrutiny, including the potential impact of disclosure upon scientists who work in or with government agencies (6). Proposed revisions to the FOIA statute would improve how government discloses information to the public, while leaving intact exemptions for privacy, proprietary information, deliberative documents, and national security.

Features of Reforms

One of the major reforms in the House and Senate bills was to codify the “presumption of openness” outlined by President Obama the day after he took office in January 2009 when he declared that FOIA should be administered with a clear presumption: In the face of doubt, “openness” would prevail. This presumption of openness was affirmed by U.S. Attorney General Holder in March 2009. Although these declarations have had limited effect in the agencies (as described above), codifying these reforms into law is crucial not only to ensure that this remains executive branch policy after this president leaves office but also to provide requesters with legal force beyond an executive order….(More)”

Privacy concerns in smart cities


Liesbet van Zoonen in Government Information Quarterly: “In this paper a framework is constructed to hypothesize if and how smart city technologies and urban big data produce privacy concerns among the people in these cities (as inhabitants, workers, visitors, and otherwise). The framework is built on the basis of two recurring dimensions in research about people’s concerns about privacy: one dimension represents that people perceive particular data as more personal and sensitive than others, the other dimension represents that people’s privacy concerns differ according to the purpose for which data is collected, with the contrast between service and surveillance purposes most paramount. These two dimensions produce a 2 × 2 framework that hypothesizes which technologies and data-applications in smart cities are likely to raise people’s privacy concerns, distinguishing between raising hardly any concern (impersonal data, service purpose), to raising controversy (personal data, surveillance purpose). Specific examples from the city of Rotterdam are used to further explore and illustrate the academic and practical usefulness of the framework. It is argued that the general hypothesis of the framework offers clear directions for further empirical research and theory building about privacy concerns in smart cities, and that it provides a sensitizing instrument for local governments to identify the absence, presence, or emergence of privacy concerns among their citizens….(More)”
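
The 2 × 2 framework in the abstract above can be read as a simple lookup, sketched here in Python. The enum and function names are hypothetical. The abstract only characterizes two of the four cells, so the sketch labels the remaining two as unspecified rather than guessing at the paper's wording.

```python
from enum import Enum


class DataType(Enum):
    IMPERSONAL = "impersonal"
    PERSONAL = "personal"


class Purpose(Enum):
    SERVICE = "service"
    SURVEILLANCE = "surveillance"


def hypothesized_concern(data_type, purpose):
    """Map a (data type, purpose) pair to the level of concern the abstract names.

    Only the two corner cases are stated in the abstract; the mixed cells
    are returned as unspecified, which is an assumption of this sketch.
    """
    if data_type is DataType.IMPERSONAL and purpose is Purpose.SERVICE:
        return "hardly any concern"
    if data_type is DataType.PERSONAL and purpose is Purpose.SURVEILLANCE:
        return "controversy"
    return "unspecified in the abstract"


# Print all four cells of the framework.
for data_type in DataType:
    for purpose in Purpose:
        print(data_type.value, purpose.value, "->", hypothesized_concern(data_type, purpose))
```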

Crowdsourcing privacy policy analysis: Potential, challenges and best practices


Paper by , and : “Privacy policies are supposed to provide transparency about a service’s data practices and help consumers make informed choices about which services to entrust with their personal information. In practice, those privacy policies are typically long and complex documents that are largely ignored by consumers. Even for regulators and data protection authorities privacy policies are difficult to assess at scale. Crowdsourcing offers the potential to scale the analysis of privacy policies with microtasks, for instance by assessing how specific data practices are addressed in privacy policies or extracting information about data practices of interest, which can then facilitate further analysis or be provided to users in more effective notice formats. Crowdsourcing the analysis of complex privacy policy documents to non-expert crowdworkers poses particular challenges. We discuss best practices, lessons learned and research challenges for crowdsourcing privacy policy analysis….(More)”

Big Data Challenges: Society, Security, Innovation and Ethics


Book edited by Bunnik, A., Cawley, A., Mulqueen, M., Zwitter, A: “This book brings together an impressive range of academic and intelligence professional perspectives to interrogate the social, ethical and security upheavals in a world increasingly driven by data. Written in a clear and accessible style, it offers fresh insights into the deep-reaching implications of Big Data for communication, privacy and organisational decision-making. It seeks to demystify developments around Big Data before evaluating their current and likely future implications for areas as diverse as corporate innovation, law enforcement, data science, journalism, and food security. The contributors call for a rethinking of the legal, ethical and philosophical frameworks that inform the responsibilities and behaviours of state, corporate, institutional and individual actors in a more networked, data-centric society. In doing so, the book addresses the real world risks, opportunities and potentialities of Big Data….(More)”

City of Copenhagen launches data marketplace


Sarah Wray at TMForum: “The City of Copenhagen has launched its City Data Exchange to make public and private data accessible to power innovation.

The City Data Exchange is a new service to create a ‘marketplace for data’ from public and private data providers and allow monetization. The platform has been developed by Hitachi Insight Group.

“Data is the fuel powering our digital world, but in most cities it is unused,” said Hans Lindeman, Senior Vice President, Hitachi Insight Group, EMEA. “Even where data sits in public, freely accessible databases, the cost of extracting and processing it can easily outweigh the benefits.”

The City of Copenhagen is using guidelines for a data format that is safe, secure, ensures privacy and makes data easy to use. The City Data Exchange will only accept data that has been fully anonymized by the data supplier, for example.

According to Hitachi Insight Group, “All of this spares organizations the trouble and cost of extracting and processing data from multiple sources. At the same time, proprietary data can now become a business resource that can be monetized outside an organization.”

As a way to demonstrate how data from the City Data Exchange could be used in applications, Hitachi Insight Group is developing two applications:

  • Journey Insight, which helps citizens in the region to track their transportation usage over time and understand the carbon footprint of their travel
  • Energy Insight, which allows both households and businesses to see how much energy they use.

Both are set for public launch later this year.

Another example of how data marketplaces can enable innovation is the Mind My Business mobile app, developed by Vizalytics. It brings together all the data that can affect a retailer — from real-time information on how construction or traffic issues can hurt the footfall of a business, to timely reminders about taxes to pay or new regulations to meet. The “survival app for shopkeepers” makes full use of all the relevant data sources brought together by the City Data Exchange.

The platform will offer data in different categories such as: city life, infrastructure, climate and environment, business data and economy, demographics, housing and buildings, and utilities usage. It aims to meet the needs of local government, city planners, architects, retailers, telecoms networks, utilities, and all other companies and organizations who want to understand what makes Copenhagen, its businesses and its citizens tick.

“Smart cities need smart insights, and that’s only possible if everybody has all the facts at their disposal. The City Data Exchange makes that possible; it’s the solution that will help us all to create better public spaces and — for companies in Copenhagen — to offer better services and create jobs,” said Frank Jensen, the Lord Mayor of Copenhagen.

The City Data Exchange is currently offering raw data to its customers, and later this year will add analytical tools. The cost of gathering and processing the data will be recovered through subscription and service fees, which are expected to be much lower than the cost any company or city would face in performing the work of extracting, collecting and integrating the data by themselves….(More)”

Are we too obsessed with data?


Lauren Woodman of NetHope: “Data: Everyone’s talking about it, everyone wants more of it….

Still, I’d posit that we’re too obsessed with data. Not just us in the humanitarian space, of course, but everyone. How many likes did that Facebook post get? How many airline miles did I fly last year? How many hours of sleep did I get last week?…

The problem is that data by itself isn’t that helpful: information is.

We need to develop a new obsession, around making sure that data is actionable, that it is relevant in the context in which we work, and on making sure that we’re using the data as effectively as we are collecting it.

In my talk at ICT4D, I referenced the example of 7-Eleven in Japan. In the 1970s, 7-Eleven in Japan became independent from its parent, Southland Corporation. The CEO had to build a viable business in a tough economy. Every month, each store manager would receive reams of data, but it wasn’t effective until the CEO stripped out the noise and provided just four critical data points that had the greatest relevance to drive the local purchasing that each store was empowered to do on their own.

Those points – what sold the day before, what sold the same day a year ago, what sold the last time the weather was the same, and what other stores sold the day before – were transformative. Within a year, 7-Eleven had turned a corner, and for 30 years, remained the most profitable retailer in Japan. It wasn’t about the Big Data; it was figuring out what data was relevant, actionable and empowered local managers to make nimble decisions.

For our sector to get there, we need to do the front-end work that transforms our data into information that we can use. That, after all, is where the magic happens.

A few examples provide more clarity as to why this is so critical.

We know that adaptive decision-making requires access to real-time data. By knowing what is happening in real-time, or near-real-time, we can adjust our approaches and interventions to be most impactful. But to do so, our data has to be accessible to those that are empowered to make decisions. To achieve that, we have to make investments in training, infrastructure, and capacity-building at the organizational level. But in the nonprofit sector, such investments are rarely supported by donors and are beyond the limited unrestricted funding available to most organizations. As a result, the sector has, so far, been able to take only limited steps towards effective data usage, hampering our ability to transform the massive amounts of data we have into useful information.

Another big question about data, particularly in the humanitarian space, is whether it should be open, closed or somewhere in between. Privacy is certainly paramount, and for some types of data, the need for close protection is very clear. For many other data, however, the rules are far less clear. Every country has its own rules about how data can and cannot be used or shared, and more work is needed to provide clarity and predictability so that appropriate data-sharing can evolve.

And perhaps more importantly, we need to think about not just the data, but the use cases. Most of us would agree, for example, that sharing information during a crisis situation can be hugely beneficial to the people and the communities we serve – but in a world where rules are unclear, that ambiguity limits what we can do with the data we have. Here again, the context in which data will be used is critically important.

Finally, all of us in the sector have to realize that the journey to transforming data into information is one we’re on together. We have to be willing to give and take. Having data is great; sharing information is better. Sometimes, we have to co-create that basis to ensure we all benefit….(More)”

Selected Readings on Data Collaboratives


By Neil Britto, David Sangokoya, Iryna Susha, Stefaan Verhulst and Andrew Young

The Living Library’s Selected Readings series seeks to build a knowledge base on innovative approaches for improving the effectiveness and legitimacy of governance. This curated and annotated collection of recommended works on the topic of data collaboratives was originally published in 2017.

The term data collaborative refers to a new form of collaboration, beyond the public-private partnership model, in which participants from different sectors (including private companies, research institutions, and government agencies) can exchange data to help solve public problems. Several of society’s greatest challenges, from addressing climate change to public health to job creation to improving the lives of children, require greater access to data, more collaboration between public- and private-sector entities, and an increased ability to analyze datasets. In the coming months and years, data collaboratives will be essential vehicles for harnessing the vast stores of privately held data toward the public good.


Annotated Selected Readings List (in alphabetical order)

Agaba, G., Akindès, F., Bengtsson, L., Cowls, J., Ganesh, M., Hoffman, N., . . . Meissner, F. “Big Data and Positive Social Change in the Developing World: A White Paper for Practitioners and Researchers.” 2014. http://bit.ly/25RRC6N.

  • This white paper, produced by “a group of activists, researchers and data experts” explores the potential of big data to improve development outcomes and spur positive social change in low- and middle-income countries. Using examples, the authors discuss four areas in which the use of big data can impact development efforts:
    • Advocating and facilitating by “open[ing] up new public spaces for discussion and awareness building”;
    • Describing and predicting through the detection of “new correlations and the surfac[ing] of new questions”;
    • Facilitating information exchange through “multiple feedback loops which feed into both research and action”; and
    • Promoting accountability and transparency, especially as a byproduct of crowdsourcing efforts aimed at “aggregat[ing] and analyz[ing] information in real time.”
  • The authors argue that in order to maximize the potential of big data’s use in development, “there is a case to be made for building a data commons for private/public data, and for setting up new and more appropriate ethical guidelines.”
  • They also identify a number of challenges, especially when leveraging data made accessible from a number of sources, including private sector entities, such as:
    • Lack of general data literacy;
    • Lack of open learning environments and repositories;
    • Lack of resources, capacity and access;
    • Challenges of sensitivity and risk perception with regard to using data;
    • Storage and computing capacity; and
    • Externally validating data sources for comparison and verification.

Ansell, C. and Gash, A. “Collaborative Governance in Theory and Practice.” Journal of Public Administration Research and Theory 18 (4), 2008. http://bit.ly/1RZgsI5.

  • This article describes collaborative arrangements that include public and private organizations working together and proposes a model for understanding an emergent form of public-private interaction informed by 137 diverse cases of collaborative governance.
  • The article suggests factors significant to successful partnering processes and outcomes include:
    • Shared understanding of challenges,
    • Trust building processes,
    • The importance of recognizing seemingly modest progress, and
    • Strong indicators of commitment to the partnership’s aspirations and process.
  • The authors provide a “contingency theory model” that specifies relationships between different variables that influence outcomes of collaborative governance initiatives. Three “core contingencies” for successful collaborative governance initiatives identified by the authors are:
    • Time (e.g., decision making time afforded to the collaboration);
    • Interdependence (e.g., a high degree of interdependence can mitigate negative effects of low trust); and
    • Trust (e.g., a higher level of trust indicates a higher probability of success).

Ballivian A, Hoffman W. “Public-Private Partnerships for Data: Issues Paper for Data Revolution Consultation.” World Bank, 2015. Available from: http://bit.ly/1ENvmRJ

  • This World Bank report provides a background document on forming public-private partnerships for data in order to inform the UN’s Independent Expert Advisory Group (IEAG) on sustaining a “data revolution” in sustainable development.
  • The report highlights the critical position of private companies within the data value chain and reflects on key elements of a sustainable data PPP: “common objectives across all impacted stakeholders, alignment of incentives, and sharing of risks.” In addition, the report describes the risks and incentives of public and private actors, and the principles needed to “build[ing] the legal, cultural, technological and economic infrastructures to enable the balancing of competing interests.” These principles include understanding; experimentation; adaptability; balance; persuasion and compulsion; risk management; and governance.
  • Examples of data collaboratives cited in the report include HP Earth Insights, Orange Data for Development Challenges, Amazon Web Services, IBM Smart Cities Initiative, and the Governance Lab’s Open Data 500.

Brack, Matthew, and Tito Castillo. “Data Sharing for Public Health: Key Lessons from Other Sectors.” Chatham House, Centre on Global Health Security. April 2015. Available from: http://bit.ly/1DHFGVl

  • The Chatham House report provides an overview on public health surveillance data sharing, highlighting the benefits and challenges of shared health data and the complexity in adapting technical solutions from other sectors for public health.
  • The report describes data sharing processes from several perspectives, including in-depth case studies of actual data sharing in practice at the individual, organizational and sector levels. Among the key lessons for public health data sharing, the report strongly highlights the need to harness momentum for action and maintain collaborative engagement: “Successful data sharing communities are highly collaborative. Collaboration holds the key to producing and abiding by community standards, and building and maintaining productive networks, and is by definition the essence of data sharing itself. Time should be invested in establishing and sustaining collaboration with all stakeholders concerned with public health surveillance data sharing.”
  • Examples of data collaboratives include H3Africa (a collaboration between NIH and Wellcome Trust) and NHS England’s care.data programme.

de Montjoye, Yves-Alexandre, Jake Kendall, and Cameron F. Kerry. “Enabling Humanitarian Use of Mobile Phone Data.” The Brookings Institution, Issues in Technology Innovation. November 2014. Available from: http://brook.gs/1JxVpxp

  • Using Ebola as a case study, the authors describe the value of using private telecom data for uncovering “valuable insights into understanding the spread of infectious diseases as well as strategies into micro-target outreach and driving uptake of health-seeking behavior.”
  • The authors highlight the absence of a common legal and standards framework for “sharing mobile phone data in privacy-conscientious ways” and recommend “engaging companies, NGOs, researchers, privacy experts, and governments to agree on a set of best practices for new privacy-conscientious metadata sharing models.”

Eckartz, Silja M., Hofman, Wout J., Van Veenstra, Anne Fleur. “A decision model for data sharing.” Vol. 8653 LNCS. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2014. http://bit.ly/21cGWfw.

  • This paper proposes a decision model for data sharing of public and private data based on literature review and three case studies in the logistics sector.
  • The authors identify five categories of barriers to data sharing and offer a decision model for identifying potential interventions to overcome each barrier:
    • Ownership. Possible interventions likely require improving trust among those who own the data through, for example, involvement and support from higher management.
    • Privacy. Interventions include “anonymization by filtering of sensitive information and aggregation of data,” and access control mechanisms built around identity management and regulated access.
    • Economic. Interventions include a model where data is shared only with a few trusted organizations, and yield management mechanisms to ensure negative financial consequences are avoided.
    • Data quality. Interventions include identifying additional data sources that could improve the completeness of datasets, and efforts to improve metadata.
    • Technical. Interventions include making data available in structured formats and publishing data according to widely agreed upon data standards.
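
The privacy intervention quoted above ("filtering of sensitive information and aggregation of data") can be illustrated with a minimal sketch. It is not taken from the paper: the field names, the sample records, and the minimum group size are all hypothetical, chosen only to show the two steps in sequence.

```python
from collections import Counter

# Hypothetical names of fields treated as sensitive in this illustration.
SENSITIVE_FIELDS = {"driver_name", "phone"}
# Hypothetical threshold: groups smaller than this are suppressed.
MIN_GROUP_SIZE = 2


def filter_sensitive(record):
    """Return a copy of the record without the sensitive fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def aggregate_by(records, key):
    """Count records per group and drop groups below the minimum size."""
    counts = Counter(r[key] for r in records)
    return {group: n for group, n in counts.items() if n >= MIN_GROUP_SIZE}


# Made-up logistics records, for illustration only.
shipments = [
    {"driver_name": "A", "phone": "000", "region": "north"},
    {"driver_name": "B", "phone": "111", "region": "north"},
    {"driver_name": "C", "phone": "222", "region": "south"},
]

filtered = [filter_sensitive(r) for r in shipments]
summary = aggregate_by(filtered, "region")
print(summary)
```

Only the aggregated `summary` would be shared in this sketch. Suppressing small groups is one common design choice, since a count of one can point back to a single individual.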

Hoffman, Sharona and Podgurski, Andy. “The Use and Misuse of Biomedical Data: Is Bigger Really Better?” American Journal of Law & Medicine 497, 2013. http://bit.ly/1syMS7J.

  • This journal article explores the benefits and, in particular, the risks related to large-scale biomedical databases bringing together health information from a diversity of sources across sectors. Some data collaboratives examined in the piece include:
    • MedMining – a company that extracts EHR data, de-identifies it, and offers it to researchers. The data sets that MedMining delivers to its customers include ‘lab results, vital signs, medications, procedures, diagnoses, lifestyle data, and detailed costs’ from inpatient and outpatient facilities.
    • Explorys has formed a large healthcare database derived from financial, administrative, and medical records. It has partnered with major healthcare organizations such as the Cleveland Clinic Foundation and Summa Health System to aggregate and standardize health information from ten million patients and over thirty billion clinical events.
  • Hoffman and Podgurski note that biomedical databases have many potential uses, with those likely to benefit including: “researchers, regulators, public health officials, commercial entities, lawyers,” as well as “healthcare providers who conduct quality assessment and improvement activities,” regulatory monitoring entities like the FDA, and “litigants in tort cases to develop evidence concerning causation and harm.”
  • They argue, however, that risks arise because:
    • The data contained in biomedical databases is surprisingly likely to be incorrect or incomplete;
    • Systemic biases, arising from both the nature of the data and the preconceptions of investigators, are serious threats to the validity of research results, especially in answering causal questions; and
    • Data mining of biomedical databases makes it easier for individuals with political, social, or economic agendas to generate ostensibly scientific but misleading research findings for the purpose of manipulating public opinion and swaying policymakers.

Krumholz, Harlan M., et al. “Sea Change in Open Science and Data Sharing Leadership by Industry.” Circulation: Cardiovascular Quality and Outcomes 7.4. 2014. 499-504. http://1.usa.gov/1J6q7KJ

  • This article provides a comprehensive overview of industry-led efforts and cross-sector collaborations in data sharing by pharmaceutical companies to inform clinical practice.
  • The article details the types of data being shared and the early activities of GlaxoSmithKline (“in coordination with other companies such as Roche and ViiV”); Medtronic and the Yale University Open Data Access Project; and Janssen Pharmaceuticals (Johnson & Johnson). The article also describes the range of involvement in data sharing among pharmaceutical companies including Pfizer, Novartis, Bayer, AbbVie, Eli Lilly, AstraZeneca, and Bristol-Myers Squibb.

Mann, Gideon. “Private Data and the Public Good.” Medium. May 17, 2016. http://bit.ly/1OgOY68.

  • This Medium post from Gideon Mann, the Head of Data Science at Bloomberg, shares his prepared remarks given at a lecture at the City College of New York. Mann argues for the potential benefits of increasing access to private sector data, both to improve research and academic inquiry and also to help solve practical, real-world problems. He also describes a number of initiatives underway at Bloomberg along these lines.
  • Mann argues that data generated at private companies “could enable amazing discoveries and research,” but is often inaccessible to those who could put it to those uses. Beyond research, he notes that corporate data could, for instance, benefit:
    • Public health – including suicide prevention, addiction counseling and mental health monitoring.
    • Legal and ethical questions – especially as they relate to “the role algorithms have in decisions about our lives,” such as credit checks and resume screening.
  • Mann recognizes the privacy challenges inherent in private sector data sharing, but argues that it is a common misconception that the only two choices are “complete privacy or complete disclosure.” He believes that flexible frameworks for differential privacy could open up new opportunities for responsibly leveraging data collaboratives.
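
To make the middle ground between “complete privacy or complete disclosure” concrete, here is a minimal sketch of the Laplace mechanism, a standard differential privacy technique. It is not drawn from Mann's post: the function names, the sample ages, and the value of `epsilon` are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy the predicate.

    A counting query changes by at most 1 when one person is added
    or removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up data: ages of seven hypothetical individuals.
ages = [23, 35, 41, 29, 52, 67, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
print(noisy)  # the true count is 3; the released value varies per run
```

A smaller `epsilon` adds more noise and gives stronger privacy, while a larger one gives a more accurate answer. The data holder tunes that trade-off rather than choosing between full disclosure and none.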

Pastor Escuredo, D., Morales-Guzmán, A. et al, “Flooding through the Lens of Mobile Phone Activity.” IEEE Global Humanitarian Technology Conference, GHTC 2014. Available from: http://bit.ly/1OzK2bK

  • This report describes the use of mobile data to understand the impact of disasters and improve disaster management. The study was conducted in the Mexican state of Tabasco in 2009 by a multidisciplinary, multi-stakeholder consortium involving the UN World Food Programme (WFP), Telefonica Research, Technical University of Madrid (UPM), Digital Strategy Coordination Office of the President of Mexico, and UN Global Pulse.
  • Telefonica Research, a division of the major Latin American telecommunications company, provided call detail records covering flood-affected areas for nine months. This data was combined with “remote sensing data (satellite images), rainfall data, census and civil protection data.” The results demonstrated that “analysing mobile activity during floods could be used to potentially locate damaged areas, efficiently assess needs and allocate resources (for example, sending supplies to affected areas).”
  • In addition to the results, the study highlighted “the value of a public-private partnership on using mobile data to accurately indicate flooding impacts in Tabasco, thus improving early warning and crisis management.”

Perkmann, M. and Schildt, H. “Open data partnerships between firms and universities: The role of boundary organizations.” Research Policy, 44(5), 2015. http://bit.ly/25RRJ2c

  • This paper discusses the concept of a “boundary organization” in relation to industry-academic partnerships driven by data. Boundary organizations perform mediated revealing, allowing firms to disclose their research problems to a broad audience of innovators and simultaneously minimize the risk that this information would be adversely used by competitors.
  • The authors identify two especially important challenges for private firms to enter open data or participate in data collaboratives with the academic research community that could be addressed through more involvement from boundary organizations:
    • The first is the challenge of maintaining competitive advantage. The authors note that “the more a firm attempts to align the efforts in an open data research programme with its R&D priorities, the more it will have to reveal about the problems it is addressing within its proprietary R&D.”
    • The second involves the misalignment of incentives between the private and academic fields. Perkmann and Schildt argue that a firm seeking to build collaborations around its opened data “will have to provide suitable incentives that are aligned with academic scientists’ desire to be rewarded for their work within their respective communities.”

Robin, N., Klein, T., & Jütting, J. “Public-Private Partnerships for Statistics: Lessons Learned, Future Steps.” OECD. 2016. http://bit.ly/24FLYlD.

  • This working paper acknowledges the growing body of work on how different types of data (e.g., telecom data, social media, sensors and geospatial data) can address data gaps relevant to National Statistical Offices (NSOs).
  • Four models of public-private interaction for statistics are described: in-house production of statistics by a data provider for a national statistical office (NSO), transfer of datasets to NSOs from private entities, transfer of data to a third-party provider that manages the NSO and private entity data, and the outsourcing of NSO functions.
  • The paper highlights challenges to public-private partnerships involving data (e.g., technical challenges, data confidentiality, risks, and limited incentives for participation); suggests that deliberate and highly structured approaches to such partnerships require enforceable contracts; and emphasizes both the trade-off between data specificity and accessibility and the importance of pricing mechanisms that reflect the capacity and capability of national statistical offices.
  • Case studies referenced in the paper include:
    • A mobile network operator’s (MNO Telefonica) in-house analysis of call detail records;
    • A third-party data provider and steward of travel statistics (Positium);
    • The Data for Development (D4D) challenge organized by MNO Orange; and
    • Statistics Netherlands’ use of social media to predict consumer confidence.

Stuart, Elizabeth, Samman, Emma, Avis, William, Berliner, Tom. “The data revolution: finding the missing millions.” Overseas Development Institute, 2015. Available from: http://bit.ly/1bPKOjw

  • The authors of this report highlight the need for good quality, relevant, accessible and timely data for governments to extend services into underrepresented communities and implement policies towards a sustainable “data revolution.”
  • The solutions proposed in this report from the Overseas Development Institute focus on capacity-building activities of national statistical offices (NSOs), alternative sources of data (including shared corporate data) to address gaps, and building strong data management systems.

Taylor, L., & Schroeder, R. “Is bigger better? The emergence of big data as a tool for international development policy.” GeoJournal, 80(4). 2015. 503-518. http://bit.ly/1RZgSy4.

  • This journal article describes how privately held data – namely “digital traces” of consumer activity – “are becoming seen by policymakers and researchers as a potential solution to the lack of reliable statistical data on lower-income countries.”
  • They focus especially on three categories of data collaborative use cases:
    • Mobile data as a predictive tool for issues such as human mobility and economic activity;
    • Use of mobile data to inform humanitarian response to crises; and
    • Use of born-digital web data as a tool for predicting economic trends, and the implications these have for LMICs.
  • They note, however, that a number of challenges and drawbacks exist for these types of use cases, including:
    • Access to private data sources often must be negotiated or bought, “which potentially means substituting negotiations with corporations for those with national statistical offices;”
    • The meaning of such data is not always simple or stable, and local knowledge is needed to understand how people are using the technologies in question;
    • Bias in proprietary data can be hard to understand and quantify;
    • Lack of privacy frameworks; and
    • Power asymmetries, wherein “LMIC citizens are unwittingly placed in a panopticon staffed by international researchers, with no way out and no legal recourse.”

van Panhuis, Willem G., Proma Paul, Claudia Emerson, John Grefenstette, Richard Wilder, Abraham J. Herbst, David Heymann, and Donald S. Burke. “A systematic review of barriers to data sharing in public health.” BMC public health 14, no. 1 (2014): 1144. Available from: http://bit.ly/1JOBruO

  • The authors of this report provide a “systematic literature review of potential barriers to public health data sharing.” These twenty potential barriers are classified in six categories: “technical, motivational, economic, political, legal and ethical.” In this taxonomy, “the first three categories are deeply rooted in well-known challenges of health information systems for which structural solutions have yet to be found; the last three have solutions that lie in an international dialogue aimed at generating consensus on policies and instruments for data sharing.”
  • The authors suggest the need for a “systematic framework of barriers to data sharing in public health” in order to accelerate access and use of data for public good.

Verhulst, Stefaan and Sangokoya, David. “Mapping the Next Frontier of Open Data: Corporate Data Sharing.” In: Gasser, Urs and Zittrain, Jonathan and Faris, Robert and Heacock Jones, Rebekah, “Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse (December 15, 2014).” Berkman Center Research Publication No. 2014-17. http://bit.ly/1GC12a2

  • This essay describes a taxonomy of current corporate data sharing practices for public good: research partnerships; prizes and challenges; trusted intermediaries; application programming interfaces (APIs); intelligence products; and corporate data cooperatives or pooling.
  • Examples of data collaboratives include: Yelp Dataset Challenge, the Digital Ecologies Research Partnership, BBVA Innova Challenge, Telecom Italia’s Big Data Challenge, NIH’s Accelerating Medicines Partnership and the White House’s Climate Data Partnerships.
  • The authors highlight important questions to consider towards a more comprehensive mapping of these activities.

Verhulst, Stefaan and Sangokoya, David, 2015. “Data Collaboratives: Exchanging Data to Improve People’s Lives.” Medium. Available from: http://bit.ly/1JOBDdy

  • The essay refers to data collaboratives as a new form of collaboration involving participants from different sectors exchanging data to help solve public problems. These forms of collaborations can improve people’s lives through data-driven decision-making; information exchange and coordination; and shared standards and frameworks for multi-actor, multi-sector participation.
  • The essay cites four activities that are critical to accelerating data collaboratives: documenting value and measuring impact; matching public demand and corporate supply of data in a trusted way; training and convening data providers and users; experimenting and scaling existing initiatives.
  • Examples of data collaboratives include NIH’s Precision Medicine Initiative; the Mobile Data, Environmental Extremes and Population (MDEEP) Project; and Twitter-MIT’s Laboratory for Social Machines.

Verhulst, Stefaan, Susha, Iryna, Kostura, Alexander. “Data Collaboratives: matching Supply of (Corporate) Data to Solve Public Problems.” Medium. February 24, 2016. http://bit.ly/1ZEp2Sr.

  • This piece articulates a set of key lessons learned during a session at the International Data Responsibility Conference focused on identifying emerging practices, opportunities and challenges confronting data collaboratives.
  • The authors list a number of privately held data sources that could create positive public impacts if made more accessible in a collaborative manner, including:
    • Data for early warning systems to help mitigate the effects of natural disasters;
    • Data to help understand human behavior as it relates to nutrition and livelihoods in developing countries;
    • Data to monitor compliance with weapons treaties;
    • Data to more accurately measure progress related to the UN Sustainable Development Goals.
  • To identify and expand on emerging practice in this space, the authors describe a number of current data collaborative experiments, including:
    • Trusted Intermediaries: Statistics Netherlands partnered with Vodafone to analyze mobile call data records in order to better understand mobility patterns and inform urban planning.
    • Prizes and Challenges: Orange Telecom, which has been a leader in this type of Data Collaboration, provided several examples of the company’s initiatives, such as the use of call data records to track the spread of malaria as well as their experience with Challenge 4 Development.
    • Research partnerships: The Data for Climate Action project is an ongoing large-scale initiative incentivizing companies to share their data to help researchers answer particular scientific questions related to climate change and adaptation.
    • Sharing intelligence products: JPMorgan Chase shares macroeconomic insights gained by leveraging its data through the newly established JPMorgan Chase Institute.
  • In order to capitalize on the opportunities provided by data collaboratives, a number of needs were identified:
    • A responsible data framework;
    • Increased insight into different business models that may facilitate the sharing of data;
    • Capacity to tap into the potential value of data;
    • Transparent stock of available data supply; and
    • Mapping emerging practices and models of sharing.

Vogel, N., Theisen, C., Leidig, J. P., Scripps, J., Graham, D. H., & Wolffe, G. “Mining mobile datasets to enable the fine-grained stochastic simulation of Ebola diffusion.” Paper presented at the Procedia Computer Science. 2015. http://bit.ly/1TZDroF.

  • The paper presents a research study based on the mobile call records that the mobile operator Orange shared with researchers in the framework of the Data for Development Challenge.
  • The study discusses the data analysis approach in relation to developing a simulation of Ebola diffusion built around “the interactions of multi-scale models, including viral loads (at the cellular level), disease progression (at the individual person level), disease propagation (at the workplace and family level), societal changes in migration and travel movements (at the population level), and mitigating interventions (at the abstract government policy level).”
  • The authors argue that the use of their population, mobility, and simulation models provides more accurate simulation details in comparison to high-level analytical predictions and that the D4D mobile datasets provide high-resolution information useful for modeling developing regions and hard-to-reach locations.

Welle Donker, F., van Loenen, B., & Bregt, A. K. “Open Data and Beyond.” ISPRS International Journal of Geo-Information, 5(4). 2016. http://bit.ly/22YtugY.

  • This research develops a monitoring framework to assess the effects of open (private) data, using a case study of the Dutch energy network administrator Liander.
  • Focusing on the potential impacts of open private energy data – beyond ‘smart disclosure’ where citizens are given information only about their own energy usage – the authors identify three attainable strategic goals:
    • Continuously optimize performance on services, security of supply, and costs;
    • Improve management of energy flows and insight into energy consumption;
    • Help customers save energy and switch over to renewable energy sources.
  • The authors propose a seven-step framework for assessing the impacts of Liander data, in particular, and open private data more generally:
    • Develop a performance framework to describe what the program is about, including a description of the organization’s mission and strategic goals;
    • Identify the most important elements, or key performance areas which are most critical to understanding and assessing your program’s success;
    • Select the most appropriate performance measures;
    • Determine the gaps between what information you need and what is available;
    • Develop and implement a measurement strategy to address the gaps;
    • Develop a performance report which highlights what you have accomplished and what you have learned;
    • Learn from your experiences and refine your approach as required.
  • While the authors note that the true impacts of this open private data will likely not come into view in the short term, they argue that, “Liander has successfully demonstrated that private energy companies can release open data, and has successfully championed the other Dutch network administrators to follow suit.”

World Economic Forum, 2015. “Data-driven development: pathways for progress.” Geneva: World Economic Forum. http://bit.ly/1JOBS8u

  • This report captures an overview of the existing data deficit and the value and impact of big data for sustainable development.
  • The authors of the report focus on four main priorities towards a sustainable data revolution: commercial incentives and trusted agreements with public- and private-sector actors; the development of shared policy frameworks, legal protections and impact assessments; capacity building activities at the institutional, community, local and individual level; and lastly, recognizing individuals as both producers and consumers of data.