Mapping the Next Frontier of Open Data: Corporate Data Sharing


Stefaan Verhulst at the GovLab (cross-posted at the UN Global Pulse Blog): “When it comes to data, we are living in the Cambrian Age. About ninety percent of the data that exists today has been generated within the last two years. We create 2.5 quintillion bytes of data on a daily basis—equivalent to a “new Google every four days.”
All of this means that we are certain to witness a rapid intensification of “datafication,” a process already well underway. Use of data will become increasingly critical. Data will confer strategic advantages; it will become essential to addressing many of our most important social, economic and political challenges.
This explains, at least in large part, why the Open Data movement has grown so rapidly in recent years. More and more, it has become evident that questions surrounding data access and use represent one of the transformational opportunities of our time.
Today, it is estimated that over one million datasets have been made open or public. The vast majority of this open data is government data—information collected by agencies and departments in countries as varied as India, Uganda and the United States. But what of the terabyte after terabyte of data that is collected and stored by corporations? This data is also quite valuable, but it has been harder to access.
The topic of private sector data sharing was the focus of a recent conference organized by the Responsible Data Forum, Data and Society Research Institute and Global Pulse (see event summary). Participants at the conference, which was hosted by The Rockefeller Foundation in New York City, came from a variety of sectors and converged to discuss ways to improve access to private data: the data held by private entities and corporations. The push for access is rooted in a broad recognition that private data has the potential to foster much public good. At the same time, a variety of constraints—notably privacy and security, but also proprietary interests and data protectionism on the part of some companies—hold back this potential.
Issues surrounding the sharing of private data have been broadly framed under the rubric of “corporate data philanthropy.” The term refers to an emerging trend whereby companies have started sharing anonymized and aggregated data with third-party users who can then look for patterns or otherwise analyze the data in ways that lead to policy insights and other public good. The term was coined at the World Economic Forum meeting in Davos, in 2011, and has gained wider currency through Global Pulse, a United Nations data project that has popularized the notion of a global “data commons.”
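In practice, “anonymized and aggregated” sharing usually means releasing group-level statistics rather than raw records. The sketch below is purely illustrative (the records, regions, and threshold are invented, not drawn from any program described here): it aggregates toy subscriber records per region and withholds any group smaller than k subscribers, a common minimum-group-size safeguard against re-identification.

```python
from collections import defaultdict

# Toy "call detail records": (subscriber_id, region, calls_made).
records = [
    ("u1", "north", 12), ("u2", "north", 7), ("u3", "north", 21),
    ("u4", "north", 3), ("u5", "north", 9),
    ("u6", "south", 15), ("u7", "south", 4),
]

def aggregate_for_release(records, k=5):
    """Aggregate calls per region, suppressing groups of fewer than k people."""
    groups = defaultdict(list)
    for _, region, calls in records:
        groups[region].append(calls)
    return {
        region: {"subscribers": len(vals), "total_calls": sum(vals)}
        for region, vals in groups.items()
        if len(vals) >= k  # small groups risk re-identification, so withhold
    }

released = aggregate_for_release(records)  # "north" released, "south" suppressed
```

Thresholds like this are only a first line of defense; real sharing programs layer on legal agreements and stronger statistical protections.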
Although still far from prevalent, some examples of corporate data sharing exist….

Help us map the field

A more comprehensive mapping of the field of corporate data sharing would draw on a wide range of case studies and examples to identify opportunities and gaps, and to inspire more corporations to allow access to their data (consider, for instance, the GovLab Open Data 500 mapping for open government data). From a research point of view, the following questions would be important to ask:

  • What types of data sharing have proven most successful, and which ones least?
  • Who are the users of corporate shared data, and for what purposes?
  • What conditions encourage companies to share, and what are the concerns that prevent sharing?
  • What incentives can be created (economic, regulatory, etc.) to encourage corporate data philanthropy?
  • What differences (if any) exist between shared government data and shared private sector data?
  • What steps need to be taken to minimize potential harms (e.g., to privacy and security) when sharing data?
  • What’s the value created from using shared private data?

We (the GovLab; Global Pulse; and Data & Society) welcome your input to add to this list of questions, or to help us answer them by providing case studies and examples of corporate data philanthropy. Please add your examples below, use our Google Form or email them to us at corporatedata@thegovlab.org”

The Stasi, casinos and the Big Data rush


Book Review by Hannah Kuchler of “What Stays in Vegas” (by Adam Tanner) in the Financial Times: “Books with sexy titles and decidedly unsexy topics – like, say, data – have a tendency to disappoint. But What Stays in Vegas is an engrossing, story-packed takedown of the data industry.

It begins, far from America’s gambling capital, in communist East Germany. The author, Adam Tanner, now a fellow at Harvard’s Institute for Quantitative Social Science, was in the late 1980s a travel writer taking notes on Dresden. What he did not realise was that the Stasi was busy taking notes on him – 50 pages in all – which he found when the files were opened after reunification. The secret police knew where he had stopped to consult a map, to whom he asked questions and when he looked in on a hotel.
Today, Tanner explains: “Thanks to meticulous data gathering from both public documents and commercial records, companies . . . know far more about typical consumers than the feared East German secret police recorded about me.”
Shining a light on how businesses outside the tech sector have become data addicts, Tanner focuses on Las Vegas casinos, which spotted the value in data decades ago. He was given access to Caesars Entertainment, one of the world’s largest casino operators. When chief executive Gary Loveman joined in the late 1990s, the former Harvard Business School professor bet the company’s future on harvesting personal data from its loyalty scheme. Rather than wooing the “whales” who spent the most, the company would use the data to decide which freebies were worth giving away to lure in mid-spenders who came back often – a strategy credited with helping the business grow.
The real revelations come when Tanner examines the data brokers’ “Cheez Whiz”. Like the maker of a popular processed dairy spread, he argues, data brokers blend ingredients from a range of sources, such as public records, marketing lists and commercial records, to create a detailed picture of your identity – and you will never quite be able to pin down the origin of any component…
The Big Data rush has gone into overdrive since the global economic crisis as marketers from different industries have sought new methods to grab the limited consumer spending available. Tanner argues that while users have in theory given permission for much of this information to be made public in bits and pieces, its increasingly industrial-scale aggregation often feels like an invasion of privacy.
Privacy policies are so long and obtuse (one study Tanner quotes found that it would take a person more than a month, working full-time, to read all the privacy statements they come across in a year) that people are unwittingly littering their data all over the internet. Anyway, marketers can intuit what we are like from the people we are connected to online. And as the data brokers’ lists are usually private, there is no way to check the compilers have got their facts right…”
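The month-of-reading claim is easy to sanity-check with back-of-the-envelope arithmetic. The inputs below are assumptions, roughly in line with McDonald and Cranor’s widely cited 2008 estimate rather than figures taken from Tanner’s book:

```python
# Back-of-the-envelope check of the "more than a month" claim.
policies_per_year = 1462   # assumed: distinct privacy policies encountered annually
minutes_per_policy = 10    # assumed: average time to read one policy
hours_per_workday = 8

total_hours = policies_per_year * minutes_per_policy / 60   # ~244 hours
workdays = total_hours / hours_per_workday                  # ~30 workdays
# About 30 eight-hour workdays: well over a month of full-time reading.
```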

Citizen Science: The Law and Ethics of Public Access to Medical Big Data


New Paper by Sharona Hoffman: “Patient-related medical information is becoming increasingly available on the Internet, spurred by government open data policies and private sector data sharing initiatives. Websites such as HealthData.gov, GenBank, and PatientsLikeMe allow members of the public to access a wealth of health information. As the medical information terrain quickly changes, the legal system must not lag behind. This Article provides a base on which to build a coherent data policy. It canvasses emergent data troves and wrestles with their legal and ethical ramifications.
Publicly accessible medical data have the potential to yield numerous benefits, including scientific discoveries, cost savings, the development of patient support tools, healthcare quality improvement, greater government transparency, public education, and positive changes in healthcare policy. At the same time, the availability of electronic personal health information that can be mined by any Internet user raises concerns related to privacy, discrimination, erroneous research findings, and litigation. This Article analyzes the benefits and risks of health data sharing and proposes balanced legislative, regulatory, and policy modifications to guide data disclosure and use.”

The Crypto-democracy and the Trustworthy


New Paper by Sebastien Gambs, Samuel Ranellucci, and Alain Tapp: “In the current architecture of the Internet, there is a strong asymmetry in terms of power between the entities that gather and process personal data (e.g., major Internet companies, telecom operators, cloud providers, …) and the individuals from which this personal data is issued. In particular, individuals have no choice but to blindly trust that these entities will respect their privacy and protect their personal data. In this position paper, we address this issue by proposing a utopian crypto-democracy model based on existing scientific achievements from the field of cryptography. More precisely, our main objective is to show that cryptographic primitives, including in particular secure multiparty computation, offer a practical solution to protect privacy while minimizing the trust assumptions. In the crypto-democracy envisioned, individuals do not have to trust a single physical entity with their personal data but rather their data is distributed among several institutions. Together these institutions form a virtual entity called the Trustworthy that is responsible for the storage of this data but which can also compute on it (provided first that all the institutions agree on this). Finally, we also propose a realistic proof-of-concept of the Trustworthy, in which the roles of institutions are played by universities. This proof-of-concept would have an important impact in demonstrating the possibilities offered by the crypto-democracy paradigm.”
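The paper’s central idea, that no single institution ever holds the data, can be illustrated with additive secret sharing, one of the building blocks of secure multiparty computation. This is a minimal sketch under assumed parameters (the modulus, the three “institutions,” and the salary figures are all invented for illustration), not the authors’ protocol:

```python
import random

PRIME = 2**61 - 1  # public modulus; any prime larger than the secrets works

def share(secret, n):
    """Split a secret into n additive shares modulo PRIME."""
    parts = [random.randrange(PRIME) for _ in range(n - 1)]
    parts.append((secret - sum(parts)) % PRIME)
    return parts

def reconstruct(parts):
    """Recombine shares; any n-1 of them alone reveal nothing."""
    return sum(parts) % PRIME

# Three "institutions" each hold one share of a sensitive value.
salary = 85_000
shares = share(salary, 3)
assert reconstruct(shares) == salary

# Additive homomorphism: summing shares column-wise computes an aggregate
# without any institution ever seeing an individual value.
salaries = [85_000, 60_000, 120_000]
all_shares = [share(s, 3) for s in salaries]
local_sums = [sum(col) % PRIME for col in zip(*all_shares)]
assert reconstruct(local_sums) == sum(salaries)
```

Production systems would add verifiability and malicious-party protections on top of this basic scheme.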

“From Bitcoin to Burning Man and Beyond”


IDCubed: “From Bitcoin to Burning Man and Beyond: The Quest for Autonomy and Identity in a Digital Society explores a new generation of digital technologies that are re-imagining the very foundations of identity, governance, trust and social organization.
The fifteen essays of this book stake out the foundations of a new future – a future of open Web standards and data commons, a society of decentralized autonomous organizations, a world of trustworthy digital currencies and self-organized and expressive communities like Burning Man.
Among the contributors are Alex “Sandy” Pentland of the M.I.T. Human Dynamics Laboratory, former FCC Chairman Reed E. Hundt, long-time IBM strategist Irving Wladawsky-Berger, monetary system expert Bernard Lietaer, Silicon Valley entrepreneur Peter Hirshberg, journalist Jonathan Ledgard and H-Farm cofounder Maurizio Rossi.
From Bitcoin to Burning Man and Beyond was edited by Dr. John H. Clippinger, cofounder and executive director of ID3, and David Bollier, an Editor at ID3 who is also an author, blogger and scholar who studies the commons. The book, published by ID3 in association with Off the Common Books, reflects ID3’s vision of the huge, untapped potential for self-organized, distributed governance on open platforms.
The book is available in print and ebook formats (Kindle and epub) from Amazon.com and Off the Common Books. The book, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike license (BY-NC-SA), may also be downloaded for free as a pdf file from ID3.
One chapter that inspires the book’s title traces the 28-year history of Burning Man, the week-long encampment in the Nevada desert that has hosted remarkable experimentation in new forms of self-governance by large communities. Other chapters explore such cutting-edge concepts as

  • evolvable digital contracts that could supplant conventional legal agreements;
  • smartphone currencies that could help Africans meet their economic needs more effectively;
  • the growth of the commodity-backed Ven currency; and
  • new types of “solar currencies” that borrow techniques from Bitcoin to enable more efficient, cost-effective solar generation and sharing by homeowners.

From Bitcoin to Burning Man and Beyond also introduces the path-breaking software platform that ID3 has developed called “Open Mustard Seed,” or OMS. The just-released open source program enables the rise of new types of trusted, self-healing digital institutions on open networks, which in turn will make possible new sorts of privacy-friendly social ecosystems.
“OMS is an integrated, open source package of programs that lets people collect and share personal information in secure, and transparent and accountable ways, enabling authentic, trusted social and economic relationships to flourish,” said Dr. John H. Clippinger, executive director of ID3, an acronym for the Institute for Institutional Innovation and Data-Driven Design.
“The software builds individual privacy, security and trusted exchange into the very design of the system. In effect, OMS represents a new authentication, privacy and sharing layer for the Internet,” said Clippinger, “a new way to share personal information selectively and securely, without access by unauthorized third parties.”
A two-minute video introducing the capabilities of OMS can be viewed here.”

The Changing Nature of Privacy Practice


Numerous commenters have observed that Facebook, among many marketers (including political campaigns like U.S. President Barack Obama’s), regularly conducts A-B tests and other research to measure how consumers respond to different products, messages and messengers. So what makes the Facebook-Cornell study different from what goes on all the time in an increasingly data-driven world? After all, the ability to conduct such testing continuously on a large scale is considered one of the special features of big data.
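For context, the statistical core of such a test is small. Below is a hedged sketch, with made-up conversion counts, of the two-proportion z-test commonly used to decide whether variant B outperforms variant A:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B with a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 10,000 users per arm, 4.8% vs. 5.6% conversion.
z, p = two_proportion_ztest(480, 10_000, 560, 10_000)
significant = p < 0.05
```

At big-data scale the mechanics are the same; what changes, as the Facebook-Cornell episode shows, is the ethical weight of running such experiments continuously on unwitting users.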
The answer calls for broader judgments than parsing the language of privacy policies or managing compliance with privacy laws and regulations. Existing legal tools such as notice-and-choice and use limitations are simply too narrow to address the array of issues presented and inform the judgment needed. Deciding whether Facebook ought to participate in research like its newsfeed study is not really about what the company can do but what it should do.
As Omer Tene and Jules Polonetsky, CIPP/US, point out in an article on Facebook’s research study, “Increasingly, corporate officers find themselves struggling to decipher subtle social norms and make ethical choices that are more befitting of philosophers than business managers or lawyers.” They add, “Going forward, companies will need to create new processes, deploying a toolbox of innovative solutions to engender trust and mitigate normative friction.” Tene and Polonetsky themselves have proposed a number of such tools. In recent comments on Consumer Privacy Bill of Rights legislation filed with the Commerce Department, the Future of Privacy Forum (FPF) endorsed the use of internal review boards along the lines of those used in academia for human-subject research. The FPF also submitted an initial framework for benefit-risk analysis in the big data context “to understand whether assuming the risk is ethical, fair, legitimate and cost-effective.” Increasingly, companies and other institutions are bringing to bear more holistic review of privacy issues. Conferences and panels on big data research ethics are proliferating.
The expanding variety and complexity of data uses also call for a broader public policy approach. The Obama administration’s Consumer Privacy Bill of Rights (of which I was an architect) adapted existing Fair Information Practice Principles to a principles-based approach that is intended not as a formalistic checklist but as a set of principles that work holistically in ways that are “flexible” and “dynamic.” In turn, much of the commentary submitted to the Commerce Department on the Consumer Privacy Bill of Rights addressed the question of the relationship between these principles and a “responsible use framework” as discussed in the White House Big Data Report….”

Not just the government’s playbook


at Radar: “Whenever I hear someone say that “government should be run like a business,” my first reaction is “do you know how badly most businesses are run?” Seriously. I do not want my government to run like a business — whether it’s like the local restaurants that pop up and die like wildflowers, or megacorporations that sell broken products, whether financial, automotive, or otherwise.
If you read some elements of the press, it’s easy to think that healthcare.gov was the first website ever to fail. And it’s easy to forget that a large non-government website was failing, in surprisingly similar ways, at roughly the same time. I’m talking about the Common App, the site high school seniors use to apply to most colleges in the US. There were problems with pasting in essays, problems with accepting payments, problems with the app mysteriously hanging for hours, and more.
 
I don’t mean to pick on Common App; you’ve no doubt had your own experience with woefully bad online services: insurance companies, Internet providers, even online shopping. I’ve seen my doctor swear at the Epic electronic medical records application when it crashed repeatedly during an appointment. So, yes, the government builds bad software. So does private enterprise. All the time. According to TechRepublic, 68% of all software projects fail. We can debate why, and we can even debate the numbers, but there’s clearly a lot of software #fail out there — in industry, in non-profits, and yes, in government.
With that in mind, it’s worth looking at the U.S. CIO’s Digital Services Playbook. It’s not ideal, and in many respects, its flaws reveal its origins. But it’s pretty good, and should certainly serve as a model, not just for the government, but for any organization, small or large, that is building an online presence.
The playbook consists of 13 principles (called “plays”) that drive modern software development:

  • Understand what people need
  • Address the whole experience, from start to finish
  • Make it simple and intuitive
  • Build the service using agile and iterative practices
  • Structure budgets and contracts to support delivery
  • Assign one leader and hold that person accountable
  • Bring in experienced teams
  • Choose a modern technology stack
  • Deploy in a flexible hosting environment
  • Automate testing and deployments
  • Manage security and privacy through reusable processes
  • Use data to drive decisions
  • Default to open

These aren’t abstract principles: most of them should be familiar to anyone who has read about agile software development, attended one of our Velocity conferences, one of the many DevOps Days, or a similar event. All of the principles are worth reading (it’s not a long document). I’m going to call out two for special attention….”

Reddit, Imgur and Twitch team up as 'Derp' for social data research


in The Guardian: “Academic researchers will be granted unprecedented access to the data of major social networks including Imgur, Reddit, and Twitch as part of a joint initiative: The Digital Ecologies Research Partnership (Derp).
Derp – and yes, that really is its name – will be offering data to universities including Harvard, MIT and McGill, to promote “open, publicly accessible, and ethical academic inquiry into the vibrant social dynamics of the web”.
It came about “as a result of Imgur talking with a number of other community platforms online trying to learn about how they work with academic researchers,” says Tim Hwang, the image-sharing site’s head of special initiatives.
“In most cases, the data provided through Derp will already be accessible through public APIs,” he says. “Our belief is that there are ways of doing research better, and in a way that strongly respects user privacy and responsible use of data.
“Derp is an alliance of platforms that all believe strongly in this. In working with academic researchers, we support projects that meet institutional review at their home institution, and all research supported by Derp will be released openly and made publicly available.”
Hwang points to a Stanford paper analysing the success of Reddit’s Random Acts of Pizza subforum as an example of the sort of research Derp hopes to foster. In the research, Tim Althoff, Niloufar Salehi and Tuan Nguyen found that the likelihood of getting a free pizza from the Reddit community depended on a number of factors, including how the request was phrased, how much the user posted on the site, and how many friends they had online. In the end, they were able to predict with 67% accuracy whether or not a given request would be fulfilled.
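The prediction task described is a standard supervised-learning setup. The sketch below trains a logistic regression by gradient descent on synthetic data; the feature names merely echo the kinds of factors the study reports (politeness, activity, connections), and none of the data or weights come from the paper:

```python
import math
import random

random.seed(0)

# Synthetic stand-ins for the kinds of features the study examined:
# [politeness score, activity on the site, online connections].
# All data here is generated for illustration, not taken from the paper.
def make_example():
    x = [random.random(), random.random(), random.random()]
    logit = 3 * x[0] + 2 * x[1] + x[2] - 3  # fulfilled requests skew this way
    label = 1 if random.random() < 1 / (1 + math.exp(-logit)) else 0
    return x, label

data = [make_example() for _ in range(1500)]

def predict(w, b, x):
    """Probability that a request is fulfilled, under the learned model."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Plain stochastic gradient descent on the logistic loss.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(100):
    for x, y in data:
        g = predict(w, b, x) - y
        w = [wi - 0.05 * g * xi for wi, xi in zip(w, x)]
        b -= 0.05 * g

accuracy = sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in data) / len(data)
```

Because the labels are drawn stochastically, even a perfect model cannot reach 100% accuracy, which is one reason figures like the study’s 67% are respectable.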
The grouping aims to solve two problems academic research faces. Researchers themselves find it hard to get data outside of the largest social media platforms, such as Twitter and Facebook. The major services at least have a vibrant community of developers and researchers working on ways to access and use data, but for smaller communities, there’s little help provided.
Yet smaller is relative: Reddit may be a shrimp compared to Facebook, but with 115 million unique visitors every month, it’s still a sizeable community. And so Derp aims to offer “a single point of contact for researchers to get in touch with relevant team members across a range of different community sites….”

Reality Mining: Using Big Data to Engineer a Better World


New book by Nathan Eagle and Kate Greene: “Big Data is made up of lots of little data: numbers entered into cell phones, addresses entered into GPS devices, visits to websites, online purchases, ATM transactions, and any other activity that leaves a digital trail. Although the abuse of Big Data—surveillance, spying, hacking—has made headlines, it shouldn’t overshadow the abundant positive applications of Big Data. In Reality Mining, Nathan Eagle and Kate Greene cut through the hype and the headlines to explore the positive potential of Big Data, showing the ways in which the analysis of Big Data (“Reality Mining”) can be used to improve human systems as varied as political polling and disease tracking, while considering user privacy.

Eagle, a recognized expert in the field, and Greene, an experienced technology journalist, describe Reality Mining at five different levels: the individual, the neighborhood and organization, the city, the nation, and the world. For each level, they first offer a nontechnical explanation of data collection methods and then describe applications and systems that have been or could be built. These include a mobile app that helps smokers quit smoking; a workplace “knowledge system”; the use of GPS, Wi-Fi, and mobile phone data to manage and predict traffic flows; and the analysis of social media to track the spread of disease. Eagle and Greene argue that Big Data, used respectfully and responsibly, can help people live better, healthier, and happier lives.”

Digital Footprints: Opportunities and Challenges for Online Social Research


Paper by Golder, Scott A. and Macy, Michael for the Annual Review of Sociology: “Online interaction is now a regular part of daily life for a demographically diverse population of hundreds of millions of people worldwide. These interactions generate fine-grained time-stamped records of human behavior and social interaction at the level of individual events, yet are global in scale, allowing researchers to address fundamental questions about social identity, status, conflict, cooperation, collective action, and diffusion, both by using observational data and by conducting in vivo field experiments. This unprecedented opportunity comes with a number of methodological challenges, including generalizing observations to the offline world, protecting individual privacy, and solving the logistical challenges posed by “big data” and web-based experiments. We review current advances in online social research and critically assess the theoretical and methodological opportunities and limitations. [J]ust as the invention of the telescope revolutionized the study of the heavens, so too by rendering the unmeasurable measurable, the technological revolution in mobile, Web, and Internet communications has the potential to revolutionize our understanding of ourselves and how we interact…. [T]hree hundred years after Alexander Pope argued that the proper study of mankind should lie not in the heavens but in ourselves, we have finally found our telescope. Let the revolution begin. —Duncan Watts”