Critics allege big data can be discriminatory, but is it really bias?


Pradip Sigdyal at CNBC: “…The often cited case of big data discrimination points to a research conducted few years ago by Latanya Sweeny, who heads the Data Privacy Lab at Harvard University.

The case involves Google ad results when searching for certain kinds of names on the internet. In her research, Sweeney found that distinct sounding names often associated with blacks showed up with a disproportionately higher number of arrest record ads compared to white sounding names by roughly 18 percent of the time. Google has since fixed the issue, although they never publicly stated what they did to correct the problem.

The proliferation of big data in the last few years has seen other allegations of improper use and bias. These allegations run the gamut, from online price discrimination and consequences of geographic targeting to the controversial use of crime predicting technology by law enforcement, and lack of sufficient representative[data] sampleused in some public works decisions.

The benefits of big data need to be balanced with the risks associated with applying modern technologies to address societal issues. Yet data advocates believe that democratization of data has in essence givenpower to the people to affect change by transferring ‘tribal knowledge’ from experts to data-savvy practitioners.

Big data is here to stay

According to some advocates, the problem is not so much that ‘big data discriminates’, but that failures by data professionals risk misinterpreting the findings at the heart of data mining and statistical learning. They add that the benefits far outweigh the concerns.

“In my academic research and industry consulting, I have seen tremendous benefits accruing to firms, organizations and consumers alike from the use of data-driven decision-making, data science, and business analytics,” Anindya Ghose, the director of Center for Business Analytics at New York University’s Stern School of Business, said.

“To be perfectly honest, I do not at all understand these big-data cynics who engage in fear mongering about the implications of data analytics,” Ghose said.

“Here is my message to the cynics and those who keep cautioning us: ‘Deal with it, big data analytics is here to stay forever’.”…(More)”

OSoMe: The IUNI observatory on social media


Clayton A Davis et al at Peer J. PrePrint:  “The study of social phenomena is becoming increasingly reliant on big data from online social networks. Broad access to social media data, however, requires software development skills that not all researchers possess. Here we present the IUNI Observatory on Social Media, an open analytics platform designed to facilitate computational social science. The system leverages a historical, ongoing collection of over 70 billion public messages from Twitter. We illustrate a number of interactive open-source tools to retrieve, visualize, and analyze derived data from this collection. The Observatory, now available at osome.iuni.iu.edu, is the result of a large, six-year collaborative effort coordinated by the Indiana University Network Science Institute.”…(More)”

What’s Wrong with Open-Data Sites–and How We Can Fix Them


César A. Hidalgo at Scientific American: “Imagine shopping in a supermarket where every item is stored in boxes that look exactly the same. Some are filled with cereal, others with apples, and others with shampoo. Shopping would be an absolute nightmare! The design of most open data sites—the (usually government) sites that distribute census, economic and other data to be used and redistributed freely—is not exactly equivalent to this nightmarish supermarket. But it’s pretty close.

During the last decade, such sites—data.gov, data.gov.uk, data.gob.cl,data.gouv.fr, and many others—have been created throughout the world. Most of them, however, still deliver data as sets of links to tables, or links to other sites that are also hard to comprehend. In the best cases, data is delivered through APIs, or application program interfaces, which are simple data query languages that require a user to have a basic knowledge of programming. So understanding what is inside each dataset requires downloading, opening, and exploring the set in ways that are extremely taxing for users. The analogy of the nightmarish supermarket is not that far off.

THE U.S. GOVERNMENT’S OPEN DATA SITE

The consensus among those who have participated in the creation of open data sites is that current efforts have failed and we need new options. Pointing your browser to these sites should show you why. Most open data sites are badly designed, and here I am not talking about their aesthetics—which are also subpar—but about the conceptual model used to organize and deliver data to users. The design of most open data sites follows a throwing-spaghetti-against-the-wall strategy, where opening more data, instead of opening data better, has been the driving force.

Some of the design flaws of current open data sites are pretty obvious. The datasets that are more important, or could potentially be more useful, are not brought into the surface of these sites or are properly organized. In our supermarket analogy, not only all boxes look the same, but also they are sorted in the order they came. This cannot be the best we can do.

There are other design problems that are important, even though they are less obvious. The first one is that most sites deliver data in the way in which it is collected, instead of used. People are often looking for data about a particular place, occupation, industry, or about an indicator (such as income, or population). If the data they need comes from the national survey of X, or the bureau of Y, it is secondary and often—although not always—irrelevant to the user. Yet, even though this is not the way we should be giving data back to users, this is often what open data sites do.

The second non-obvious design problem, which is probably the most important, is that most open data sites bury data in what is known as the deep web. The deep web is the fraction of the Internet that is not accessible to search engines, or that cannot be indexed properly. The surface of the web is made of text, pictures, and video, which search engines know how to index. But search engines are not good at knowing that the number that you are searching for is hidden in row 17,354 of a comma separated file that is inside a zip file linked in a poorly described page of an open data site. In some cases, pressing a radio button and selecting options from a number of dropdown menus can get you the desired number, but this does not help search engines either, because crawlers cannot explore dropdown menus. To make open data really open, we need to make it searchable, and for that we need to bring data to the surface of the web.

So how do we that? The solution may not be simple, but it starts by taking design seriously. This is something that I’ve been doing for more than half a decade when creating data visualization engines at MIT. The latest iteration of our design principles are now embodied in DataUSA, a site we created in a collaboration between Deloitte, Datawheel, and my group at MIT.

So what is design, and how do we use it to improve open data sites? My definition of design is simple. Design is discovering the forms that best fulfill a function….(More)”

A Framework for Understanding Data Risk


Sarah Telford and Stefaan G. Verhulst at Understanding Risk Forum: “….In creating the policy, OCHA partnered with the NYU Governance Lab (GovLab) and Leiden University to understand the policy and privacy landscape, best practices of partner organizations, and how to assess the data it manages in terms of potential harm to people.

We seek to share our findings with the UR community to get feedback and start a conversation around the risk to using certain types of data in humanitarian and development efforts and when understanding risk.

What is High-Risk Data?

High-risk data is generally understood as data that includes attributes about individuals. This is commonly referred to as PII or personally identifiable information. Data can also create risk when it identifies communities or demographics within a group and ties them to a place (i.e., women of a certain age group in a specific location). The risk comes when this type of data is collected and shared without proper authorization from the individual or the organization acting as the data steward; or when the data is being used for purposes other than what was initially stated during collection.

The potential harms of inappropriately collecting, storing or sharing personal data can affect individuals and communities that may feel exploited or vulnerable as the result of how data is used. This became apparent during the Ebola outbreak of 2014, when a number of data projects were implemented without appropriate risk management measures. One notable example was the collection and use of aggregated call data records (CDRs) to monitor the spread of Ebola, which not only had limited success in controlling the virus, but also compromised the personal information of those in Ebola-affected countries. (See Ebola: A Big Data Disaster).

A Data-Risk Framework

Regardless of an organization’s data requirements, it is useful to think through the potential risks and harms for its collection, storage and use. Together with the Harvard Humanitarian Initiative, we have set up a four-step data risk process that includes doing an assessment and inventory, understanding risks and harms, and taking measures to counter them.

  1. Assessment – The first step is to understand the context within which the data is being generated and shared. The key questions to ask include: What is the anticipated benefit of using the data? Who has access to the data? What constitutes the actionable information for a potential perpetrator? What could set off the threat to the data being used inappropriately?
  1. Data Inventory – The second step is to take inventory of the data and how it is being stored. Key questions include: Where is the data – is it stored locally or hosted by a third party? Where could the data be housed later? Who might gain access to the data in the future? How will we know – is data access being monitored?
  1. Risks and Harms – The next step is to identify potential ways in which risk might materialize. Thinking through various risk-producing scenarios will help prepare staff for incidents. Examples of risks include: your organization’s data being correlated with other data sources to expose individuals; your organization’s raw data being publicly released; and/or your organization’s data system being maliciously breached.
  1. Counter-Measures – The next step is to determine what measures would prevent risk from materializing. Methods and tools include developing data handling policies, implementing access controls to the data, and training staff on how to use data responsibly….(More)

Big Risks, Big Opportunities: the Intersection of Big Data and Civil Rights


Latest White House report on Big Data charts pathways for fairness and opportunity but also cautions against re-encoding bias and discrimination into algorithmic systems: ” Advertisements tailored to reflect previous purchasing decisions; targeted job postings based on your degree and social networks; reams of data informing predictions around college admissions and financial aid. Need a loan? There’s an app for that.

As technology advances and our economic, social, and civic lives become increasingly digital, we are faced with ethical questions of great consequence. Big data and associated technologies create enormous new opportunities to revisit assumptions and instead make data-driven decisions. Properly harnessed, big data can be a tool for overcoming longstanding bias and rooting out discrimination.

The era of big data is also full of risk. The algorithmic systems that turn data into information are not infallible—they rely on the imperfect inputs, logic, probability, and people who design them. Predictors of success can become barriers to entry; careful marketing can be rooted in stereotype. Without deliberate care, these innovations can easily hardwire discrimination, reinforce bias, and mask opportunity.

Because technological innovation presents both great opportunity and great risk, the White House has released several reports on “big data” intended to prompt conversation and advance these important issues. The topics of previous reports on data analytics included privacy, prices in the marketplace, and consumer protection laws. Today, we are announcing the latest report on big data, one centered on algorithmic systems, opportunity, and civil rights.

The first big data report warned of “the potential of encoding discrimination in automated decisions”—that is, discrimination may “be the inadvertent outcome of the way big data technologies are structured and used.” A commitment to understanding these risks and harnessing technology for good prompted us to specifically examine the intersection between big data and civil rights.

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The purpose of the report is not to offer remedies to the issues it raises, but rather to identify these issues and prompt conversation, research—and action—among technologists, academics, policy makers, and citizens, alike.

The report includes a number of recommendations for advancing work in this nascent field of data and ethics. These include investing in research, broadening and diversifying technical leadership, cross-training, and expanded literacy on data discrimination, bolstering accountability, and creating standards for use within both the government and the private sector. It also calls on computer and data science programs and professionals to promote fairness and opportunity as part of an overall commitment to the responsible and ethical use of data.

Big data is here to stay; the question is how it will be used: to advance civil rights and opportunity, or to undermine them….(More)”

Beyond the Digital Divide: Towards a Situated Approach to Open Data


Paper by Bezuidenhout, L, Rappert, B, Kelly, A and Leonelli, S: “Poor provision of information and communication technologies in low/middle-income countries represents a concern for promoting Open Data. This is often framed as a “digital divide” and addressed through initiatives that increase the availability of information and communication technologies to researchers based in low-resourced environments, as well as the amount of resources freely accessible online, including data themselves. Using empirical data from a qualitative study of lab-based research in Africa we highlight the limitations of such framing and emphasize the range of additional factors necessary to effectively utilize data available online. We adopt the ‘Capabilities Approach’ proposed by Sen to highlight the distinction between simply making resources available, and doing so while fostering researchers’ ability to use them. This provides an alternative orientation that highlights the persistence of deep inequalities within the seemingly egalitarian-inspired Open Data landscape. The extent and manner of future data sharing, we propose, will hinge on the ability to respond to the heterogeneity of research environments…(More)

Citizens breaking out of filter bubbles: Urban screens as civic media


Conference Paper by Satchell, Christine et al :”Social media platforms risk polarising public opinions by employing proprietary algorithms that produce filter bubbles and echo chambers. As a result, the ability of citizens and communities to engage in robust debate in the public sphere is diminished. In response, this paper highlights the capacity of urban interfaces, such as pervasive displays, to counteract this trend by exposing citizens to the socio-cultural diversity of the city. Engagement with different ideas, networks and communities is crucial to both innovation and the functioning of democracy. We discuss examples of urban interfaces designed to play a key role in fostering this engagement. Based on an analysis of works empirically-grounded in field observations and design research, we call for a theoretical framework that positions pervasive displays and other urban interfaces as civic media. We argue that when designed for more than wayfinding, advertisement or television broadcasts, urban screens as civic media can rectify some of the pitfalls of social media by allowing the polarised user to break out of their filter bubble and embrace the cultural diversity and richness of the city….(More)”

mySidewalk


Springwise: “With vast amounts of data now publicly available, the answers to many questions lie buried in the numbers, and we already saw a publishing platform helping entrepreneurs visualize government data. For an organization as passionate about civic engagement asmySidewalk, this open data is a treasure trove of compelling stories.

mySidewalk was founded by city planners who recognized the potential force for change contained in local communities. Yet without a compelling reason to get involved, many individuals remain ‘interested bystanders’ — something mySidewalk is determined to change.

Using the latest available data, mySidewalk creates dashboards that are customized for every project to help local public officials make the most informed decisions possible. The dashboards present visualizations of a wide range of socioeconomic and demographic datasets, as well as provide local, regional and national comparisons, all of which help to tell the stories behind the numbers.

It is those stories that mySidewalk believes will provide enough motivation for the ‘interested bystanders’ to get involved. As it says on the mySidewalk website, “Share your ideas. Shape your community.” Organizations of all types have taken notice of the power of data, with businesses using geo-tagging to analyze social media content, and real-time information sharing helping humanitarians in crises….(More)”

Supply and demand of open data in Mexico: A diagnostic report on the government’s new open data portal


Report by Juan Ortiz Freuler: “Following a promising and already well-established trend, in February 2014 the Office of the President of Mexico launched its open data portal (datos.gob.mx). This diagnostic –carried out between July and September of 2015- is designed to brief international donors and stakeholders such as members of the Open Government Partnership Steering Committee, provides the reader with contextual information to understand the state of supply and demand for open data from the portal, and the specific challenges the mexican government is facing in its quest to implement the policy. The insights offered through data processing and interviews with key stakeholders indicate the need to promote: i) A sense of ownership of datos.gob.mx by the user community, but particularly by the officials in charge of implementing the policy within each government unit; ii) The development of tools and mechanisms to increase the quality of the data provided through the portal; and iii) Civic hacking of the portal to promote innovation, and a sense of appropriation that would increase the policy’s long-term resilience to partisan and leadership change….(More)”

See also Underlying data: http://bit.ly/dataMXEng1Spanish here: http://bit.ly/DataMxCastellUnderlying data:http://bit.ly/dataMX2

NEW Platform for Sharing Research on Opening Governance: The Open Governance Research Exchange (OGRX)


Andrew Young: “Today,  The GovLab, in collaboration with founding partners mySociety and the World Bank’s Digital Engagement Evaluation Team are launching the Open Governance Research Exchange (OGRX), a new platform for sharing research and findings on innovations in governance.

From crowdsourcing to nudges to open data to participatory budgeting, more open and innovative ways to tackle society’s problems and make public institutions more effective are emerging. Yet little is known about what innovations actually work, when, why, for whom and under what conditions.

And anyone seeking existing research is confronted with sources that are widely dispersed across disciplines, often locked behind pay walls, and hard to search because of the absence of established taxonomies. As the demand to confront problems in new ways grows so too does the urgency for making learning about governance innovations more accessible.

As part of GovLab’s broader effort to move from “faith-based interventions” toward more “evidence-based interventions,” OGRX curates and makes accessible the most diverse and up-to-date collection of findings on innovating governance. At launch, the site features over 350 publications spanning a diversity of governance innovation areas, including but not limited to:

Visit ogrx.org to explore the latest research findings, submit your own work for inclusion on the platform, and share knowledge with others interested in using science and technology to improve the way we govern. (More)”