Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency


at Medium: “…So why, then, does granular, social data make people uncomfortable? Well, ultimately—and at the risk of stating the obvious—it’s because data of this sort brings up issues regarding ethics, privacy, bias, fairness, and inclusion. In turn, these issues make people uncomfortable because, at least as the popular narrative goes, these are new issues that fall outside the expertise of those aggregating and analyzing big data. But the thing is, these issues aren’t actually new. Sure, they may be new to computer scientists and software engineers, but they’re not new to social scientists.

This is why I think the world of big data and those working in it — ranging from the machine learning researchers developing new analysis tools all the way up to the end-users and decision-makers in government and industry — can learn something from computational social science….

So, if technology companies and government organizations — the biggest players in the big data game — are going to take issues like bias, fairness, and inclusion seriously, they need to hire social scientists — the people with the best training in thinking about important societal issues. Moreover, it’s important that this hiring is done not just in a token, “hire one social scientist for every hundred computer scientists” kind of way, but in a serious, “creating interdisciplinary teams” kind of way.


Thanks to Moritz Hardt for the picture!

While preparing for my talk, I read an article by Moritz Hardt, entitled “How Big Data is Unfair.” In this article, Moritz notes that even in supposedly large data sets, there is always proportionally less data available about minorities. Moreover, statistical patterns that hold for the majority may be invalid for a given minority group. He gives, as an example, the task of classifying user names as “real” or “fake.” In one culture — comprising the majority of the training data — real names might be short and common, while in another they might be long and unique. As a result, the classic machine learning objective of “good performance on average” may actually be detrimental to those in the minority group….
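Hardt’s point about averages can be made concrete with a toy calculation. The group sizes and per-group accuracies below are hypothetical, not from his article; they simply show how a size-weighted average can mask poor performance on a minority group:

```python
# Toy illustration: a classifier can score well "on average" while
# failing badly on a minority group, because the minority group
# contributes few examples to the aggregate metric.

def average_accuracy(group_sizes, group_accuracies):
    """Overall accuracy as the size-weighted mean of per-group accuracies."""
    total = sum(group_sizes)
    return sum(n * acc for n, acc in zip(group_sizes, group_accuracies)) / total

# Hypothetical numbers: 95% of users belong to the majority group.
sizes = [9500, 500]          # majority, minority
accuracies = [0.98, 0.60]    # the model is nearly useless for the minority

overall = average_accuracy(sizes, accuracies)
print(round(overall, 3))     # 0.961 -- looks excellent in aggregate
```

A model reported as “96% accurate” here is little better than a coin flip for one user in twenty, which is exactly the failure mode that an aggregate objective cannot see.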

As an alternative, I would advocate prioritizing vital social questions over data availability — an approach more common in the social sciences. Moreover, if we’re prioritizing social questions, perhaps we should take this as an opportunity to prioritize those questions explicitly related to minorities and bias, fairness, and inclusion. Of course, putting questions first — especially questions about minorities, for whom there may not be much available data — means that we’ll need to go beyond standard convenience data sets and general-purpose “hammer” methods. Instead we’ll need to think hard about how best to instrument data aggregation and curation mechanisms that, when combined with precise, targeted models and tools, are capable of elucidating fine-grained, hard-to-see patterns….(More).”

Geneticists Begin Tests of an Internet for DNA


Antonio Regalado in MIT Technology Review: “A coalition of geneticists and computer programmers calling itself the Global Alliance for Genomics and Health is developing protocols for exchanging DNA information across the Internet. The researchers hope their work could be as important to medical science as HTTP, the protocol created by Tim Berners-Lee in 1989, was to the Web.
One of the group’s first demonstration projects is a simple search engine that combs through the DNA letters of thousands of human genomes stored at nine locations, including Google’s server farms and the University of Leicester, in the U.K. According to the group, which includes key players in the Human Genome Project, the search engine is the start of a kind of Internet of DNA that may eventually link millions of genomes together.
The technologies being developed are application program interfaces, or APIs, that let different gene databases communicate. Pooling information could speed discoveries about what genes do and help doctors diagnose rare birth defects by matching children with suspected gene mutations to others who are known to have them.
The alliance was conceived two years ago at a meeting in New York of 50 scientists who were concerned that genome data was trapped in private databases, tied down by legal consent agreements with patients, limited by privacy rules, or jealously controlled by scientists to further their own scientific work. It styles itself after the World Wide Web Consortium, or W3C, a body that oversees standards for the Web.
“It’s creating the Internet language to exchange genetic information,” says David Haussler, scientific director of the genome institute at the University of California, Santa Cruz, who is one of the group’s leaders.
The group began releasing software this year. Its hope—as yet largely unrealized—is that any scientist will be able to ask questions about genome data possessed by other laboratories, without running afoul of technical barriers or privacy rules….(More)”
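The “ask questions without moving the data” idea behind these APIs can be sketched in a few lines. The class names, site names, and variant tuples below are hypothetical stand-ins, not the alliance’s actual software; the sketch only illustrates the beacon-style pattern in which each site answers yes/no to a query about a genetic variant, so raw genomes never leave the lab that holds them:

```python
# A minimal sketch of a beacon-style federated query: each site
# reveals only a boolean answer to "does any genome here carry
# allele X at position Y?", never the underlying genome data.

class BeaconSite:
    def __init__(self, name, variants):
        self.name = name
        # variants: a set of (chromosome, position, allele) tuples
        self.variants = variants

    def query(self, chrom, pos, allele):
        """Answer yes/no without exposing any individual genome."""
        return (chrom, pos, allele) in self.variants

def federated_query(sites, chrom, pos, allele):
    """Ask every site the same question; collect only boolean answers."""
    return {site.name: site.query(chrom, pos, allele) for site in sites}

sites = [
    BeaconSite("site-a", {("1", 156105028, "T")}),
    BeaconSite("site-b", {("13", 32315474, "G")}),
]
print(federated_query(sites, "1", 156105028, "T"))
# {'site-a': True, 'site-b': False}
```

Because each answer is a single bit, this design sidesteps many of the technical and privacy barriers the article describes, though repeated queries can still leak information, which is why access controls remain part of the protocol discussions.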

The Free 'Big Data' Sources Everyone Should Know


Bernard Marr at Linkedin Pulse: “…The moves by companies and governments to put large amounts of information into the public domain have made large volumes of data accessible to everyone…. Here’s my rundown of some of the best free big data sources available today.

Data.gov

The US Government pledged last year to make all government data available freely online. This site is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime. To check it out, click here.

US Census Bureau

A wealth of information on the lives of US citizens covering population data, geographic data and education. To check it out, click here.

European Union Open Data Portal

As above, but based on data from European Union institutions. To check it out, click here.

Data.gov.uk

Data from the UK Government, including the British National Bibliography – metadata on all UK books and publications since 1950. To check it out, click here.

The CIA World Factbook

Information on history, population, economy, government, infrastructure and military of 267 countries. To check it out, click here.

Healthdata.gov

125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics. To check it out, click here.

NHS Health and Social Care Information Centre

Health data sets from the UK National Health Service. To check it out, click here.

Amazon Web Services public datasets

Huge resource of public data, including the 1000 Genomes Project, an attempt to build the most comprehensive database of human genetic information, and NASA’s database of satellite imagery of Earth. To check it out, click here.

Facebook Graph

Although much of the information on users’ Facebook profiles is private, a lot isn’t – Facebook provides the Graph API as a way of querying the huge amount of information that its users are happy to share with the world (or can’t hide because they haven’t worked out how the privacy settings work). To check it out, click here.

Gapminder

Compilation of data from sources including the World Health Organization and World Bank covering economic, medical and social statistics from around the world. To check it out, click here.

Google Trends

Statistics on search volume (as a proportion of total search) for any given term, since 2004. To check it out, click here.

Google Finance

40 years’ worth of stock market data, updated in real time. To check it out, click here.

Google Books Ngrams

Search and analyze the full text of any of the millions of books digitised as part of the Google Books project. To check it out, click here.

National Climatic Data Center

Huge collection of environmental, meteorological and climate data sets from the US National Climatic Data Center. The world’s largest archive of weather data. To check it out, click here.

DBPedia

Wikipedia comprises millions of pieces of data, structured and unstructured, on every subject under the sun. DBPedia is an ambitious project to catalogue this data and create a public, freely distributable database allowing anyone to analyze it. To check it out, click here.

Topsy

Free, comprehensive social media data is hard to come by – after all, their data is what generates profits for the big players (Facebook, Twitter etc.), so they don’t want to give it away. However, Topsy provides a searchable database of public tweets going back to 2006, as well as several tools to analyze the conversations. To check it out, click here.

Likebutton

Mines Facebook’s public data – globally and from your own network – to give an overview of what people “Like” at the moment. To check it out, click here.

New York Times

Searchable, indexed archive of news articles going back to 1851. To check it out, click here.

Freebase

A community-compiled database of structured data about people, places and things, with over 45 million entries. To check it out, click here.

Million Song Data Set

Metadata on over a million songs and pieces of music. Part of Amazon Web Services. To check it out, click here.”
See also Bernard Marr’s blog at Big Data Guru

Pricey privacy: Framing the economy of information in the digital age


Paper by Federica Fornaciari in FirstMonday: “As new information technologies become ubiquitous, individuals are often prompted to rethink disclosure. Available media narratives may influence one’s understanding of the benefits and costs related to sharing personal information. This study, guided by frame theory, undertakes a Critical Discourse Analysis (CDA) of media discourse developed to discuss the privacy concerns related to the corporate collection and trade of personal information. The aim is to investigate the frames — the central organizing ideas — used in the media to discuss such an important aspect of the economics of personal data. The CDA explored 130 articles published in the New York Times between 2000 and 2012. Findings reveal that the articles utilized four frames: confusion and lack of transparency, justification and private interests, law and self-regulation, and commodification of information. Articles used episodic framing, often discussing specific instances of infringements rather than broader thematic accounts. Media coverage tended to frame personal information as a commodity that may be traded, rather than as a fundamental value.”

Digital Sociology


New book by Deborah Lupton: “We now live in a digital society. New digital technologies have had a profound influence on everyday life, social relations, government, commerce, the economy and the production and dissemination of knowledge. People’s movements in space, their purchasing habits and their online communication with others are now monitored in detail by digital technologies. We are increasingly becoming digital data subjects, whether we like it or not, and whether we choose this or not.
The sub-discipline of digital sociology provides a means by which the impact, development and use of these technologies and their incorporation into social worlds, social institutions and concepts of selfhood and embodiment may be investigated, analysed and understood. This book introduces a range of interesting social, cultural and political dimensions of digital society and discusses some of the important debates occurring in research and scholarship on these aspects. It covers the new knowledge economy and big data, reconceptualising research in the digital era, the digitisation of higher education, the diversity of digital use, digital politics and citizen digital engagement, the politics of surveillance, privacy issues, the contribution of digital devices to embodiment and concepts of selfhood and many other topics.”

Code of Conduct: Cyber Crowdsourcing for Good


Patrick Meier at iRevolution: “There is currently no unified code of conduct for digital crowdsourcing efforts in the development, humanitarian or human rights space. As such, we propose the following principles (displayed below) as a way to catalyze a conversation on these issues and to improve and/or expand this Code of Conduct as appropriate.
This initial draft was put together by Kate Chapman, Brooke Simons and myself. The link above points to this open, editable Google Doc. So please feel free to contribute your thoughts by inserting comments where appropriate. Thank you.
An organization that launches a digital crowdsourcing project must:

  • Provide clear volunteer guidelines on how to participate in the project so that volunteers are able to contribute meaningfully.
  • Test their crowdsourcing platform prior to any project or pilot to ensure that the system will not crash due to obvious bugs.
  • Disclose the purpose of the project, exactly which entities will be using and/or have access to the resulting data, to what end exactly, over what period of time and what the expected impact of the project is likely to be.
  • Disclose whether volunteer contributions to the project will or may be used as training data in subsequent machine learning research.
  • ….

An organization that launches a digital crowdsourcing project should:

  • Share as much of the resulting data with volunteers as possible without violating data privacy or the principle of Do No Harm.
  • Enable volunteers to opt out of having their tasks contribute to subsequent machine learning research, providing them with the option of having their contributions withheld from such studies.
  • … “

Seattle Launches Sweeping, Ethics-Based Privacy Overhaul


for the Privacy Advisor: “The City of Seattle this week launched a citywide privacy initiative aimed at providing greater transparency into the city’s data collection and use practices.
To that end, the city has convened a group of stakeholders, the Privacy Advisory Committee, comprising various government departments, to look at the ways the city is using data collected from practices as common as utility bill payments and renewing pet licenses or during the administration of emergency services like police and fire. By this summer, the committee will deliver to the City Council suggested principles and a “privacy statement” to provide direction on privacy practices citywide.
In addition, the city has partnered with the University of Washington, where Jan Whittington, assistant professor of urban design and planning and associate director at the Center for Information Assurance and Cybersecurity, has been given a $50,000 grant to look at open data, privacy and digital equity and how municipal data collection could harm consumers.
Responsible for all things privacy in this progressive city is Michael Mattmiller, who was hired to the position of chief technology officer (CTO) for the City of Seattle in June. Before his current gig, he worked as a senior strategist in enterprise cloud privacy for Microsoft. He said it’s an exciting time to be at the helm of the office because there’s momentum, there’s talent and there’s intention.
“We’re at this really interesting time where we have a City Council that strongly cares about privacy … We have a new police chief who wants to be very good on privacy … We also have a mayor who is focused on the city being an innovative leader in the way we interact with the public,” he said.
In fact, some City Council members have taken it upon themselves to meet with various groups and coalitions. “We have a really good, solid environment we think we can leverage to do something meaningful,” Mattmiller said….
Armbruster said the end goal is to create policies that will hold weight over time.
“I think when looking at privacy principles, from an ethical foundation, the idea is to create something that will last while technology dances around us,” she said, adding the principles should answer the question, “What do we stand for as a city and how do we want to move forward? So any technology that falls into our laps, we can evaluate and tailor or perhaps take a pass on as it falls under our ethical framework.”
The bottom line, Mattmiller said, is making a decision that says something about Seattle and where it stands.
“How do we craft a privacy policy that establishes who we want to be as a city and how we want to operate?” Mattmiller asked.”

The Creepy New Wave of the Internet


Review by Sue Halpern in the New York Review of Books: “…So here comes the Internet’s Third Wave. In its wake jobs will disappear, work will morph, and a lot of money will be made by the companies, consultants, and investment banks that saw it coming. Privacy will disappear, too, and our intimate spaces will become advertising platforms—last December Google sent a letter to the SEC explaining how it might run ads on home appliances—and we may be too busy trying to get our toaster to communicate with our bathroom scale to notice. Technology, which allows us to augment and extend our native capabilities, tends to evolve haphazardly, and the future that is imagined for it—good or bad—is almost always historical, which is to say, naive.”

A World That Counts: Mobilising a Data Revolution for Sustainable Development


Executive Summary of the Report by the UN Secretary-General’s Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG): “Data are the lifeblood of decision-making and the raw material for accountability. Without high-quality data providing the right information on the right things at the right time, designing, monitoring and evaluating effective policies becomes almost impossible.
New technologies are leading to an exponential increase in the volume and types of data available, creating unprecedented possibilities for informing and transforming society and protecting the environment. Governments, companies, researchers and citizen groups are in a ferment of experimentation, innovation and adaptation to the new world of data, a world in which data are bigger, faster and more detailed than ever before. This is the data revolution.
Some are already living in this new world. But too many people, organisations and governments are excluded because of lack of resources, knowledge, capacity or opportunity. There are huge and growing inequalities in access to data and information and in the ability to use it.
Data needs improving. Despite considerable progress in recent years, whole groups of people are not being counted and important aspects of people’s lives and environmental conditions are still not measured. For people, this can lead to the denial of basic rights, and for the planet, to continued environmental degradation. Too often, existing data remain unused because they are released too late or not at all, not well-documented and harmonized, or not available at the level of detail needed for decision-making.
As the world embarks on an ambitious project to meet new Sustainable Development Goals (SDGs), there is an urgent need to mobilise the data revolution for all people and the whole planet in order to monitor progress, hold governments accountable and foster sustainable development. More diverse, integrated, timely and trustworthy information can lead to better decision-making and real-time citizen feedback. This in turn enables individuals, public and private institutions, and companies to make choices that are good for them and for the world they live in.
This report sets out the main opportunities and risks presented by the data revolution for sustain-able development. Seizing these opportunities and mitigating these risks requires active choices, especially by governments and international institutions. Without immediate action, gaps between developed and developing countries, between information-rich and information-poor people, and between the private and public sectors will widen, and risks of harm and abuses of human rights will grow.

An urgent call for action: Key recommendations

The strong leadership of the United Nations (UN) is vital for the success of this process. The Independent Expert Advisory Group (IEAG), established in August 2014, offers the UN Secretary-General several key recommendations for actions to be taken in the near future, summarised below:

  1. Develop a global consensus on principles and standards: The disparate worlds of public, private and civil society data and statistics providers need to be urgently brought together to build trust and confidence among data users. We propose that the UN establish a process whereby key stakeholders create a “Global Consensus on Data”, to adopt principles concerning legal, technical, privacy, geospatial and statistical standards which, among other things, will facilitate openness and information exchange and promote and protect human rights.
  2. Share technology and innovations for the common good: To create mechanisms through which technology and innovation can be shared and used for the common good, we propose
    to create a global “Network of Data Innovation Networks”, to bring together the organisations and experts in the field. This would: contribute to the adoption of best practices for improving the monitoring of SDGs, identify areas where common data-related infrastructures could address capacity problems and improve efficiency, encourage collaborations, identify critical research gaps and create incentives to innovate.
  3. New resources for capacity development: Improving data is a development agenda in
    its own right, and can improve the targeting of existing resources and spur new economic opportunities. Existing gaps can only be overcome through new investments and the strengthening of capacities. A new funding stream to support the data revolution for sustainable development should be endorsed at the “Third International Conference on Financing for Development”, in Addis Ababa in July 2015. An assessment will be needed of the scale of investments, capacity development and technology transfer that is required, especially for low income countries; and proposals developed for mechanisms to leverage the creativity and resources of the private sector. Funding will also be needed to implement an education program aimed at improving people’s, infomediaries’ and public servants’ capacity and data literacy to break down barriers between people and data.
  4. Leadership for coordination and mobilisation: A UN-led “Global Partnership for Sustainable Development Data” is proposed, to mobilise and coordinate the actions and institutions required to make the data revolution serve sustainable development, promoting several initiatives, such as:
    • A “World Forum on Sustainable Development Data” to bring together the whole data ecosystem to share ideas and experiences for data improvements, innovation, advocacy and technology transfer. The first Forum should take place at the end of 2015, once the SDGs are agreed;
    • A “Global Users Forum for Data for SDGs”, to ensure feedback loops between data producers and users, help the international community to set priorities and assess results;
    • Brokering key global public-private partnerships for data sharing.
  5. Exploit some quick wins on SDG data: Establishing an “SDGs data lab” to support the development of a first wave of SDG indicators, developing an SDG analysis and visualisation platform using the most advanced tools and features for exploring data, and building a dashboard from diverse data sources on ”the state of the world”.

Never again should it be possible to say “we didn’t know”. No one should be invisible. This is the world we want – a world that counts.”

OpenUp Corporate Data while Protecting Privacy


Article by Stefaan G. Verhulst and David Sangokoya (The GovLab) for the OpenUp? Blog: “Consider a few numbers: By the end of 2014, the number of mobile phone subscriptions worldwide is expected to reach 7 billion, nearly equal to the world’s population. More than 1.82 billion people communicate on some form of social network, and almost 14 billion sensor-laden everyday objects (trucks, health monitors, GPS devices, refrigerators, etc.) are now connected and communicating over the Internet, creating a steady stream of real-time, machine-generated data.
Much of the data generated by these devices is today controlled by corporations. These companies are in effect “owners” of terabytes of data and metadata. Companies use this data to aggregate, analyze, and track individual preferences, provide more targeted consumer experiences, and add value to the corporate bottom line.
At the same time, even as we witness a rapid “datafication” of the global economy, access to data is emerging as an increasingly critical issue, essential to addressing many of our most important social, economic, and political challenges. While the rise of the Open Data movement has opened up over a million datasets around the world, much of this openness is limited to government (and, to a lesser extent, scientific) data. Access to corporate data remains extremely limited. This is a lost opportunity. If corporate data—in the form of Web clicks, tweets, online purchases, sensor data, call data records, etc.—were made available in a de-identified and aggregated manner, researchers, public interest organizations, and third parties would gain greater insights on patterns and trends that could help inform better policies and lead to greater public good (including combatting Ebola).
Corporate data sharing holds tremendous promise. But its potential—and limitations—are also poorly understood. In what follows, we share early findings of our efforts to map this emerging open data frontier, along with a set of reflections on how to safeguard privacy and other citizen and consumer rights while sharing. Understanding the practice of shared corporate data—and assessing the associated risks—is an essential step in increasing access to socially valuable data held by businesses today. This is a challenge certainly worth exploring during the forthcoming OpenUp conference!
Understanding and classifying current corporate data sharing practices
Corporate data sharing remains very much a fledgling field. There has been little rigorous analysis of different ways or impacts of sharing. Nonetheless, our initial mapping of the landscape suggests there have been six main categories of activity—i.e., ways of sharing—to date:…
Assessing risks of corporate data sharing
Although shared corporate data offers several benefits for researchers, public interest organizations, and other companies, risks do exist, especially regarding personally identifiable information (PII). When aggregated, PII can serve to help understand trends and broad demographic patterns. But if PII is inadequately scrubbed and aggregated data is linked to specific individuals, this can lead to identity theft, discrimination, profiling, and other violations of individual freedom. It can also lead to significant legal ramifications for corporate data providers….”
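The de-identify-and-aggregate release the article calls for can be sketched in miniature. The field names, records, and threshold below are hypothetical, and real de-identification requires far more than this (linkage attacks defeat naive scrubbing); the sketch only shows the basic pattern of publishing group-level counts while suppressing groups so small that an aggregate could single out an individual:

```python
# A minimal sketch of aggregation with small-group suppression:
# release only per-group counts, and drop any group with fewer
# than k members so aggregates are harder to link to individuals.

from collections import Counter

def aggregate_with_suppression(records, key, k=5):
    """Count records per group, suppressing groups with fewer than k members."""
    counts = Counter(r[key] for r in records)
    return {group: n for group, n in counts.items() if n >= k}

records = (
    [{"region": "north"}] * 12
    + [{"region": "south"}] * 7
    + [{"region": "east"}] * 2   # too small: could single people out
)

print(aggregate_with_suppression(records, "region", k=5))
# {'north': 12, 'south': 7}
```

This threshold rule is a simplification of the k-anonymity idea; production releases of corporate data typically layer on generalization, noise addition, or formal differential privacy rather than relying on suppression alone.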