The Ethics of Hiding Your Data From the Machines


Molly Wood at Wired: “…But now that data is being used to train artificial intelligence, and the insights those future algorithms create could quite literally save lives.

So while targeted advertising is an easy villain, data-hogging artificial intelligence is a dangerously nuanced and highly sympathetic bad guy, like Erik Killmonger in Black Panther. And it won’t be easy to hate.

I recently met with a company that wants to do a sincerely good thing. They’ve created a sensor that pregnant women can wear, and it measures their contractions. It can reliably predict when women are going into labor, which can help reduce preterm births and C-sections. It can get women into care sooner, which can reduce both maternal and infant mortality.

All of this is an unquestionable good.

And this little device is also collecting a treasure trove of information about pregnancy and labor that is feeding into clinical research that could upend maternal care as we know it. Did you know that the way most obstetricians learn to track a woman’s progress through labor is based on a single study from the 1950s, involving 500 women, all of whom were white?…

To save the lives of pregnant women and their babies, researchers and doctors, and yes, startup CEOs and even artificial intelligence algorithms need data. To cure cancer, or at least offer personalized treatments that have a much higher possibility of saving lives, those same entities will need data….

And for we consumers, well, a blanket refusal to offer up our data to the AI gods isn’t necessarily the good choice either. I don’t want to be the person who refuses to contribute my genetic data via 23andMe to a massive research study that could, and I actually believe this is possible, lead to cures and treatments for diseases like Parkinson’s and Alzheimer’s and who knows what else.

I also think I deserve a realistic assessment of the potential for harm to find its way back to me, because I didn’t think through or wasn’t told all the potential implications of that choice—like how, let’s be honest, we all felt a little stung when we realized the 23andMe research would be through a partnership with drugmaker (and reliable drug price-hiker) GlaxoSmithKline. Drug companies, like targeted ads, are easy villains—even though this partnership actually could produce a Parkinson’s drug. But do we know what GSK’s privacy policy looks like? That deal was a level of sharing we didn’t necessarily expect….(More)”.

Stop the Open Data Bus, We Want to Get Off


Paper by Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague: “The subject of this report is the re-identification of individuals in the Myki public transport dataset released as part of the Melbourne Datathon 2018. We demonstrate the ease with which we were able to re-identify ourselves, our co-travellers, and complete strangers; our analysis raises concerns about the nature and granularity of the data released, in particular the ability to identify vulnerable or sensitive groups…

This work highlights how a large number of passengers could be re-identified in the 2018 Myki data release, with detailed discussion of specific people. The implications of re-identification are potentially serious: ex-partners, one-time acquaintances, or other parties can determine places of home, work, times of travel, co-travelling patterns—presenting risk to vulnerable groups in particular…

In 2018 the Victorian Government released a large passenger-centric transport dataset to a data science competition—the 2018 Melbourne Datathon. Access to the data was unrestricted, with a URL provided on the datathon’s website to download the complete dataset from an Amazon S3 bucket. Over 190 teams continued to analyse the data through the 2-month competition period. The data consisted of touch on and touch off events for the Myki smart card ticketing system used throughout the state of Victoria, Australia. With such data, contestants would be able to apply retrospective analyses on an entire public transport system, explore the suitability of predictive models, etc.

The Myki ticketing system is used across Victorian public transport: on trains, buses and trams. The dataset was a longitudinal dataset, consisting of touch on and touch off events from Week 27 in 2015 through to Week 26 in 2018. Each event contained a card identifier (cardId; not the actual card number), the card type, the time of the touch on or off, and various location information, for example a stop ID or route ID, along with other fields which we omit here for brevity. Events could be indexed by the cardId and as such, all the events associated with a single card could be retrieved. There are a total of 15,184,336 cards in the dataset—more than twice the 2018 population of Victoria. It appears that all touch on and off events for metropolitan trains and trams have been included, though other forms of transport such as intercity trains and some buses are absent. In total there are nearly 2 billion touch on and off events in the dataset.
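Because every event carries a persistent card identifier, a card’s entire three-year travel history is one group-by away. Here is a minimal sketch in Python/pandas of what any contestant could do with the release; the file and column names are illustrative, not the dataset’s actual schema:

```python
import pandas as pd

# Illustrative file and column names; the real release used its own schema.
events = pd.read_csv(
    "myki_touch_events.csv",
    parse_dates=["event_time"],
    usecols=["card_id", "card_type", "event_type", "event_time", "stop_id", "route_id"],
)

# All touch on/off events share persistent card identifiers, so one
# group-by yields the complete travel history of every card.
histories = events.sort_values("event_time").groupby("card_id")

# Retrieve every event for a single card of interest.
one_card = histories.get_group(123456789)
print(one_card[["event_type", "event_time", "stop_id", "route_id"]])
```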

No information was provided as to the de-identification that was performed on the dataset. Our analysis indicates that little to no de-identification took place on the bulk of the data, as will become evident in Section 3. The exception is the cardId, which appears to have been mapped in some way from the Myki Card Number. The exact mapping has not been discovered, although concerns remain as to its security effectiveness….(More)”.
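The report does not reveal how the cardId was derived from the card number, so the following is purely illustrative: a keyed-hash (HMAC) pseudonymization of the kind one would hope for, which prevents the card number being recovered or brute-forced without the secret key. Every name below is an assumption, not the scheme actually used:

```python
import hmac
import hashlib

# Hypothetical secret key; in practice it must be generated randomly,
# stored securely, and never published alongside the data.
SECRET_KEY = b"replace-with-a-long-random-secret"

def pseudonymize_card_number(card_number: str) -> str:
    """Map a card number to a stable pseudonym (illustrative only).

    A plain unkeyed hash would be brute-forceable over the small space
    of valid card numbers; HMAC with a secret key closes that avenue.
    """
    return hmac.new(SECRET_KEY, card_number.encode(), hashlib.sha256).hexdigest()

print(pseudonymize_card_number("1234567890"))  # made-up card number
```

Even a cryptographically sound mapping, however, would not have prevented the re-identifications described above: the longitudinal travel patterns themselves are the identifying signal.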

Data Management Law for the 2020s: The Lost Origins and the New Needs


Paper by Przemysław Pałka: “In the data analytics society, each individual’s disclosure of personal information imposes costs on others. This disclosure enables companies, deploying novel forms of data analytics, to infer new knowledge about other people and to use this knowledge to engage in potentially harmful activities. These harms go beyond privacy and include difficult-to-detect price discrimination, preference manipulation, and even social exclusion. Currently existing, individual-focused data protection regimes leave law unable to account for these social costs or to manage them.

This Article suggests a way out, by proposing to re-conceptualize the problem of social costs of data analytics through the new frame of “data management law.” It offers a critical comparison of the two existing models of data governance: the American “notice and choice” approach and the European “personal data protection” regime (currently expressed in the GDPR). Tracing their origin to a single report issued in 1973, the article demonstrates how they developed differently under the influence of different ideologies (market-centered liberalism, and human rights, respectively). It also shows how both ultimately failed at addressing the challenges outlined already forty-five years ago. 

To tackle these challenges, this Article argues for three normative shifts. First, it proposes to go beyond “privacy” and towards “social costs of data management” as the framework for conceptualizing and mitigating the negative effects of corporate data usage. Second, it argues for going beyond individual interests to account for collective ones, and for replacing contracts with regulation as the means of creating norms governing data management. Third, it argues that the nature of the decisions about these norms is political, and so political means, in place of technocratic solutions, need to be employed….(More)”.

The Data Protection Officer Handbook


Handbook by Douwe Korff and Marie Georges: “This Handbook was prepared for and is used in the EU-funded “T4DATA” training-of-trainers programme. Part I explains the history and development of European data protection law and provides an overview of European data protection instruments, including the Council of Europe Convention and its “Modernisation” and the various EU data protection instruments relating to Justice and Home Affairs, the CFSP and the EU institutions, before focusing on the GDPR in Part II. The final part (Part III) consists of detailed practical advice on the various tasks of the Data Protection Officer now institutionalised by the GDPR. Although produced for the T4DATA programme, which focusses on DPOs in the public sector, it is hoped that the Handbook will also be useful to anyone else interested in the application of the GDPR, including DPOs in the private sector….(More)”.

Guidance Note: Statistical Disclosure Control


Centre for Humanitarian Data: “Survey and needs assessment data, or what is known as ‘microdata’, is essential for providing adequate response to crisis-affected people. However, collecting this information does present risks. Even as great effort is taken to remove unique identifiers such as names and phone numbers from microdata so no individual persons or communities are exposed, combining key variables such as location or ethnicity can still allow for re-identification of individual respondents. Statistical Disclosure Control (SDC) is one method for reducing this risk. 
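A standard first step in SDC is to measure how identifying a combination of key variables is, for instance via k-anonymity: the size of the smallest group of respondents who share the same values on those variables. Below is a minimal sketch, assuming the microdata sits in a pandas DataFrame with hypothetical columns:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, key_variables: list[str]) -> int:
    """Return the smallest group size over a combination of key variables.

    k = 1 means at least one respondent is unique on these variables
    and is therefore at high risk of re-identification.
    """
    return int(df.groupby(key_variables).size().min())

# Hypothetical needs-assessment microdata (direct identifiers already removed).
microdata = pd.DataFrame({
    "district":  ["A", "A", "B", "B", "B", "C"],
    "ethnicity": ["x", "x", "y", "y", "x", "y"],
    "sex":       ["f", "m", "f", "f", "m", "f"],
})

print(k_anonymity(microdata, ["district"]))               # 1: district C is a group of one
print(k_anonymity(microdata, ["district", "ethnicity"]))  # 1: combining variables sharpens the risk
```

When k falls below a chosen threshold, techniques such as suppression, aggregation, or noise addition are applied until the residual risk is acceptable.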

The Centre has developed a Guidance Note on Statistical Disclosure Control that outlines the steps involved in the SDC process, potential applications for its use, case studies and key actions for humanitarian data practitioners to take when managing sensitive microdata. Along with an overview of what SDC is and what tools are available, the Guidance Note outlines how the Centre is using this process to mitigate risk for datasets shared on HDX. …(More)”.

Concerns About Online Data Privacy Span Generations


Internet Innovation Alliance: “Are Millennials okay with the collection and use of their data online because they grew up with the internet?

In an effort to help inform policymakers about the views of Americans across generations on internet privacy, the Internet Innovation Alliance, in partnership with Icon Talks, the Hispanic Technology & Telecommunications Partnership (HTTP), and the Millennial Action Project, commissioned a national study of U.S. consumers who have witnessed a steady stream of online privacy abuses, data misuses, and security breaches in recent years. The survey examined the concerns of U.S. adults—overall and broken out by age group and other demographics—about tech and social media companies’ collection and use of personal data and location information, including its use to tailor the online experience; the potential for personal financial information to be hacked from those companies; and the need for a single, national policy addressing consumer data privacy.

Download: “Concerns About Online Data Privacy Span Generations” IIA white paper pdf.

Download: “Consumer Data Privacy Concerns” CivicScience report pdf….(More)”

Data Is a Development Issue


Paper by Susan Ariel Aaronson: “Many wealthy states are transitioning to a new economy built on data. Individuals and firms in these states have expertise in using data to create new goods and services as well as in how to use data to solve complex problems. Other states may be rich in data but do not yet see their citizens’ personal data or their public data as an asset. Most states are learning how to govern and maintain trust in the data-driven economy; however, many developing countries are not well positioned to govern data in a way that encourages development. Meanwhile, some 76 countries are developing rules and exceptions to the rules governing cross-border data flows as part of new negotiations on e-commerce. This paper uses a wide range of metrics to show that most developing and middle-income countries are not ready or able to provide an environment where their citizens’ personal data is protected and where public data is open and readily accessible. Not surprisingly, greater wealth is associated with better scores on all the metrics. Yet, many industrialized countries are also struggling to govern the many different types and uses of data. The paper argues that data governance will be essential to development, and that donor nations have a responsibility to work with developing countries to improve their data governance….(More)”.

The personification of big data


Paper by Phillip Douglas Stevenson and Christopher Andrew Mattson: “Organizations all over the world, both national and international, gather demographic data so that the progress of nations and peoples can be tracked. This data is often made available to the public in the form of aggregated national-level data or individual responses (microdata). Product designers likewise conduct surveys to better understand their customers and create personas. Personas are archetypes of the individuals who will use, maintain, sell or otherwise be affected by the products created by designers. Personas help designers better understand the person the product is designed for. Unfortunately, collecting customer information and creating personas is often a slow and expensive process.

In this paper, we introduce a new method of creating personas, leveraging publicly available databanks of both aggregated national-level statistics and individual responses (microdata). A computational persona generator is introduced that creates a population of personas that mirrors a real population in terms of size and statistics. Realistic individual personas are filtered from this population for use in product development…(More)”.
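The paper’s generator itself is not reproduced here, but the core idea can be sketched: draw synthetic personas attribute by attribute from published aggregate distributions, so that the generated population matches the real one in aggregate. The attributes and proportions below are invented for illustration:

```python
import random

# Invented marginal distributions standing in for published national statistics.
MARGINALS = {
    "age_band": {"18-29": 0.25, "30-44": 0.30, "45-64": 0.30, "65+": 0.15},
    "setting":  {"urban": 0.60, "rural": 0.40},
    "income":   {"low": 0.40, "middle": 0.45, "high": 0.15},
}

def sample_persona(rng: random.Random) -> dict:
    """Draw one synthetic persona from the aggregate statistics.

    Sampling each attribute independently reproduces the published marginals;
    the paper goes further and uses individual microdata, which also captures
    correlations between attributes.
    """
    return {
        attribute: rng.choices(list(distribution), weights=list(distribution.values()))[0]
        for attribute, distribution in MARGINALS.items()
    }

rng = random.Random(42)
population = [sample_persona(rng) for _ in range(10_000)]
print(population[0])  # one synthetic persona, e.g. an age band, setting, and income level
```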

Responding to Some Challenges Posed by the Reidentification of Anonymized Personal Data


Paper by Herman T. Tavani and Frances S. Grodzinsky: “In this paper, we examine a cluster of ethical controversies generated by the reidentification of anonymized personal data in the context of big data analytics, with particular attention to the implications for personal privacy. Our paper is organized into two main parts. Part One examines some ethical problems involving re-identification of personally identifiable information (PII) in large data sets. Part Two begins with a brief description of Moor and Weckert’s Dynamic Ethics (DE) and Nissenbaum’s Contextual Integrity (CI) Frameworks. We then investigate whether these frameworks, used together, can provide us with a more robust scheme for analyzing privacy concerns that arise in the re-identification process (as well as within the larger context of big data analytics). This paper does not specifically address re-identification-related privacy concerns that arise in the context of the European Union’s General Data Protection Regulation (GDPR). Instead, we examine those issues in a separate work….(More)”.

“Anonymous” Data Won’t Protect Your Identity


Sophie Bushwick at Scientific American: “The world produces roughly 2.5 quintillion bytes of digital data per day, adding to a sea of information that includes intimate details about many individuals’ health and habits. To protect privacy, data brokers must anonymize such records before sharing them with researchers and marketers. But a new study finds it is relatively easy to reidentify a person from a supposedly anonymized data set—even when that set is incomplete.

Massive data repositories can reveal trends that teach medical researchers about disease, demonstrate issues such as the effects of income inequality, coach artificial intelligence into humanlike behavior and, of course, aim advertising more efficiently. To shield people who—wittingly or not—contribute personal information to these digital storehouses, most brokers send their data through a process of deidentification. This procedure involves removing obvious markers, including names and social security numbers, and sometimes taking other precautions, such as introducing random “noise” data to the collection or replacing specific details with general ones (for example, swapping a birth date of “March 7, 1990” for “January–April 1990”). The brokers then release or sell a portion of this information.
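Here is a minimal sketch of the generalization step just described, using the article’s own birth-date example; the field names and the choice of which identifiers to drop are illustrative:

```python
from datetime import date

def generalize_birth_date(d: date) -> str:
    """Swap an exact birth date for a coarse range, e.g.
    'March 7, 1990' -> 'January-April 1990' as in the article."""
    bands = [(1, "January-April"), (5, "May-August"), (9, "September-December")]
    label = [name for start, name in bands if d.month >= start][-1]
    return f"{label} {d.year}"

def deidentify(record: dict) -> dict:
    """Remove obvious markers and generalize the quasi-identifiers that remain."""
    cleaned = {k: v for k, v in record.items() if k not in {"name", "ssn"}}
    cleaned["birth"] = generalize_birth_date(cleaned.pop("birth_date"))
    return cleaned

print(deidentify({"name": "Jane Doe", "ssn": "000-00-0000",
                  "birth_date": date(1990, 3, 7), "zip": "10001"}))
# {'zip': '10001', 'birth': 'January-April 1990'}
```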

“Data anonymization is basically how, for the past 25 years, we’ve been using data for statistical purposes and research while preserving people’s privacy,” says Yves-Alexandre de Montjoye, an assistant professor of computational privacy at Imperial College London and co-author of the new study, published this week in Nature Communications. Many commonly used anonymization techniques, however, originated in the 1990s, before the Internet’s rapid development made it possible to collect such an enormous amount of detail about things such as an individual’s health, finances, and shopping and browsing habits. This discrepancy has made it relatively easy to connect an anonymous line of data to a specific person: if a private detective is searching for someone in New York City and knows the subject is male, is 30 to 35 years old and has diabetes, the sleuth would not be able to deduce the man’s name—but could likely do so quite easily if he or she also knows the target’s birthday, number of children, zip code, employer and car model….(More)”
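The private-detective scenario amounts to filtering a “deidentified” table on known attributes and watching the candidate set shrink. A toy sketch (all data invented):

```python
import pandas as pd

# Toy stand-in for a deidentified data set: no names, but rich attributes.
people = pd.DataFrame({
    "sex":      ["m", "m", "m", "f"],
    "age":      [32, 33, 32, 31],
    "diabetic": [True, True, False, True],
    "zip":      ["10001", "10001", "10002", "10001"],
    "children": [2, 0, 2, 1],
})

candidates = people
for column, value in [("sex", "m"), ("diabetic", True),
                      ("zip", "10001"), ("children", 2)]:
    candidates = candidates[candidates[column] == value]
    print(f"after matching {column}={value!r}: {len(candidates)} candidate(s)")
# Each extra attribute narrows the set; here four attributes already
# single out one row, which is the re-identification the study quantifies.
```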