The big questions for research using personal data


At the Royal Society’s “Verba” blog: “We live in an era of data. The world is generating 1.7 million billion bytes of data every minute, and the total amount of global data is expected to grow 40% year on year for the next decade. In 2003 scientists declared the mapping of the human genome complete. It took over 10 years and cost $1 billion – today it takes mere days and can be done at a fraction of the cost.

Making the most of the data revolution will be key to future scientific and economic progress. Unlocking the value of data by improving the way that we collect, analyse and use data has the potential to improve lives across a multitude of areas, ranging from business to health, and from tackling climate change to aiding civic engagement. However, its potential for public benefit must be balanced against the need for data to be used intelligently and with respect for individuals’ privacy.

Getting regulation right

The UK Data Protection Act transposed the 1995 European Data Protection Directive into UK law, at a time before widespread use of the internet and smartphones. In 2012, recognising the pace of technological change, the European Commission proposed a comprehensive reform of EU data protection rules, including a new Data Protection Regulation that would update and harmonise these rules across the EU.

The draft regulation is currently going through the EU legislative process, during which the European Parliament has proposed changes to the Commission’s text. These changes have raised concerns among researchers across Europe that the Regulation could restrict the use of personal data for research and so prevent much vital health research. For example, researchers currently use these data to better understand how to prevent and treat conditions such as cancer, diabetes and dementia. The final details of the regulation are now being negotiated, and the research community has come together to highlight the importance of data in research and articulate their concerns in a joint statement, which the Society supports.

The Society considers that datasets should be managed under a system of proportionate governance. Personal data should be shared only if doing so is necessary for research with the potential for high public value, and the sharing should be proportionate to the particular needs of a research project. Governance should also draw on consent, authorisation and safe havens – secure sites for databases containing sensitive personal data that can be accessed only by authorised researchers – as appropriate…

However, many challenges remain that are unlikely to be resolved in the current European negotiations. The new legislation covers personal data but not anonymised data – data from which information that can identify a person has been removed or replaced with a code. The assumption is that anonymisation is a foolproof way to protect personal identity. However, there have been examples of reidentification from anonymised data, and computer scientists have long pointed out the flaws of relying on anonymisation to protect an individual’s privacy… There is also a risk of leaving the public behind through lack of information and failed efforts to earn trust; and it is clear that a better understanding of the role of consent and ethical governance is needed to ensure the continuation of cutting-edge research that respects the principles of privacy.
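
To make the distinction concrete, here is a minimal Python sketch of the kind of anonymisation the text describes – a direct identifier replaced with a code. All field names and the key are invented for illustration, and the comments note why the retained attributes can still betray identity:

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be managed and stored separately.
SECRET_KEY = b"keep-this-key-away-from-the-data"

def pseudonymise(record):
    """Replace the direct identifier (the name) with a keyed code."""
    code = hmac.new(SECRET_KEY, record["name"].encode(), hashlib.sha256)
    return {
        "subject_code": code.hexdigest()[:12],  # the "code" the text describes
        "birth_date": record["birth_date"],     # quasi-identifier, retained
        "postcode": record["postcode"],         # quasi-identifier, retained
        "diagnosis": record["diagnosis"],       # the research payload
    }

print(pseudonymise({"name": "Jane Doe", "birth_date": "1980-02-01",
                    "postcode": "OX1 2JD", "diagnosis": "type 2 diabetes"}))
```

The name is gone, but the retained birth date and postcode are exactly the quasi-identifiers that reidentification attacks exploit.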

These are problems that will require attention, and questions that the Society will continue to explore…(More)”

The importance of human innovation in A.I. ethics


John C. Havens at Mashable: “….While welcoming the feedback that sensors, data and Artificial Intelligence provide, we’re at a critical inflection point. Demarcating the parameters between assistance and automation has never been more central to human well-being. But today, beauty is in the AI of the beholder. Desensitized to the value of personal data, we hemorrhage precious insights regarding our identity that define the moral nuances necessary to navigate algorithmic modernity.

If no values-based standards exist for Artificial Intelligence, then the biases of its manufacturers will define our universal code of human ethics. But this should not be their cross to bear alone. It’s time to stop vilifying the AI community and start defining in concert with their creations what the good life means surrounding our consciousness and code.

The intention of the ethics

“Begin as you mean to go forward.” Michael Stewart is founder, chairman & CEO of Lucid, an Artificial Intelligence company based in Austin that recently announced the formation of the industry’s first Ethics Advisory Panel (EAP). While Google claimed creation of a similar board when acquiring AI firm DeepMind in January 2014, no public realization of its efforts currently exists (as confirmed by a PR rep from Google for this piece). Lucid’s Panel, by comparison, has already begun functioning as a separate organization from the analytics side of the business and provides oversight for the company and its customers. “Our efforts,” Stewart says, “are guided by the principle that our ethics group is obsessed with making sure the impact of our technology is good.”

Kay Firth-Butterfield is chief officer of the EAP, and is charged with being on the vanguard of the ethical issues affecting the AI industry and society as a whole. Internally, the EAP provides the hub of ethical behavior for the company. Someone from Firth-Butterfield’s office even sits on all core product development teams. “Externally,” she notes, “we plan to apply Cyc intelligence (shorthand for ‘encyclopedia,’ Lucid’s AI causal reasoning platform) for research to demonstrate the benefits of AI and to advise Lucid’s leadership on key decisions, such as the recent signing of the LAWS letter and the end use of customer applications.”

Ensuring the impact of AI technology is positive doesn’t happen by default. But as Lucid is demonstrating, ethics doesn’t have to stymie innovation by dwelling solely in the realm of risk mitigation. Ethical processes aligning with a company’s core values can provide more deeply relevant products and increased public trust. Transparently including your customers’ values in these processes puts the person back into personalization…(Mashable)”

Policies for a fair re-use of data: Big Data and the application of anonymization techniques


Paper by Giuseppe D’Acquisto: “The legal framework on data protection in Europe sets a high standard with regard to the possible re-use of personal data. Principles like purpose limitation and data minimization challenge the emerging Big Data paradigm, where the “value” of data is linked to its somehow still unpredictable potential future uses. Nevertheless, the re-use of data is not impossible, once they are properly anonymized. The EU’s Article 29 Working Party published in 2014 an Opinion on the application of anonymization techniques, which can be implemented to enable potential re-use of previously collected data within a framework of safeguards for individuals. The paper reviews the main elements of the Opinion, with a view to the widespread adoption of anonymization policies enabling a fair re-use of data, and gives an overview of the legal and technical aspects related to anonymization, pointing out the many misconceptions on the issue…(More)”
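
Among the techniques the Opinion reviews is k-anonymity, a generalisation-based model requiring that every combination of quasi-identifiers be shared by at least k records. As a hedged illustration (field names and records invented), a toy check in Python:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

rows = [
    {"age_band": "30-39", "postcode": "OX1", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "OX1", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode": "OX2", "diagnosis": "asthma"},
]
# The third row is unique on (age_band, postcode), so 2-anonymity fails.
print(is_k_anonymous(rows, ["age_band", "postcode"], k=2))  # False
```

A dataset failing such a check would need further generalisation (coarser age bands, broader postcodes) before release – the kind of trade-off between utility and protection that underlies the misconceptions the paper points out.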

Meaningful Consent: The Economics of Privity in Networked Environments


Paper by Jonathan Cave: “Recent work on privacy (e.g. WEIS 2013/4, Meaningful Consent in the Digital Economy project) recognises the unanticipated consequences of data-centred legal protections in a world of shifting relations between data and human actors. But the rules have not caught up with these changes, and the irreversible consequences of ‘make do and mend’ are not often taken into account when changing policy.

Many of the most-protected ‘personal’ data are not personal at all, but are created to facilitate the operation of larger (e.g. administrative, economic, transport) systems or inadvertently generated by using such systems. The protection given to such data typically rests on notions of informed consent even in circumstances where such consent may be difficult to define, harder to give and nearly impossible to certify in meaningful ways. Such protections typically involve a mix of data collection, access and processing rules that are either imposed on behalf of individuals or are to be exercised by them. This approach adequately protects some personal interests, but not all – and is definitely not future-proof. Boundaries between allowing individuals to discover and pursue their interests on one side and behavioural manipulation on the other are often blurred. The costs (psychological and behavioural as well as economic and practical) of exercising control over one’s data are rarely taken into account as some instances of the Right to be Forgotten illustrate. The purposes for which privacy rights were constructed are often forgotten, or have not been reinterpreted in a world of ubiquitous monitoring data, multi-person ‘private exchanges,’ and multiple pathways through which data can be used to create and to capture value. Moreover, the parties who should be involved in making decisions – those connected by a network of informational relationships – are often not in contractual, practical or legal contact. These developments, associated with e.g. the Internet of Things, Cloud computing and big data analytics, should be recognised as challenging privacy rules and, more fundamentally, the adequacy of informed consent (e.g. to access specified data for specified purposes) as a means of managing innovative, flexible, and complex informational architectures.

This paper presents a framework for organising these challenges using them to evaluate proposed policies, specifically in relation to complex, automated, automatic or autonomous data collection, processing and use. It argues for a movement away from a system of property rights based on individual consent to a values-based ‘privity’ regime – a collection of differentiated (relational as well as property) rights and consents that may be better able to accommodate innovations. Privity regimes (see deFillipis 2006) bundle together rights regarding e.g. confidential disclosure with ‘standing’ or voice options in relation to informational linkages.

The impacts are examined through a game-theoretic comparison between the proposed privity regime and existing privacy rights in personal data markets that include: conventional ‘behavioural profiling’ and search; situations where third parties may have complementary roles or conflicting interests in such data and where data have value in relation both to specific individuals and to larger groups (e.g. ‘real-world’ health data); n-sided markets on data platforms (including social and crowd-sourcing platforms with long and short memories); and the use of ‘privity-like’ rights inherited by data objects and by autonomous systems whose ownership may be shared among many people….(More)”

Can big databases be kept both anonymous and useful?


The Economist: “….The anonymisation of a data record typically means the removal from it of personally identifiable information. Names, obviously. But also phone numbers, addresses and various intimate details like dates of birth. Such a record is then deemed safe for release to researchers, and even to the public, to make of it what they will. Many people volunteer information, for example to medical trials, on the understanding that this will happen.

But the ability to compare databases threatens to make a mockery of such protections. Participants in genomics projects, promised anonymity in exchange for their DNA, have been identified by simple comparison with electoral rolls and other publicly available information. The health records of a governor of Massachusetts were plucked from a database, again supposedly anonymous, of state-employee hospital visits using the same trick. Reporters sifting through a public database of web searches were able to correlate them in order to track down one, rather embarrassed, woman who had been idly searching for single men. And so on.
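
The mechanics behind these stories are simple enough to sketch. In the Massachusetts case, the join keys were reportedly date of birth, sex and ZIP code; the Python below (all records invented) shows how little code a linkage attack needs:

```python
# All records below are invented for illustration.
hospital = [  # "anonymised": names removed, quasi-identifiers kept
    {"dob": "1945-07-31", "sex": "F", "zip": "02138", "visit": "cardiology"},
    {"dob": "1960-01-12", "sex": "M", "zip": "02139", "visit": "oncology"},
]
voter_roll = [  # public: the same quasi-identifiers, with names attached
    {"name": "A. Example", "dob": "1945-07-31", "sex": "F", "zip": "02138"},
]

def key(r):
    return (r["dob"], r["sex"], r["zip"])

names = {key(v): v["name"] for v in voter_roll}
for h in hospital:
    if key(h) in names:
        # The "anonymous" hospital record, re-identified by a three-field join.
        print(names[key(h)], "->", h["visit"])
```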

Each of these headline-generating stories creates a demand for more controls. But that, in turn, deals a blow to the idea of open data—that the electronic “data exhaust” people exhale more or less every time they do anything in the modern world is actually useful stuff which, were it freely available for analysis, might make that world a better place.

Of cake, and eating it

Modern cars, for example, record in their computers much about how, when and where the vehicle has been used. Comparing the records of many vehicles, says Viktor Mayer-Schönberger of the Oxford Internet Institute, could provide a solid basis for, say, spotting dangerous stretches of road. Similarly, opening up health records, particularly in a country like Britain, which has a national health service, and cross-fertilising them with other personal data might help reveal the multifarious causes of diseases like Alzheimer’s.

This is a true dilemma. People want both perfect privacy and all the benefits of openness. But they cannot have both. The stripping of a few details as the only means of assuring anonymity, in a world choked with data exhaust, cannot work. Poorly anonymised data are only part of the problem. What may be worse is that there is no standard for anonymisation. Every American state, for example, has its own prescription for what constitutes an adequate standard.

Worse still, devising a comprehensive standard may be impossible. Paul Ohm of Georgetown University, in Washington, DC, thinks that this is partly because the availability of new data constantly shifts the goalposts. “If we could pick an industry standard today, it would be obsolete in short order,” he says. Some data, such as those about medical conditions, are more sensitive than others. Some data sets provide great precision in time or place, others merely a year or a postcode. Each set presents its own dangers and requirements.

Fortunately, there are a few easy fixes. Thanks in part to the headlines, many now agree that public release of anonymised data is a bad move. Data could instead be released piecemeal, or kept in-house and accessible by researchers through a question-and-answer mechanism. Or some users could be granted access to raw data, but only in strictly controlled conditions.
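
One well-studied way to build such a question-and-answer mechanism – the article does not name it, so this is an editorial illustration – is to keep the raw data in-house and add calibrated noise to each released answer, the idea behind differential privacy. A minimal sketch:

```python
import random

def noisy_count(records, predicate, epsilon=1.0):
    """Answer 'how many records satisfy predicate?' without releasing rows.

    The difference of two Exp(epsilon) draws is Laplace noise of scale
    1/epsilon, so any single record shifts the answer's distribution only
    slightly. A sketch of the idea, not a production mechanism (no privacy
    budget tracking, no post-processing).
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# The researcher asks a question; the raw rows never leave the data holder.
patients = [{"age": 71, "dx": "dementia"}, {"age": 45, "dx": "asthma"}]
print(noisy_count(patients, lambda r: r["dx"] == "dementia"))
```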

All these approaches, though, are anathema to the open-data movement, because they limit the scope of studies. “If we’re making it so hard to share that only a few have access,” says Tim Althoff, a data scientist at Stanford University, “that has profound implications for science, for people being able to replicate and advance your work.”

Purely legal approaches might mitigate that. Data might come with what have been called “downstream contractual obligations”, outlining what can be done with a given data set and holding any onward recipients to the same standards. One perhaps draconian idea, suggested by Daniel Barth-Jones, an epidemiologist at Columbia University, in New York, is to make it illegal even to attempt re-identification….(More).”

How Our Days Became Numbered


Review by Clive Cookson of ‘How Our Days Became Numbered’, by Dan Bouk in the Financial Times: “The unemployed lumber worker whose 1939 portrait adorns the cover of How Our Days Became Numbered has a “face fit for a film star”, as Dan Bouk puts it. But he is not there for his looks. Bouk wants us to focus on his bulging bicep, across which is tattooed “SSN 535-07-5248”: his social security number.

The photograph of Thomas Cave by documentary photographer Dorothea Lange illustrates the high water mark of American respect for statistical labelling. Cave was so proud of his newly issued number that he had it inked forever on his arm.

When the Roosevelt administration introduced the federal social security system in the 1930s, it worked out rates of contribution and benefit on the basis of statistical practices already established by life insurance companies. The industry is at the heart of Bouk’s history of personal data collection and analysis — because it worked out how to measure and predict the health of ordinary Americans in the late 19th and early 20th centuries. (More)”

‘Smart Cities’ Will Know Everything About You


Mike Weston in the Wall Street Journal: “From Boston to Beijing, municipalities and governments across the world are pledging billions to create “smart cities”—urban areas covered with Internet-connected devices that control citywide systems, such as transit, and collect data. Although the details can vary, the basic goal is to create super-efficient infrastructure, aid urban planning and improve the well-being of the populace.

A byproduct of a tech utopia will be a prodigious amount of data collected on the inhabitants. For instance, at the company I head, we recently undertook an experiment in which some staff volunteered to wear devices around the clock for 10 days. We monitored more than 170 metrics reflecting their daily habits and preferences—including how they slept, where they traveled and how they felt (a fast heart rate and no movement can indicate excitement or stress).
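
The parenthetical hints at how such inferences work: rules (or learned models) over combinations of metrics. A toy, hypothetical illustration – thresholds and labels are invented:

```python
def label_state(heart_rate_bpm, movement_g):
    """Toy version of the article's example: a fast heart rate with no
    movement can indicate excitement or stress. Thresholds are invented."""
    if heart_rate_bpm > 100 and movement_g < 0.05:
        return "possible excitement or stress"
    if heart_rate_bpm > 100:
        return "likely physical activity"
    return "baseline"

print(label_state(118, 0.01))  # -> "possible excitement or stress"
```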

If the Internet age has taught us anything, it’s that where there is information, there is money to be made. With so much personal information available and countless ways to use it, businesses and authorities will be faced with a number of ethical questions.

In a fully “smart” city, every movement an individual makes can be tracked. The data will reveal where she works, how she commutes, her shopping habits, places she visits and her proximity to other people. You could argue that this sort of tracking already exists via various apps and on social-media platforms, or is held by public-transport companies and e-commerce sites. The difference is that with a smart city this data will be centralized and easy to access. Given the value of this data, it’s conceivable that municipalities or private businesses that pay to create a smart city will seek to recoup their expenses by selling it….

Recent history—issues of privacy and security on social networks and chatting apps, and questions about how intellectual-property regulations apply online—has shown that the law has been slow to catch up with digital innovations. So businesses that can purchase smart-city data will be presented with many strategic and ethical concerns.

What degree of targeting is too specific and violates privacy? Should businesses limit the types of goods or services they offer to certain individuals? Is it ethical for data—on an employee’s eating habits, for instance—to be sold to employers or to insurance companies to help them assess claims? Do individuals own their own personal data once it enters the smart-city system?

With or without stringent controlling legislation, businesses in a smart city will need to craft their own policies and procedures regarding the use of data. A large-scale misuse of personal data could provoke a consumer backlash that could cripple a company’s reputation and lead to monster lawsuits. An additional problem is that businesses won’t know which individuals might welcome the convenience of targeted advertising and which will find it creepy—although data science could solve this equation eventually by predicting where each individual’s privacy line is.

A smart city doesn’t have to be as Orwellian as it sounds. If businesses act responsibly, there is no reason why what sounds intrusive in the abstract can’t revolutionize the way people live for the better by offering services that anticipate their needs; by designing ultraefficient infrastructure that makes commuting a (relative) dream; or with a revolutionary approach to how energy is generated and used by businesses and the populace at large….(More)”

The case for data ethics


Steven Tiell at Accenture: “Personal data is the coin of the digital realm, which for business leaders creates a critical dilemma. Companies are being asked to gather more types of data faster than ever to maintain a competitive edge in the digital marketplace; at the same time, however, they are being asked to provide pervasive and granular control mechanisms over the use of that data throughout the data supply chain.

The stakes couldn’t be higher. If organizations, or the platforms they use to deliver services, fail to secure personal data, they expose themselves to tremendous risk—from eroding brand value and the hard-won trust of established vendors and customers to ceding market share, from violating laws to costing top executives their jobs.

To distinguish their businesses in this marketplace, leaders should be asking themselves two questions. What are the appropriate standards and practices our company needs to have in place to govern the handling of data? And how can our company make strong data controls a value proposition for our employees, customers and partners?

Defining effective compliance activities to support legal and regulatory obligations can be a starting point. However, mere compliance with existing regulations—which are, for the most part, focused on privacy—is insufficient. Respect for privacy is a byproduct of high ethical standards, but it is only part of the picture. Companies need to embrace data ethics, an expansive set of practices and behaviors grounded in a moral framework for the betterment of a community (however defined).

Raising the bar

Why ethics? When communities of people—in this case, the business community at large—encounter new influences, the way they respond to and engage with those influences becomes the community’s shared ethics. Individuals who behave in accordance with these community norms are said to be moral, and those who are exemplary are able to gain the trust of their community.

Over time, as ethical standards within a community shift, the bar for trustworthiness is raised on the assumption that participants in civil society must, at a minimum, adhere to the rule of law. And thus, to maintain moral authority and a high degree of trust, actors in a community must constantly evolve to adopt the highest ethical standards.

Actors in the big data community, where security and privacy are at the core of relationships with stakeholders, must adhere to a high ethical standard to gain this trust. This requires them to go beyond privacy law and existing data control measures. It will also reward those who practice strong ethical behaviors and a high degree of transparency at every stage of the data supply chain. The most successful actors will become the platform-based trust authorities, and others will depend on these platforms for disclosure, sharing and analytics of big data assets.

Data ethics becomes a value proposition only once controls and capabilities are in place to granularly manage data assets at scale throughout the data supply chain. It is also beneficial when a community shares the same behavioral norms and taxonomy to describe the data itself, the ethical decision points along the data supply chain, and how those decisions lead to beneficial or harmful impacts….(More)”

Why Protecting Data Privacy Matters, and When


Anne Russell at Data Science Central: “It’s official. Public concerns over the privacy of data used in digital approaches have reached an apex. Worried about the safety of digital networks, consumers want to regain what they increasingly sense as a loss of control over how their data is used. It’s not hard to see why. Look at the extent of coverage of the U.S. Government data breach last month and the sheer growth in the number of attacks against government and others overall. Then there is the increasing coverage of the inherent security flaws built into the internet, through which most of our data flows. The costs of data breaches to individuals, industries, and government are adding up. And users are taking note…

If you’re not sure whether the data fueling your approach will raise privacy and security flags, consider the following. When it comes to privacy and security, not all data is of equal concern. Much depends on the level of detail in the data content, the data type and structure, its volume and velocity, and indeed how the data itself will be used and released.

First there is the data where security and privacy have always mattered and for which there is already a well-established body of law in place. Foremost among these is classified or national security data, whose usage is highly regulated and enforced. Other data for which there exists a considerable body of international and national law regulating usage includes:

  • Proprietary Data – specifically the data that makes up the intellectual capital of individual businesses and gives them their competitive economic advantage over others, including data protected under copyright, patent, or trade secret laws and the sensitive, protected data that companies collect on behalf of their customers;
  • Infrastructure Data – data from the physical facilities and systems – such as roads, electrical systems, communications services, etc. – that enable local, regional, national, and international economic activity; and
  • Controlled Technical Data – technical, biological, chemical, and military-related data and research that could be considered of national interest and be under foreign export restrictions….

The second group of data that raises privacy and security concerns is personal data. Commonly referred to as Personally Identifiable Information (PII), it is any data that distinguishes individuals from each other. It is also the data that an increasing number of digital approaches rely on, and the data whose use tends to raise the most public ire. …

A third category of data needing privacy consideration is the data related to good people working in difficult or dangerous places. Activists, journalists, politicians, whistle-blowers, business owners, and others working in contentious areas and conflict zones need secure means to communicate and share data without fear of retribution and personal harm. That there are parts of the world where individuals can be in mortal danger for speaking out is one of the reasons that TOR (The Onion Router) has received substantial funding from multiple government and philanthropic groups, even at the high risk of enabling anonymized criminal behavior. Indeed, in the absence of alternate secure networks on which to pass data, many would be in grave danger, including the organizers of the Arab Spring in 2010 as well as dissidents in Syria and elsewhere….(More)”

When Guarding Student Data Endangers Valuable Research


Susan M. Dynarski in the New York Times: “There is widespread concern over threats to privacy posed by the extensive personal data collected by private companies and public agencies.

Some of the potential danger comes from the government: The National Security Agency has swept up the telephone records of millions of people, in what it describes as a search for terrorists. Other threats are posed by hackers, who have exploited security gaps to steal data from retail giants like Target and from the federal Office of Personnel Management.

Resistance to data collection was inevitable — and it has been particularly intense in education.

Privacy laws have already been strengthened in some states, and multiple bills now pending in state legislatures and in Congress would tighten the security and privacy of student data. Some of this proposed legislation is so broadly written, however, that it could unintentionally choke off the use of student data for its original purpose: assessing and improving education. This data has already exposed inequities, allowing researchers and advocates to pinpoint where poor, nonwhite and non-English-speaking children have been educated inadequately by their schools.

Data gathering in education is indeed extensive: Across the United States, large, comprehensive administrative data sets now track the academic progress of tens of millions of students. Educators parse this data to understand what is working in their schools. Advocates plumb the data to expose unfair disparities in test scores and graduation rates, building cases to target more resources for the poor. Researchers rely on this data when measuring the effectiveness of education interventions.

To my knowledge there has been no large-scale, Target-like theft of private student records — probably because students’ test scores don’t have the market value of consumers’ credit card numbers. Parents’ concerns have mainly centered not on theft, but on the sharing of student data with third parties, including education technology companies. Last year, parents resisted efforts by the tech start-up InBloom to draw data on millions of students into the cloud and return it to schools as teacher-friendly “data dashboards.” Parents were deeply uncomfortable with a third party receiving and analyzing data about their children.

In response to such concerns, some pending legislation would scale back the authority of schools, districts and states to share student data with third parties, including researchers. Perhaps the most stringent of these proposals, sponsored by Senator David Vitter, a Louisiana Republican, would effectively end the analysis of student data by outside social scientists. This legislation would have banned recent prominent research documenting the benefits of smaller classes, the value of excellent teachers and the varied performance of charter schools.

Under current law, education agencies can share data with outside researchers only to benefit students and improve education. Collaborations with researchers allow districts and states to tap specialized expertise that they otherwise couldn’t afford. The Boston public school district, for example, has teamed up with early-childhood experts at Harvard to plan and evaluate its universal prekindergarten program.

In one of the longest-standing research partnerships, the University of Chicago works with the Chicago Public Schools to improve education. Partnerships like Chicago’s exist across the nation, funded by foundations and the United States Department of Education. In one initiative, a Chicago research consortium compiled reports showing high school principals that many of the seniors they had sent off to college swiftly dropped out without earning a degree. This information spurred efforts to improve high school counseling and college placement.

Specific, tailored information in the hands of teachers, principals or superintendents empowers them to do better by their students. No national survey could have told Chicago’s principals how their students were doing in college. Administrative data can provide this information, cheaply and accurately…(More)”