Privacy-Preserving Record Linkage in the context of a National Statistics Institute


Guidance by Rainer Schnell: “Linking existing administrative data sets on the same units is used increasingly as a research strategy in many different fields. Depending on the academic field, this kind of operation has been given different names, but in application areas, this approach is mostly denoted as record linkage. Although linking data on organisations or economic entities is common, the most interesting applications of record linkage concern data on persons. Starting in medicine, this approach is now also being used in the social sciences and official statistics. Furthermore, the joint use of survey data with administrative data is now standard practice. For example, victimisation surveys are linked to police records, labour force surveys are linked to social security databases, and censuses are linked to surveys.

Merging different databases containing information on the same unit is technically trivial if all involved databases have a common identification number, such as a social security number or, as in the Scandinavian countries, a permanent personal identification number. Most of the modern identification numbers contain checksum mechanisms so that errors in these identifiers can be easily detected and corrected. Due to the many advantages of permanent personal identification numbers, similar systems have been introduced or discussed in some European countries outside Scandinavia.

In many jurisdictions, no permanent personal identification number is available for linkage. Examples are New Zealand, Australia, the UK, and Germany. Here, the linkage is most often based on alphanumeric identifiers such as surname, first name, address, and place of birth. In the literature, such identifiers are most often denoted as indirect or quasi-identifiers. Such identifiers are prone to error, for example, due to typographical errors, memory faults (previous addresses), different recordings of the same identifier (for example, swapping of substrings: reversal of first name and last name), deliberately false information (for example, year of birth) or changes of values over time (for example name changes due to marriages). Linking on exact matching information, therefore, yields only a non-randomly selected subset of records.

Furthermore, the quality of identifying information in databases containing only indirect identifiers is much lower than usually expected. Error rates in excess of 20% and more records containing incomplete or erroneous identifiers are encountered in practice….(More)”.

Personal data, public data, privacy & power: GDPR & company data


Open Corporates: “…there are three other aspects which are relevant when talking about access to EU company data.

Cargo-culting GDPR

The first, is a tendency to take this complex and subtle legislation that is GDPR and use a poorly understood version in other legislation and regulation, even if that regulation is already covered by GDPR. This actually undermines the GDPR regime, and prevents it from working effectively, and should strongly be resisted. In the tech world, such approaches are called ‘cargo-culting’.

Similarly GDPR is often used as an excuse for not releasing company information as open data, even when the same data is being sold to third parties apparently without concerns — if one is covered by GDPR, the other certainly should be.

Widened power asymmetries

The second issue is the unintended consequences of GDPR, specifically the way it increases asymmetries of power and agency. For example, something like the so-called Right To Be Forgotten takes very significant resources to implement, and so actually strengthens the position of the giant tech companies — for such companies, investing millions in large teams to decide who should and should not be given the Right To Be Forgotten is just a relatively small cost of doing business.

Another issue is the growth of a whole new industry dedicated to removing traces of people’s past from the internet (2), which is also increasing the asymmetries of power. The vast majority of people are not directors of companies, or beneficial owners, and it is only the relatively rich and powerful (including politicians and criminals) who can afford lawyers to stifle free speech, or remove parts of their past they would rather not be there, from business failures to associations with criminals.

OpenCorporates, for example, was threatened with a lawsuit from a member of one of the wealthiest families in Europe for reproducing a gazette notice from the Luxembourg official gazette (a publication that contains public notices). We refused to back down, believing we had a good case in law and in the public interest, and the other side gave up. But such so-called SLAPP suits are becoming increasingly common, although unlike many US states there are currently no defences in place to resist these in the EU, despite pressure from civil society to address this….

At the same time, the automatic assumption that all Personally Identifiable Information (PII), someone’s name for example, is private is highly problematic, confusing both citizens and policy makers, and further undermining democracies and fair societies. As an obvious case, it’s critical that we know the names of our elected representatives, and those in positions of power, otherwise we would have an opaque society where decisions are made by nameless individuals with opaque agendas and personal interests — such as a leader awarding a contract to their brother’s company, for example.

As the diagram below illustrates, there is some personally identifiable information that it’s strongly in the public interest to know. Take the director or beneficial owner of a company, for example, of course their details are PII — clearly you need to know their name (and other information too), otherwise what actually do you know about them, or the company (only that some unnamed individual has been given special protection under law to be shielded from the company’s debts and actions, and yet can benefit from its profits)?

On the other hand, much of the data which is truly about our privacy — the profiles, inferences and scores that companies store on us — is explicitly outside GDPR, if it doesn’t contain PII.

Image for post

Hopefully, as awareness of the issues increases, we will develop a more nuanced, deeper, understanding of privacy, such that case law around GDPR, and successors to this legislation begin to rebalance and case law starts to bring clarity to the ambiguities of the GDPR….(More)”.

Health Data Privacy under the GDPR: Big Data Challenges and Regulatory Responses


Book edited by Maria Tzanou: “The growth of data collecting goods and services, such as ehealth and mhealth apps, smart watches, mobile fitness and dieting apps, electronic skin and ingestible tech, combined with recent technological developments such as increased capacity of data storage, artificial intelligence and smart algorithms have spawned a big data revolution that has reshaped how we understand and approach health data. Recently the COVID-19 pandemic has foregrounded a variety of data privacy issues. The collection, storage, sharing and analysis of health- related data raises major legal and ethical questions relating to privacy, data protection, profiling, discrimination, surveillance, personal autonomy and dignity.

This book examines health privacy questions in light of the GDPR and the EU’s general data privacy legal framework. The GDPR is a complex and evolving body of law that aims to deal with several technological and societal health data privacy problems, while safeguarding public health interests and addressing its internal gaps and uncertainties. The book answers a diverse range of questions including: What role can the GDPR play in regulating health surveillance and big (health) data analytics? Can it catch up with the Internet age developments? Are the solutions to the challenges posed by big health data to be found in the law? Does the GDPR provide adequate tools and mechanisms to ensure public health objectives and the effective protection of privacy? How does the GDPR deal with data that concern children’s health and academic research?

By analysing a number of diverse questions concerning big health data under the GDPR from various different perspectives, this book will appeal to those interested in privacy, data protection, big data, health sciences, information technology, the GDPR, EU and human rights law….(More)”.

Data Justice and COVID-19: Global Perspectives


Book edited by edited by Linnet Taylor, Aaron Martin, Gargi Sharma and Shazade Jameson: “In early 2020, as the COVID-19 pandemic swept the world and states of emergency were declared by one country after another, the global technology sector—already equipped with unprecedented wealth, power, and influence—mobilised to seize the opportunity. This collection is an account of what happened next and captures the emergent conflicts and responses around the world. The essays provide a global perspective on the implications of these developments for justice: they make it possible to compare how the intersection of state and corporate power—and the way that power is targeted and exercised—confronts, and invites resistance from, civil society in countries worldwide.

This edited volume captures the technological response to the pandemic in 33 countries, accompanied by nine thematic reflections, and reflects the unfolding of the first wave of the pandemic.

This book can be read as a guide to the landscape of technologies deployed during the pandemic and also be used to compare individual country strategies. It will prove useful as a tool for teaching and learning in various disciplines and as a reference point for activists and analysts interested in issues of data justice.

The essays interrogate these technologies and the political, legal, and regulatory structures that determine how they are applied. In doing so,the book exposes the workings of state technological power to critical assessment and contestation….(More)”

Strengthening Privacy Protections in COVID-19 Mobile Phone–Enhanced Surveillance Programs


Rand Report: “Dozens of countries, including the United States, have been using mobile phone tools and data sources for COVID-19 surveillance activities, such as tracking infections and community spread, identifying populated areas at risk, and enforcing quarantine orders. These tools can augment traditional epidemiological interventions, such as contact tracing with technology-based data collection (e.g., automated signaling and record-keeping on mobile phone apps). As the response progresses, other beneficial technologies could include tools that authenticate those with low risk of contagion or that build community trust as stay-at-home orders are lifted.

However, the potential benefits that COVID-19 mobile phone–enhanced public health (“mobile”) surveillance program tools could provide are also accompanied by potential for harm. There are significant risks to citizens from the collection of sensitive data, including personal health, location, and contact data. People whose personal information is being collected might worry about who will receive the data, how those recipients might use the data, how the data might be shared with other entities, and what measures will be taken to safeguard the data from theft or abuse.

The risk of privacy violations can also impact government accountability and public trust. The possibility that one’s privacy will be violated by government officials or technology companies might dissuade citizens from getting tested for COVID-19, downloading public health–oriented mobile phone apps, or sharing symptom or location data. More broadly, real or perceived privacy violations might discourage citizens from believing government messaging or complying with government orders regarding COVID-19.

As U.S. public health agencies consider COVID-19-related mobile surveillance programs, they will need to address privacy concerns to encourage broad uptake and protect against privacy harms. Otherwise, COVID-19 mobile surveillance programs likely will be ineffective and the data collected unrepresentative of the situation on the ground….(More)“.

What privacy preserving techniques make possible: for transport authorities


Blog by Georgina Bourke: “The Mayor of London listed cycling and walking as key population health indicators in the London Health Inequalities Strategy. The pandemic has only amplified the need for people to use cycling as a safer and healthier mode of transport. Yet as the majority of cyclists are white, Black communities are less likely to get the health benefits that cycling provides. Groups like Transport for London (TfL) should monitor how different communities cycle and who is excluded. Organisations like the London Office of Technology and Innovation (LOTI) could help boroughs procure privacy preserving technology to help their efforts.

But at the moment, it’s difficult for public organisations to access mobility data held by private companies. One reason is because mobility data is sensitive. Even if you remove identifiers like name and address, there’s still a risk you can reidentify someone by linking different data sets together. This means you could track how an individual moved around a city. I wrote more about the privacy risks with mobility data in a previous blog post. The industry’s awareness of privacy issues in using and sharing mobility data is rising. In the case of Los Angeles Department of Transport’s Mobility Data Specification (LADOT), Uber is concerned about sharing anonymised data because of the privacy risk. Both organisations are now involved in a legal battle to see which has the rights to the data. This might have been avoided if Uber had applied privacy preserving techniques….

Privacy preserving techniques can help mobility providers share important insights with authorities without compromising peoples’ privacy.

Instead of requiring access to all customer trip data, authorities could ask specific questions like, where are the least popular places to cycle? If mobility providers apply techniques like randomised response, an individual’s identity is obscured by the noise added to the data. This means it’s highly unlikely that someone could be reidentified later on. And because this technique requires authorities to ask very specific questions – for randomised response to work, the answer has to be binary, ie Yes or No – authorities will also be practicing data minimisation by default.

It’s easy to imagine transport authorities like TfL combining privacy preserved mobility data from multiple mobility providers to compare insights and measure service provision. They could cross reference the privacy preserved bike trip data with demographic data in the local area to learn how different communities cycle. The first step to addressing inequality is being able to measure it….(More)”.

Public perceptions on data sharing: key insights from the UK and the USA


Paper by Saira Ghafur, Jackie Van Dael, Melanie Leis and Ara Darzi, and Aziz Sheikh: “Data science and artificial intelligence (AI) have the potential to transform the delivery of health care. Health care as a sector, with all of the longitudinal data it holds on patients across their lifetimes, is positioned to take advantage of what data science and AI have to offer. The current COVID-19 pandemic has shown the benefits of sharing data globally to permit a data-driven response through rapid data collection, analysis, modelling, and timely reporting.

Despite its obvious advantages, data sharing is a controversial subject, with researchers and members of the public justifiably concerned about how and why health data are shared. The most common concern is privacy; even when data are (pseudo-)anonymised, there remains a risk that a malicious hacker could, using only a few datapoints, re-identify individuals. For many, it is often unclear whether the risks of data sharing outweigh the benefits.

A series of surveys over recent years indicate that the public holds a range of views about data sharing. Over the past few years, there have been several important data breaches and cyberattacks. This has resulted in patients and the public questioning the safety of their data, including the prospect or risk of their health data being shared with unauthorised third parties.

We surveyed people across the UK and the USA to examine public attitude towards data sharing, data access, and the use of AI in health care. These two countries were chosen as comparators as both are high-income countries that have had substantial national investments in health information technology (IT) with established track records of using data to support health-care planning, delivery, and research. The UK and USA, however, have sharply contrasting models of health-care delivery, making it interesting to observe if these differences affect public attitudes.

Willingness to share anonymised personal health information varied across receiving bodies (figure). The more commercial the purpose of the receiving institution (eg, for an insurance or tech company), the less often respondents were willing to share their anonymised personal health information in both the UK and the USA. Older respondents (≥35 years) in both countries were generally less likely to trust any organisation with their anonymised personal health information than younger respondents (<35 years)…

Despite the benefits of big data and technology in health care, our findings suggest that the rapid development of novel technologies has been received with concern. Growing commodification of patient data has increased awareness of the risks involved in data sharing. There is a need for public standards that secure regulation and transparency of data use and sharing and support patient understanding of how data are used and for what purposes….(More)”.

Interventions to mitigate the racially discriminatory impacts of emerging tech including AI


Joint Civil Society Statement: “As widespread recent protests have highlighted, racial inequality remains an urgent and devastating issue around the world, and this is as true in the context of technology as it is everywhere else. In fact, it may be more so, as algorithmic technologies based on big data are deployed at previously unimaginable scale, reproducing the discriminatory systems that build and govern them.

The undersigned organizations welcome the publication of the report “Racial discrimination and emerging digital technologies: a human rights analysis,” by Special Rapporteur on contemporary forms of racism, racial discrimination, xenophobia and related intolerance, E. Tendayi Achiume, and wish to underscore the importance and timeliness of a number of the recommendations made therein:

  1. Technologies that have had or will have significant racially discriminatory impacts should be banned outright.
    While incremental regulatory approaches may be appropriate in some contexts, where a technology is demonstrably likely to cause racially discriminatory harm, it should not be deployed until that harm can be prevented. Moreover, certain technologies may always have disparate racial impacts, no matter how much their accuracy can be improved. In the present moment, racially discriminatory technologies include facial and affect recognition technology and so-called predictive analytics. We support Special Rapporteur Achiume’s call for mandatory human rights impact assessments as a prerequisite for the adoption of new technologies. We also believe that where such assessments reveal that a technology has a high likelihood of deleterious racially disparate impacts, states should prevent its use through a ban or moratorium. We join the Special Rapporteur in welcoming recent municipal bans, for example, on the use of facial recognition technology, and encourage national governments to adopt similar policies.  Correspondingly, we reiterate our support for states’ imposition of an immediate moratorium on the trade and use of privately developed surveillance tools until such time as states enact appropriate safeguards, and congratulate Special Rapporteur Achiume on joining that call.
  2. Gender mainstreaming and representation along racial, national and other intersecting identities requires radical improvement at all levels of the tech sector.
  3. Technologists cannot solve political, social, and economic problems without the input of domain experts and those personally impacted.
  4. Access to technology is as urgent an issue of racial discrimination as inequity in the design of technologies themselves.
  5. Representative and disaggregated data is a necessary, if not sufficient, condition for racial equity in emerging digital technologies, but it must be collected and managed equitably as well.
  6. States as well as corporations must provide remedies for racial discrimination, including reparations.… (More)”.

The Atlas of Surveillance


Electronic Frontier Foundation: “Law enforcement surveillance isn’t always secret. These technologies can be discovered in news articles and government meeting agendas, in company press releases and social media posts. It just hasn’t been aggregated before.

That’s the starting point for the Atlas of Surveillance, a collaborative effort between the Electronic Frontier Foundation and the University of Nevada, Reno Reynolds School of Journalism. Through a combination of crowdsourcing and data journalism, we are creating the largest-ever repository of information on which law enforcement agencies are using what surveillance technologies. The aim is to generate a resource for journalists, academics, and, most importantly, members of the public to check what’s been purchased locally and how technologies are spreading across the country.

We specifically focused on the most pervasive technologies, including drones, body-worn cameras, face recognition, cell-site simulators, automated license plate readers, predictive policing, camera registries, and gunshot detection. Although we have amassed more than 5,000 datapoints in 3,000 jurisdictions, our research only reveals the tip of the iceberg and underlines the need for journalists and members of the public to continue demanding transparency from criminal justice agencies….(More)”.

Differential Privacy for Privacy-Preserving Data Analysis


Introduction to a Special Blog Series by NIST: “…How can we use data to learn about a population, without learning about specific individuals within the population? Consider these two questions:

  1.  “How many people live in Vermont?”
  2. “How many people named Joe Near live in Vermont?”

The first reveals a property of the whole population, while the second reveals information about one person. We need to be able to learn about trends in the population while preventing the ability to learn anything new about a particular individual. This is the goal of many statistical analyses of data, such as the statistics published by the U.S. Census Bureau, and machine learning more broadly. In each of these settings, models are intended to reveal trends in populations, not reflect information about any single individual.

But how can we answer the first question “How many people live in Vermont?” — which we’ll refer to as a query — while preventing the second question from being answered “How many people name Joe Near live in Vermont?” The most widely used solution is called de-identification (or anonymization), which removes identifying information from the dataset. (We’ll generally assume a dataset contains information collected from many individuals.) Another option is to allow only aggregate queries, such as an average over the data. Unfortunately, we now understand that neither approach actually provides strong privacy protection. De-identified datasets are subject to database-linkage attacks. Aggregation only protects privacy if the groups being aggregated are sufficiently large, and even then, privacy attacks are still possible [1, 2, 3, 4]. 

Differential Privacy

Differential privacy [5, 6] is a mathematical definition of what it means to have privacy. It is not a specific process like de-identification, but a property that a process can have. For example, it is possible to prove that a specific algorithm “satisfies” differential privacy.

Informally, differential privacy guarantees the following for each individual who contributes data for analysis: the output of a differentially private analysis will be roughly the same, whether or not you contribute your data. A differentially private analysis is often called a mechanism, and we denote it ℳ.

Figure 1: Informal Definition of Differential Privacy
Figure 1: Informal Definition of Differential Privacy

Figure 1 illustrates this principle. Answer “A” is computed without Joe’s data, while answer “B” is computed with Joe’s data. Differential privacy says that the two answers should be indistinguishable. This implies that whoever sees the output won’t be able to tell whether or not Joe’s data was used, or what Joe’s data contained.

We control the strength of the privacy guarantee by tuning the privacy parameter ε, also called a privacy loss or privacy budget. The lower the value of the ε parameter, the more indistinguishable the results, and therefore the more each individual’s data is protected.

Figure 2: Formal Definition of Differential Privacy
Figure 2: Formal Definition of Differential Privacy

We can often answer a query with differential privacy by adding some random noise to the query’s answer. The challenge lies in determining where to add the noise and how much to add. One of the most commonly used mechanisms for adding noise is the Laplace mechanism [5, 7]. 

Queries with higher sensitivity require adding more noise in order to satisfy a particular `epsilon` quantity of differential privacy, and this extra noise has the potential to make results less useful. We will describe sensitivity and this tradeoff between privacy and usefulness in more detail in future blog posts….(More)”.