Protecting the Confidentiality of America’s Statistics: Adopting Modern Disclosure Avoidance Methods at the Census Bureau


John Abowd at US Census: “…Throughout our history, we have been leaders in statistical data protection, which we call disclosure avoidance. Other statistical agencies use the terms “disclosure limitation” and “disclosure control.” These terms are all synonymous. Disclosure avoidance methods have evolved since the censuses of the early 1800s, when the only protection used was simply removing names. Executive orders and a series of laws modified the legal basis for these protections, which were finally codified in the 1954 Census Act (13 U.S.C. Sections 8(b) and 9). We have continually added better and stronger protections to keep the data we publish anonymous and the underlying records confidential.

However, historical methods cannot completely defend against the threats posed by today’s technology. Growth in computing power, advances in mathematics, and easy access to large, public databases pose a significant threat to confidentiality. These forces have made it possible for sophisticated users to ferret out common data points between databases using only our published statistics. If left unchecked, those users might be able to stitch together these common threads to identify the people or businesses behind the statistics, as was done in the case of the Netflix Challenge.

The Census Bureau has been addressing these issues from every feasible angle and changing rapidly with the times to ensure that we protect the data our census and survey respondents provide us. We are doing this by moving to a new, advanced, and far more powerful confidentiality protection system, which uses a rigorous mathematical process that protects respondents’ information and identity in all of our publications.

The new tool is based on the concept known in scientific and academic circles as “differential privacy.” It is also called “formal privacy” because it provides provable mathematical guarantees, similar to those found in modern cryptography, about the confidentiality protections that can be independently verified without compromising the underlying protections.

“Differential privacy” is based on the cryptographic principle that an attacker should not be able to learn any more about you from the statistics we publish using your data than from statistics that did not use your data. After tabulating the data, we apply carefully constructed algorithms to modify the statistics in a way that protects individuals while continuing to yield accurate results. We assume that everyone’s data are vulnerable and provide the same strong, state-of-the-art protection to every record in our database.
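To make the mechanism concrete, here is a minimal sketch of the classic Laplace mechanism, the textbook way of satisfying differential privacy by adding calibrated noise to a tabulated count. It is illustrative only, not the Census Bureau’s production algorithm; the counts, the epsilon value, and the function name are assumptions chosen for the example.

```python
import numpy as np

def laplace_mechanism(true_count: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private version of a count.

    Adding Laplace noise with scale sensitivity/epsilon satisfies
    epsilon-differential privacy: the published value reveals roughly
    the same thing whether or not any single respondent's record is
    included in the tabulation.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Illustrative use: protect a small block-level population count.
# (Values here are made up; real deployments tune epsilon carefully.)
true_population = 42   # tabulated count for one census block
sensitivity = 1        # adding or removing one person changes a count by at most 1
epsilon = 0.5          # privacy-loss budget spent on this statistic
print(laplace_mechanism(true_population, sensitivity, epsilon))
```

The smaller the epsilon, the larger the typical noise and the stronger the guarantee; accuracy is traded off explicitly rather than relying on ad hoc suppression.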

The Census Bureau did not invent the science behind differential privacy. However, we were the first organization anywhere to use it when we incorporated differential privacy into the OnTheMap application in 2008. There, it was used to protect block-level residential population data. Recently, Google, Apple, Microsoft, and Uber have all followed the Census Bureau’s lead, adopting differentially private systems as the standard for protecting user data confidentiality inside their browsers (Chrome), products (iPhones), operating systems (Windows 10), and apps (Uber)….(More)”.

Origin Privacy: Protecting Privacy in the Big-Data Era


Paper by Helen Nissenbaum, Sebastian Benthall, Anupam Datta, Michael Carl Tschantz, and Piotr Mardziel: “Machine learning over big data poses challenges for our conceptualization of privacy. Such techniques can discover surprising and counterintuitive associations that turn innocent-looking data into important inferences about a person. For example, buying carbon monoxide monitors has been linked to paying credit card bills, while buying chrome-skull car accessories predicts not doing so. Also, Target may have used the buying of scent-free hand lotion and vitamins as a sign that the buyer is pregnant. If we take pregnancy status to be private and assume that we should prohibit the sharing of information that can reveal that fact, then we have created an unworkable notion of privacy, one in which sharing any scrap of data may violate privacy.

Prior technical specifications of privacy depend on the classification of certain types of information as private or sensitive; privacy policies in these frameworks limit access to data that allow inference of this sensitive information. As the above examples show, today’s data-rich world creates a new kind of problem: it is difficult if not impossible to guarantee that information does not allow inference of sensitive topics. This makes information flow rules based on information topic unstable.

We address the problem of providing a workable definition of private data that takes into account emerging threats to privacy from large-scale data collection systems. We build on Contextual Integrity and its claim that privacy is appropriate information flow, or flow according to socially or legally specified rules.

As in other adaptations of Contextual Integrity (CI) to computer science, the parameterization of social norms in CI is translated into a logical specification. In this work, we depart from CI by considering rules that restrict information flow based on its origin and provenance, instead of on its type, topic, or subject.
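The following toy sketch illustrates the general idea of an origin-based flow rule; it is not the authors’ formalism. The record fields, the rule table, and the recipient names are all hypothetical, chosen only to show a policy check keyed to where data came from rather than to what topic it might reveal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Datum:
    value: str
    origin: str        # where the data was collected, e.g. a point-of-sale system
    provenance: tuple  # chain of processing steps it has passed through

# Hypothetical origin-based rules: a flow is allowed or denied based on
# the data's origin, not on the sensitive topics it might allow inferring.
ALLOWED_FLOWS = {
    ("pharmacy_purchases", "billing"): True,
    ("pharmacy_purchases", "ad_targeting"): False,
    ("motorway_cctv", "traffic_research"): True,
}

def flow_permitted(datum: Datum, recipient: str) -> bool:
    """Check an information flow against origin-based rules.

    Unknown (origin, recipient) pairs default to deny, mirroring the idea
    that flows must be explicitly authorised by contextual norms.
    """
    return ALLOWED_FLOWS.get((datum.origin, recipient), False)

lotion_purchase = Datum("scent-free hand lotion", "pharmacy_purchases", ("pos_terminal",))
print(flow_permitted(lotion_purchase, "billing"))       # True: explicitly allowed
print(flow_permitted(lotion_purchase, "ad_targeting"))  # False: explicitly denied
print(flow_permitted(lotion_purchase, "research"))      # False: no rule, default deny
```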

We call this concept of privacy as adherence to origin-based rules Origin Privacy. Origin Privacy rules can be found in some existing data protection laws. This motivates the computational implementation of origin-based rules for the simple purpose of compliance engineering. We also formally model origin privacy to determine what security properties it guarantees relative to the concerns that motivate it….(More)”.

Sharing the benefits: How to use data effectively in the public sector


Report by Sarah Timmis, Luke Heselwood and Eleonora Harwich (for Reform UK): “This report demonstrates the potential of data sharing to transform the delivery of public services and improve outcomes for citizens. It explores how government can overcome various challenges to ‘get data right’ and enable better use of personal data within and between public-sector organisations.

Ambition meets reality

Government is set on using data more effectively to help deliver better public services. Better use of data can improve the design, efficiency and outcomes of services. For example, sharing data digitally between GPs and hospitals can enable early identification of patients most at risk of hospital admission, which has reduced admissions by up to 30 per cent in Somerset. Bristol’s Homeless Health Service allows access to medical, psychiatric, social and prison data, helping to provide a clearer picture of the complex issues facing the city’s homeless population. However, government has not yet created a clear data infrastructure, which would allow data to be shared across multiple public services, meaning efforts on the ground have not always delivered results.

The data: sticking points

Several technical challenges must be overcome to create the right data infrastructure. Individual pieces of data must be presented in standard formats to enable sharing within and across services. Data quality can be improved at the point of data collection, through better monitoring of data quality and standards within public-sector organisations, and through data-curation processes. Personal data also needs to be presented in a consistent format so that, where appropriate, records can be linked to identify individuals. Interoperability issues and legacy systems act as significant barriers to data linking. The London Metropolitan Police alone use 750 different systems, many of which are incompatible. Technical solutions, such as Application Programming Interfaces (APIs), can be overlaid on top of legacy systems to improve interoperability and enable data sharing. However, this is only possible with the right standards and a solid new data model. To encourage competition and improve interoperability in the longer term, procurement rules should make interoperability a prerequisite for competing companies, allowing customers to integrate their choices of the most appropriate products from different vendors.
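As a rough illustration of overlaying an API on a legacy system, the sketch below maps a made-up legacy record layout onto an agreed standard schema and exposes it through a minimal HTTP endpoint. The legacy field names, the target schema, and the route are assumptions for the example, not the report’s specification.

```python
from flask import Flask, jsonify  # lightweight API layer over a legacy store

# Hypothetical legacy record, in whatever shape the old system uses.
LEGACY_SYSTEM = {
    "001": {"SURNAME": "SMITH", "DOB_DDMMYY": "010190", "ADDR1": "1 HIGH ST"},
}

def to_standard_format(record: dict) -> dict:
    """Map a legacy record onto an agreed standard schema (ISO dates, lower-case keys)."""
    dob = record["DOB_DDMMYY"]
    day, month, year = dob[:2], dob[2:4], dob[4:]
    return {
        "family_name": record["SURNAME"].title(),
        "date_of_birth": f"19{year}-{month}-{day}",  # assumes 20th-century dates
        "address_line_1": record["ADDR1"].title(),
    }

app = Flask(__name__)

@app.route("/records/<record_id>")
def get_record(record_id: str):
    """Expose legacy data through a standard, shareable interface."""
    record = LEGACY_SYSTEM.get(record_id)
    if record is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(to_standard_format(record))

if __name__ == "__main__":
    app.run(port=5000)
```

The point is that the legacy store itself is untouched: the API layer handles translation into the shared standard, which is what makes linking across services feasible.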

Building trustworthiness

The ability to share data at scale through the internet has brought new threats to the security and privacy of personal information, amplifying the need for trust between government and citizens and across government departments. Currently, just 9 per cent of people feel that the Government has their best interests at heart when data sharing, and only 15 per cent are confident that government organisations would deal well with a cyber-attack. Considering that attitudes towards data sharing are time and context dependent, better engagement with citizens and clearer explanations of when and why data is used can help build confidence. Auditability is also key: people and organisations should be able to track how data is used, so that every interaction with personal data is auditable, transparent and secure. …(More)”.
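One way such auditability could be implemented, purely as an illustrative sketch rather than anything the report prescribes, is an append-only, hash-chained log in which every access to personal data is recorded and later tampering is detectable. The class name, fields, and example accesses below are assumptions.

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained record of every access to personal data.

    Each entry commits to the previous one, so altering history after the
    fact is detectable: a minimal sketch of 'auditable by design'.
    """

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64

    def record_access(self, who: str, dataset: str, purpose: str) -> dict:
        entry = {
            "who": who,
            "dataset": dataset,
            "purpose": purpose,
            "timestamp": time.time(),
            "prev_hash": self._last_hash,
        }
        entry_hash = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = entry_hash
        self._last_hash = entry_hash
        self.entries.append(entry)
        return entry

# Hypothetical accesses by public-sector services.
log = AuditLog()
log.record_access("gp_practice_17", "hospital_admissions", "risk_stratification")
log.record_access("homeless_health_service", "psychiatric_records", "care_planning")
print(json.dumps(log.entries, indent=2))
```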

Remembering and Forgetting in the Digital Age


Book by Florent Thouvenin et al.: “… examines the fundamental question of how legislators and other rule-makers should handle remembering and forgetting information (especially personally identifiable information) in the digital age. It encompasses such topics as privacy, data protection, individual and collective memory, and the right to be forgotten when considering data storage, processing and deletion. The authors argue in support of maintaining the new digital default, that (personally identifiable) information should be remembered rather than forgotten.

The book offers guidelines for legislators as well as private and public organizations on how to make decisions on remembering and forgetting personally identifiable information in the digital age. It draws on three main perspectives: law, based on a comprehensive analysis of Swiss law that serves as an example; technology, specifically search engines, internet archives, social media and the mobile internet; and an interdisciplinary perspective with contributions from various disciplines such as philosophy, anthropology, sociology, psychology, and economics, amongst others. Thanks to this multifaceted approach, readers will benefit from a holistic view of the informational phenomenon of “remembering and forgetting”.

This book will appeal to lawyers, philosophers, sociologists, historians, economists, anthropologists, and psychologists among many others. Such wide appeal is due to its rich and interdisciplinary approach to the challenges for individuals and society at large with regard to remembering and forgetting in the digital age…(More)”

Better ways to measure the new economy


Valerie Hellinghausen and Evan Absher at Kauffman Foundation: “The old measure of “jobs numbers” as an economic indicator is shifting to new metrics to measure a new economy.

With more communities embracing inclusive entrepreneurial ecosystems as the new model of economic development, entrepreneurs, ecosystem builders, and government agencies – at all levels – need to work together on data-driven initiatives. While established measures still have a place, new metrics have the potential to deliver the timely and granular information that is more useful at the local level….

Three better ways to measure the new economy:

  1. National and local datasets: Numbers used to discuss the economy are national level and usually not very timely. These numbers are useful to understand large trends, but fail to capture local realities. One way to better measure local economies is to use local administrative datasets. There are many obstacles with this approach, but the idea is gaining interest. Data infrastructure, policies, and projects are building connections between local and national agencies. Joining different levels of government data will provide national scale and local specificity (see the sketch after this list).
  2. Private and public data: The words private and public typically reflect privacy issues, but there is another public and private dimension. Public institutions possess vast amounts of data, but so do private companies. For instance, sites like PayPal, Square, Amazon, and Etsy possess data that could provide real-time assessment of an individual company’s financial health. The concept of credit and risk could be expanded to benefit those currently underserved, if combined with local administrative information like tax, wage, and banking data. Fair and open use of private data could open credit to currently underfunded entrepreneurs.
  3. New metrics: Developing connections between different datasets will result in new metrics of entrepreneurial activity: metrics that measure human connection, social capital, community creativity, and quality of life; metrics that capture economic activity at the community level and in real time. For example, the Kauffman Foundation has funded research that uses labor data from private job-listing sites to better understand the match between the workforce entrepreneurs need and the workforce available within the immediate community. But new metrics are not enough; they must connect to the final goal of economic independence. Using new metrics to help ecosystems understand how policies and programs impact entrepreneurship is the final step to measuring local economies….(More)”.
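The sketch below shows, in the simplest possible terms, what joining national and local datasets might look like: a coarse quarterly national indicator merged with a granular monthly local administrative series on shared geography and period keys. The column names, geographies, and figures are invented for illustration, not real statistics.

```python
import pandas as pd

# Hypothetical national-level indicator: quarterly and coarse.
national = pd.DataFrame({
    "county_fips": ["20091", "20209"],
    "quarter": ["2018Q1", "2018Q1"],
    "new_business_applications": [410, 180],
})

# Hypothetical local administrative data: monthly and granular.
local = pd.DataFrame({
    "county_fips": ["20091", "20091", "20209"],
    "month": ["2018-01", "2018-02", "2018-01"],
    "new_business_licenses": [52, 61, 17],
})

# Roll the local data up to quarters, then join on geography + period, so the
# combined table carries both national scale and local specificity.
local["quarter"] = pd.PeriodIndex(local["month"], freq="M").asfreq("Q").astype(str)
local_q = local.groupby(["county_fips", "quarter"], as_index=False)["new_business_licenses"].sum()
combined = national.merge(local_q, on=["county_fips", "quarter"], how="left")
print(combined)
```

In practice the hard part is the data infrastructure around this join (consistent identifiers, access agreements, timeliness), not the merge itself.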

Self-Invasion And The Invaded Self


Rochelle Gurstein in the Baffler: “WHAT DO WE LOSE WHEN WE LOSE OUR PRIVACY? This question has become increasingly difficult to answer, living as we do in a society that offers boundless opportunities for men and women to expose themselves (in all dimensions of that word) as never before, to commit what are essentially self-invasions of privacy. Although this is a new phenomenon, it has become as ubiquitous as it is quotidian, and for that reason, it is perhaps one of the most telling signs of our time. To get a sense of the sheer range of unconscious exhibitionism, we need only think of the popularity of reality TV shows, addiction-recovery memoirs, and cancer diaries. Then there are the banal but even more conspicuous varieties, like soaring, all-glass luxury apartment buildings and hotels in which inhabitants display themselves in all phases of their private lives to the casual glance of thousands of city walkers below. Or the incessant sound of people talking loudly—sometimes gossiping, sometimes crying—on their cell phones, broadcasting to total strangers the intimate details of their lives.

And, of course, there are now unprecedented opportunities for violating one’s own privacy, furnished by the technology of the internet. The results are everywhere, from selfies and Instagrammed trivia to the almost automatic, everyday activity of Facebook users registering their personal “likes” and preferences. (As we recently learned, this online pastime is nowhere near as private as we had been led to believe; more than fifty million users’ idly generated “data” was “harvested” by Cambridge Analytica to make “personality profiles” that were then used to target voters with advertisements from Donald Trump’s presidential campaign.)

Beyond these branded and aggressively marketed forums for self-invasions of privacy there are all the giddy, salacious forms that circulate in graphic images and words online—the sort that led not so long ago to the downfall of Anthony Weiner. The mania for attention of any kind is so pervasive—and the invasion of privacy so nonchalant—that many of us no longer notice, let alone mind, what in the past would have been experienced as insolent violations of privacy….(More)”.

Trust, Security, and Privacy in Crowdsourcing


Guest Editorial to Special Issue of IEEE Internet of Things Journal: “As we become increasingly reliant on intelligent, interconnected devices in every aspect of our lives, critical trust, security, and privacy concerns are raised as well.

First, the sensing data provided by individual participants is not always reliable. It may be noisy or even faked for various reasons, such as poor sensor quality, lack of sensor calibration, background noise, context impact, mobility, incomplete view of observations, or malicious attacks. Crowdsourcing applications should be able to evaluate the trustworthiness of collected data in order to filter out the noisy and fake data that may disturb or intrude upon a crowdsourcing system. Second, providing data (e.g., photographs taken with personal mobile devices) or using IoT applications may compromise data providers’ personal data privacy (e.g., location, trajectory, and activity privacy) and identity privacy. Therefore, it becomes essential to assess the trust of the data while preserving the data providers’ privacy. Third, data analytics and mining in crowdsourcing may disclose the privacy of data providers or related entities to unauthorized parties, which lowers the willingness of participants to contribute to the crowdsourcing system, impacts system acceptance, and greatly impedes its further development. Fourth, the identities of data providers could be forged by malicious attackers to intrude upon the whole crowdsourcing system. In this context, trust, security, and privacy have started to attract special attention in order to achieve high quality of service in each step of crowdsourcing with regard to data collection, transmission, selection, processing, analysis and mining, as well as utilization.

Trust, security, and privacy in crowdsourcing receive increasing attention. Many methods have been proposed to protect privacy in the process of data collection and processing. For example, data perturbation can be adopted to hide the real data values during data collection. When preprocessing the collected data, data anonymization (e.g., k-anonymization) and fusion can be applied to break the links between the data and their sources/providers. In the application layer, anonymity is used to mask the real identities of data sources/providers. To enable privacy-preserving data mining, secure multiparty computation (SMC) and homomorphic encryption provide options for protecting raw data when multiple parties jointly run a data mining algorithm. Through cryptographic techniques, no party learns anything other than its own input and the expected results. For data truth discovery, applicable solutions include correlation-based data quality analysis and trust evaluation of data sources. But current solutions are still imperfect, incomplete, and inefficient….(More)”.
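Two of the techniques named above, perturbation at collection time and k-anonymity checking before release, are simple enough to sketch in a few lines. The record fields, noise scale, and quasi-identifier choices below are illustrative assumptions, not the editorial’s prescriptions.

```python
import random
from collections import Counter

def perturb(value: float, scale: float = 5.0) -> float:
    """Data perturbation at collection time: report value + random noise
    so the collector never sees the exact reading (a minimal sketch)."""
    return value + random.uniform(-scale, scale)

def is_k_anonymous(records: list, quasi_identifiers: list, k: int) -> bool:
    """Check k-anonymity: every combination of quasi-identifier values must be
    shared by at least k records, so no row is uniquely linkable to a provider."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical crowdsensed noise readings, already generalised to coarse buckets.
records = [
    {"age_band": "30-39", "zip3": "021", "noise_db": perturb(61.0)},
    {"age_band": "30-39", "zip3": "021", "noise_db": perturb(64.0)},
    {"age_band": "40-49", "zip3": "100", "noise_db": perturb(58.0)},
]
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))  # False: one group has a single record
```

A failed check like the one above would signal that further generalisation or suppression is needed before the data leave the collection system.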

Countries Can Learn from France’s Plan for Public Interest Data and AI


Nick Wallace at the Center for Data Innovation: “French President Emmanuel Macron recently endorsed a national AI strategy that includes plans for the French state to make public and private sector datasets available for reuse by others in applications of artificial intelligence (AI) that serve the public interest, such as for healthcare or environmental protection. Although this strategy fails to set out how the French government should promote widespread use of AI throughout the economy, it will nevertheless give a boost to AI in some areas, particularly public services. Furthermore, the plan for promoting the wider reuse of datasets, particularly in areas where the government already calls most of the shots, is a practical idea that other countries should consider as they develop their own comprehensive AI strategies.

The French strategy, drafted by mathematician and Member of Parliament Cédric Villani, calls for legislation to mandate repurposing both public and private sector data, including personal data, to enable public-interest uses of AI by government or others, depending on the sensitivity of the data. For example, public health services could use data generated by Internet of Things (IoT) devices to help doctors better treat and diagnose patients. Researchers could use data captured by motorway CCTV to train driverless cars. Energy distributors could manage peaks and troughs in demand using data from smart meters.

Repurposed data held by private companies could be made publicly available, shared with other companies, or processed securely by the public sector, depending on the extent to which sharing the data presents privacy risks or undermines competition. The report suggests that the government would not require companies to share data publicly when doing so would impact legitimate business interests, nor would it require that any personal data be made public. Instead, Dr. Villani argues that, if wider data sharing would do unreasonable damage to a company’s commercial interests, it may be appropriate to only give public authorities access to the data. But where the stakes are lower, companies could be required to share the data more widely, to maximize reuse. Villani rightly argues that it is virtually impossible to come up with generalizable rules for how data should be shared that would work across all sectors. Instead, he argues for a sector-specific approach to determining how and when data should be shared.

After making the case for state-mandated repurposing of data, the report goes on to highlight four key sectors as priorities: health, transport, the environment, and defense. Since these all have clear implications for the public interest, France can create national laws authorizing extensive repurposing of personal data without violating the General Data Protection Regulation (GDPR), which allows national laws that permit the repurposing of personal data where it serves the public interest. The French strategy is the first clear effort by an EU member state to proactively use this clause in aid of national efforts to bolster AI….(More)”.

Mapping the Privacy-Utility Tradeoff in Mobile Phone Data for Development


Paper by Alejandro Noriega-Campero, Alex Rutherford, Oren Lederman, Yves A. de Montjoye, and Alex Pentland: “Today’s age of data holds high potential to enhance the way we pursue and monitor progress in the fields of development and humanitarian action. We study the relation between data utility and privacy risk in large-scale behavioral data, focusing on mobile phone metadata as a paradigmatic domain. To measure utility, we survey experts about the value of mobile phone metadata at various spatial and temporal granularity levels. To measure privacy, we propose a formal and intuitive measure of reidentification risk, the information ratio, and compute it at each granularity level. Our results confirm the existence of a stark tradeoff between data utility and reidentifiability, where the most valuable datasets are also most prone to reidentification. When data is specified at ZIP-code and hourly levels, outside knowledge of only 7% of a person’s data suffices for reidentification and retrieval of the remaining 93%. In contrast, in the least valuable dataset, specified at municipality and daily levels, reidentification requires on average outside knowledge of 51%, or 31 data points, of a person’s data to retrieve the remaining 49%. Overall, our findings show that coarsening data directly erodes its value, and highlight the need for using data coarsening not as a stand-alone mechanism, but in combination with data-sharing models that provide adjustable degrees of accountability and security….(More)”.
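The toy simulation below conveys the intuition behind the tradeoff, not the paper’s information-ratio metric: it estimates what fraction of users are uniquely pinned down by a few known observations at a fine (ZIP-code, hourly) versus a coarse (municipality, daily) granularity. The synthetic traces, geography encoding, and parameter values are all assumptions.

```python
import random

def coarsen(point, spatial, temporal):
    """Coarsen a (zip_code, hour) observation to the requested granularity."""
    zip_code, hour = point
    space = zip_code if spatial == "zip" else zip_code[:2]  # crude "municipality" proxy
    time_ = hour if temporal == "hour" else hour // 24      # collapse hours into days
    return (space, time_)

def unicity(traces, spatial, temporal, n_points=2, trials=500):
    """Fraction of users uniquely identified by n_points known observations
    (a rough stand-in for reidentification risk at a given granularity)."""
    coarse = {u: {coarsen(p, spatial, temporal) for p in pts} for u, pts in traces.items()}
    hits = 0
    for _ in range(trials):
        user = random.choice(list(coarse))
        known = random.sample(sorted(coarse[user]), min(n_points, len(coarse[user])))
        matches = [u for u, pts in coarse.items() if set(known) <= pts]
        hits += (len(matches) == 1)
    return hits / trials

# Hypothetical traces: user -> set of (zip_code, hour-of-week) observations.
random.seed(0)
traces = {u: {(random.choice(["02139", "02140", "10001", "10002"]), random.randrange(0, 24 * 7))
              for _ in range(20)} for u in range(200)}
print("fine granularity  :", unicity(traces, "zip", "hour"))
print("coarse granularity:", unicity(traces, "municipality", "day"))
```

Even on synthetic data the pattern matches the paper’s qualitative finding: fine-grained traces are far easier to single out, and coarsening buys privacy at the cost of the detail that makes the data valuable.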

A roadmap for restoring trust in Big Data


Mark Lawler et al. in the Lancet: “The fallout from the Cambridge Analytica–Facebook scandal marks a significant inflection point in the public’s trust concerning Big Data. The health-science community must use this crisis in confidence to redouble its commitment to talk openly and transparently about benefits and risks, and to act decisively to deliver robust, effective governance frameworks under which personal health data can be responsibly used. Activities such as the Innovative Medicines Initiative’s Big Data for Better Outcomes emphasise how a more granular, data-driven understanding of human diseases, including cancer, could underpin innovative therapeutic intervention.
 Health Data Research UK is developing national research expertise and infrastructure to maximise the value of health data science for the National Health Service and ultimately British citizens.
Comprehensive data analytics are crucial to national programmes such as the US Cancer Moonshot, the UK’s 100 000 Genomes project, and other national genomics programmes. Cancer Core Europe, a research partnership between seven leading European oncology centres, has personal data sharing at its core. The Global Alliance for Genomics and Health recently highlighted the need for a global cancer knowledge network to drive evidence-based solutions for a disease that kills more than 8·7 million citizens annually worldwide. These activities risk being fatally undermined by the recent data-harvesting controversy.
We need to restore the public’s trust in data science and emphasise its positive contribution in addressing global health and societal challenges. An opportunity to affirm the value of data science in Europe was afforded by Digital Day 2018, which took place on April 10, 2018, in Brussels, and where European Health Ministers signed a declaration of support to link existing or future genomic databanks across the EU, through the Million European Genomes Alliance.
So how do we address evolving challenges in analysis, sharing, and storage of information, ensure transparency and confidentiality, and restore public trust? We must articulate a clear Social Contract, where citizens (as data donors) are at the heart of decision-making. We need to demonstrate integrity, honesty, and transparency as to what happens to data and what level of control people can, or cannot, expect. We must embed ethical rigour in all our data-driven processes. The Framework for Responsible Sharing of Genomic and Health Related Data represents a practical global approach, promoting effective and ethical sharing and use of research or patient data, while safeguarding individual privacy through secure and accountable data transfer…(More)”.