Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation


Paper by Khaled El Emam et al: “There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them.

Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.

Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data….(More)”.
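As a rough reading aid for the figures quoted above, the sketch below checks the two synthetic-data risk values against the 0.09 threshold and back-computes the approximate original-data risk implied by the "4 times and 5 times lower" statement. The dataset labels and function names are illustrative assumptions, and because the multipliers are rounded in the abstract, the implied values are estimates rather than figures reported in the excerpt.

```python
import math

# Figures quoted in the abstract above; the dictionary keys are shorthand labels.
RISK_THRESHOLD = 0.09
SYNTHETIC_RISK = {"washington_discharge": 0.0198, "canada_covid19": 0.0086}
# "4 times and 5 times lower": rounded multipliers, so results are approximate.
REDUCTION_FACTOR = {"washington_discharge": 4, "canada_covid19": 5}


def below_threshold(risk, threshold=RISK_THRESHOLD):
    """Return True when a risk value is under the stated threshold."""
    return risk < threshold


def implied_original_risk(name):
    """Estimate the original-data risk as synthetic risk times the quoted factor."""
    return SYNTHETIC_RISK[name] * REDUCTION_FACTOR[name]


for name, risk in SYNTHETIC_RISK.items():
    print(name, below_threshold(risk), round(implied_original_risk(name), 4))
```

This is only arithmetic on the published summary numbers; it does not reproduce the paper's risk model itself.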

How the U.S. Military Buys Location Data from Ordinary Apps


Joseph Cox at Vice: “The U.S. military is buying the granular movement data of people around the world, harvested from innocuous-seeming apps, Motherboard has learned. The most popular app among a group Motherboard analyzed connected to this sort of data sale is a Muslim prayer and Quran app that has more than 98 million downloads worldwide. Others include a Muslim dating app, a popular Craigslist app, an app for following storms, and a “level” app that can be used to help, for example, install shelves in a bedroom.

Through public records, interviews with developers, and technical analysis, Motherboard uncovered two separate, parallel data streams that the U.S. military uses, or has used, to obtain location data. One relies on a company called Babel Street, which creates a product called Locate X. U.S. Special Operations Command (USSOCOM), a branch of the military tasked with counterterrorism, counterinsurgency, and special reconnaissance, bought access to Locate X to assist on overseas special forces operations. The other stream is through a company called X-Mode, which obtains location data directly from apps, then sells that data to contractors, and by extension, the military.

The news highlights the opaque location data industry and the fact that the U.S. military, which has infamously used other location data to target drone strikes, is purchasing access to sensitive data. Many of the users of apps involved in the data supply chain are Muslim, which is notable considering that the United States has waged a decades-long war on predominantly Muslim terror groups in the Middle East, and has killed hundreds of thousands of civilians during its military operations in Pakistan, Afghanistan, and Iraq. Motherboard does not know of any specific operations in which this type of app-based location data has been used by the U.S. military.

The apps sending data to X-Mode include Muslim Pro, an app that reminds users when to pray and what direction Mecca is in relation to the user’s current location. The app has been downloaded over 50 million times on Android, according to the Google Play Store, and over 98 million in total across other platforms including iOS, according to Muslim Pro’s website….(More)”.

The responsible use of data for and about children: treading carefully and ethically


Q&A with Stefaan G. Verhulst and Andrew Young: “…working in collaboration with UNICEF on an initiative called Responsible Data for Children (RD4C). Its focus is on data – the risks it poses to children, as well as the opportunities it offers.

You have been working with UNICEF on the Responsible Data for Children initiative (RD4C). What is this and why do we need to be talking more about ‘responsible data’?

To date, the relationship between the datafication of everyday life and child welfare has been under-explored, both by researchers in data ethics and those who work to advance the rights of children. This neglect is a lost opportunity, and also poses a risk to children.

Today’s children are the first generation to grow up amid the rapid datafication of virtually every aspect of social, cultural, political and economic life. This alone calls for greater scrutiny of the role played by data. An entire generation is being datafied, often starting before birth. Every year the average child will have more data collected about them in their lifetime than would a similar child born any year prior. Ironically, humanitarian and development organizations working with children are themselves among the key actors contributing to the increased collection of data. These organizations rely on a wide range of technologies, including biometrics, digital identity systems, remote-sensing technologies, mobile and social media messaging apps, and administrative data systems. The data generated by these tools and platforms inevitably includes potentially sensitive PII data (personally identifiable information) and DII data (demographically identifiable information). All of this begs much closer scrutiny, and a more systematic framework to guide how child-related data is collected, stored, and used.

Towards this aim, we have also been working with the Data for Children Collaborative, based in Edinburgh, in establishing innovative and ethical practices around the use of data to improve the lives of children worldwide….(More)”.

Federated Learning for Privacy-Preserving Data Access


Paper by Małgorzata Śmietanka, Hirsh Pithadia and Philip Treleaven: “Federated learning is a pioneering privacy-preserving data technology and also a new machine learning model trained on distributed data sets.

Companies collect huge amounts of historic and real-time data to drive their business and collaborate with other organisations. However, data privacy is becoming increasingly important because of regulations (e.g. EU GDPR) and the need to protect their sensitive and personal data. Companies need to manage data access: firstly within their organizations (so they can control staff access), and secondly protecting raw data when collaborating with third parties. What is more, companies are increasingly looking to ‘monetize’ the data they’ve collected. However, under new legislation, utilising data by different organizations is becoming increasingly difficult (Yu, 2016).

Federated learning, pioneered by Google, is the emerging privacy-preserving data technology and also a new class of distributed machine learning models. This paper discusses federated learning as a solution for privacy-preserving data access and distributed machine learning applied to distributed data sets. It also presents a privacy-preserving federated learning infrastructure….(More)”.
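To make the idea of training on distributed data sets concrete, here is a minimal sketch of federated averaging (FedAvg), a widely used federated learning scheme. It is not the infrastructure the paper presents, which the excerpt does not describe. The toy model (a single weight `w` fitting y ≈ w·x), the client data, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Each client holds its own private (x, y) pairs; these never leave the client.
Client = List[Tuple[float, float]]


def local_update(w: float, data: Client, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for y ≈ w * x on one client's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w_global: float, clients: List[Client], rounds: int = 50) -> float:
    """Server loop: send the weight out, then average the returned weights."""
    total = sum(len(c) for c in clients)
    for _ in range(rounds):
        local_weights = [local_update(w_global, c) for c in clients]
        # Weight each client's result by how many samples it trained on.
        w_global = sum(w * len(c) for w, c in zip(local_weights, clients)) / total
    return w_global


# Hypothetical toy data: both clients follow y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
]
w = federated_average(0.0, clients)
print(round(w, 4))
```

In this sketch only the model weight travels between the clients and the server; the raw records stay where they were collected, which is the property the abstract highlights.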

Not fit for Purpose: A critical analysis of the ‘Five Safes’


Paper by Chris Culnane, Benjamin I. P. Rubinstein, and David Watts: “Adopted by government agencies in Australia, New Zealand, and the UK as policy instrument or as embodied into legislation, the ‘Five Safes’ framework aims to manage risks of releasing data derived from personal information. Despite its popularity, the Five Safes has undergone little legal or technical critical analysis. We argue that the Five Safes is fundamentally flawed: from being disconnected from existing legal protections and appropriation of notions of safety without providing any means to prefer strong technical measures, to viewing disclosure risk as static through time and not requiring repeat assessment. The Five Safes provides little confidence that resulting data sharing is performed using ‘safety’ best practice or for purposes in service of public interest….(More)”.

Third Wave of Open Data


Paper (and site) by Stefaan G. Verhulst, Andrew Young, Andrew J. Zahuranec, Susan Ariel Aaronson, Ania Calderon, and Matt Gee on “How To Accelerate the Re-Use of Data for Public Interest Purposes While Ensuring Data Rights and Community Flourishing”: “The paper begins with a description of earlier waves of open data. Emerging from freedom of information laws adopted over the last half century, the First Wave of Open Data brought about newfound transparency, albeit one only available on request to an audience largely composed of journalists, lawyers, and activists. 

The Second Wave of Open Data, seeking to go beyond access to public records and inspired by the open source movement, called upon national governments to make their data open by default. Yet, this approach too had its limitations, leaving many data silos at the subnational level and in the private sector untouched.

The Third Wave of Open Data seeks to build on earlier successes and take into account lessons learned to help open data realize its transformative potential. Incorporating insights from various data experts, the paper describes the emergence of a Third Wave driven by the following goals:

  1. Publishing with Purpose by matching the supply of data with the demand for it, providing assets that match public interests;
  2. Fostering Partnerships and Data Collaboration by forging relationships with community-based organizations, NGOs, small businesses, local governments, and others who understand how data can be translated into meaningful real-world action;
  3. Advancing Open Data at the Subnational Level by providing resources to cities, municipalities, states, and provinces to address the lack of subnational information in many regions;
  4. Prioritizing Data Responsibility and Data Rights by understanding the risks of using (and not using) data to promote and preserve the public’s general welfare.

Riding the Wave

Achieving these goals will not be an easy task and will require investments and interventions across the data ecosystem. The paper highlights eight actions that decision and policy makers can take to foster more equitable, impactful benefits… (More) (PDF)”.

Consumer Reports Study Finds Marketplace Demand for Privacy and Security


Press Release: “American consumers are increasingly concerned about privacy and data security when purchasing new products and services, which may be a competitive advantage to companies that take action towards these consumer values, a new Consumer Reports study finds. 

The new study, “Privacy Front and Center” from CR’s Digital Lab with support from Omidyar Network, looks at the commercial benefits for companies that differentiate their products based on privacy and data security. The study draws from a nationally representative CR survey of 5,085 adult U.S. residents conducted in February 2020, a meta-analysis of 25 years of public opinion studies, and a conjoint analysis that seeks to quantify how consumers weigh privacy and security in their hardware and software purchasing decisions. 

“This study shows that raising the standard for privacy and security is a win-win for consumers and the companies,” said Ben Moskowitz, the director of the Digital Lab at Consumer Reports. “Given the rapid proliferation of internet connected devices, the rise in data breaches and cyber attacks, and the demand from consumers for heightened privacy and security measures, there’s an undeniable business case for companies to invest in creating more private and secure products.” 

Here are some of the key findings from the study:

  • According to CR’s February 2020 nationally representative survey, 74% of consumers are at least moderately concerned about the privacy of their personal data.
  • Nearly all Americans (96%) agree that more should be done to ensure that companies protect the privacy of consumers.
  • A majority of smart product owners (62%) worry about potential loss of privacy when buying them for their home or family.
  • The privacy/security conscious consumer class seems to include more men and people of color.
  • Experiencing a data breach correlates with a higher willingness to pay for privacy, and 30% of Americans have experienced one.
  • Of the Android users who switched to iPhones, 32% indicated doing so because of Apple’s perceived privacy or security benefits relative to Android….(More)”.

Responsible group data for children


Issue Brief by Andrew Young: “Understanding how and why group data is collected and what can be done to protect children’s rights…While the data protection field largely focuses on individual data harms, it is a focus that obfuscates and exacerbates the risks of data that could put groups of people at risk, such as the residents of a particular village, rather than individuals.

Though not well-represented in the current responsible data literature and policy domains writ large, the challenges group data poses are immense. Moreover, the unique and amplified group data risks facing children are even less scrutinized and understood.

To achieve Responsible Data for Children (RD4C) and ensure effective and legitimate governance of children’s data, government policymakers, data practitioners, and institutional decision makers need to ensure children’s group data are a core consideration in all relevant policies, procedures, and practices….(More)”. (See also Responsible Data for Children).

The Cruel New Era of Data-Driven Deportation


Article by Alvaro M. Bedoya: “For a long time, mass deportations were a small-data affair, driven by tips, one-off investigations, or animus-driven hunches. But beginning under George W. Bush, and expanding under Barack Obama, ICE leadership started to reap the benefits of Big Data. The centerpiece of that shift was the “Secure Communities” program, which gathered the fingerprints of arrestees at local and state jails across the nation and compared them with immigration records. That program quickly became a major driver for interior deportations. But ICE wanted more data. The agency had long tapped into driver address records through law enforcement networks. Eyeing the breadth of DMV databases, agents began to ask state officials to run face recognition searches on driver photos against the photos of undocumented people. In Utah, for example, ICE officers requested hundreds of face searches starting in late 2015. Many immigrants avoid contact with any government agency, even the DMV, but they can’t go without heat, electricity, or water; ICE aimed to find them, too. So, that same year, ICE paid for access to a private database that includes the addresses of customers from 80 national and regional electric, cable, gas, and telephone companies.

Amid this bonanza, at least, the Obama administration still acknowledged red lines. Some data were too invasive, some uses too immoral. Under Donald Trump, these limits fell away.

In 2017, breaking with prior practice, ICE started to use data from interviews with scared, detained kids and their relatives to find and arrest more than 500 sponsors who stepped forward to take in the children. At the same time, ICE announced a plan for a social media monitoring program that would use artificial intelligence to automatically flag 10,000 people per month for deportation investigations. (It was scuttled only when computer scientists helpfully indicated that the proposed system was impossible.) The next year, ICE secured access to 5 billion license plate scans from public parking lots and roadways, a hoard that tracks the drives of 60 percent of Americans—an initiative blocked by Department of Homeland Security leadership four years earlier. In August, the agency cut a deal with Clearview AI, whose technology identifies people by comparing their faces not to millions of driver photos, but to 3 billion images from social media and other sites. This is a new era of immigrant surveillance: ICE has transformed from an agency that tracks some people sometimes to an agency that can track anyone at any time….(More)”.

Ethical Challenges and Opportunities Associated With the Ability to Perform Medical Screening From Interactions With Search Engines


Viewpoint by Elad Yom-Tov and Yuval Cherlow: “Recent research has shown the efficacy of screening for serious medical conditions from data collected while people interact with online services. In particular, queries to search engines and the interactions with them were shown to be advantageous for screening a range of conditions including diabetes, several forms of cancer, eating disorders, and depression. These screening abilities offer unique advantages in that they can serve broad strata of society, including people in underserved populations and in countries with poor access to medical services. However, these advantages need to be balanced against the potential harm to privacy, autonomy, and nonmaleficence, which are recognized as the cornerstones of ethical medical care. Here, we discuss these opportunities and challenges, both when collecting data to develop online screening services and when deploying them. We offer several solutions that balance the advantages of these services with the ethical challenges they pose….(More)”.