The Complexities of Differential Privacy for Survey Data


Paper by Jörg Drechsler & James Bailie: “The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies; and the imputation of missing values. We summarize the project’s key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies…(More)”.
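The abstract stops short of formalism, but the DP building block it rests on, and the sampling amplification it says is limited for complex designs, can be sketched in a few lines. This is a generic illustration, not code from the paper; the function names and the counting-query example are assumptions of mine.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    # A counting query has sensitivity 1 (adding or removing one
    # respondent changes the count by at most 1), so Laplace noise
    # with scale 1/epsilon gives an epsilon-DP release.
    return true_count + laplace_noise(1.0 / epsilon)

def amplified_epsilon(epsilon: float, q: float) -> float:
    # Classic amplification-by-subsampling bound: running an
    # epsilon-DP mechanism on a uniform random q-fraction sample is
    # ln(1 + q*(e^epsilon - 1))-DP. One of the paper's points is that
    # complex survey designs (stratification, clustering, unequal
    # inclusion probabilities) do not enjoy this bound in its simple form.
    return math.log(1.0 + q * (math.exp(epsilon) - 1.0))
```

For small sampling fractions, `amplified_epsilon(eps, q)` is roughly `q * eps`, which is why simple random subsampling is such a powerful privacy lever, and why its limited applicability to real survey designs matters.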

Align or fail: How economics shape successful data sharing


Blog by Federico Bartolomucci: “…The conceptual distinctions between different data sharing models are mostly based on one fundamental element: the economic nature of data and its value. 

Open data projects operate under the assumption that data is a non-rival asset (i.e. it can be used by multiple people at the same time) and a non-excludable one (i.e. anyone can use it, similar to a public good like roads or the air we breathe). This means that data can be shared with everyone, for any use, without losing its market and competitive value. The Humanitarian Data Exchange platform is a great example: it allows organizations to share over 19,000 open data sets covering all aspects of humanitarian response.

Data collaboratives treat data as an excludable asset from which some people may be barred (i.e. a ‘club good’, like a movie theater) and therefore share it only among a restricted pool of actors. At the same time, they overcome the rival nature of the data by linking its use to a specific purpose. These work best by giving the actors a voice in choosing the purpose for which the data will be used, and through specific agreements and governance bodies that ensure that those contributing data will not have their competitive position harmed, therefore incentivizing them to engage. A good example of this is the California Data Collaborative, which uses data from different actors in the water sector to develop high-level analysis on water distribution to guide policy, planning, and operations for water districts in the state of California. 

Data ecosystems work by activating market mechanisms around data exchange to overcome reluctance to share data, rather than relying solely on its purpose of use. This means that actors can choose to share their data in exchange for compensation, be it monetary or in alternate forms such as other data. In this way, the compensation balances the potential loss of competitive advantage created by the sharing of a rival asset, as well as the costs and risks of sharing. The Enershare initiative aims to establish a marketplace utilizing blockchain and smart contracts to facilitate data exchange in the energy sector. The platform is based on a compensation system, which can be non-monetary, for exchanging assets and resources related to data (such as datasets, algorithms, and models) with energy assets and services (like heating system maintenance or the transfer of surplus locally self-produced energy).

These different models of data sharing have different operational implications…(More)”.

Private sector trust in data sharing: enablers in the European Union


Paper by Jaime Bernal: “Enabling private sector trust stands as a critical policy challenge for the success of the EU Data Governance Act and Data Act in promoting data sharing to address societal challenges. This paper attributes the widespread trust deficit to the unmanageable uncertainty that arises from businesses’ limited usage control to protect their interests in the face of unacceptable perceived risks. For example, a firm may hesitate to share its data with others in case it is leaked and falls into the hands of business competitors. To illustrate this impasse, competition, privacy, and reputational risks are introduced, respectively, in the context of three suboptimal approaches to data sharing: data marketplaces, data collaboratives, and data philanthropy. The paper proceeds by analyzing seven trust-enabling mechanisms comprising technological, legal, and organizational elements to balance trust, risk, and control and assessing their capacity to operate in a fair, equitable, and transparent manner. Finally, the paper examines the regulatory context in the EU and the advantages and limitations of voluntary and mandatory data sharing, concluding that an approach that effectively balances the two should be pursued…(More)”.

Collaboration in Healthcare: Implications of Data Sharing for Secondary Use in the European Union


Paper by Fanni Kertesz: “The European healthcare sector is transforming toward patient-centred and value-based healthcare delivery. The European Health Data Space (EHDS) Regulation aims to unlock the potential of health data by establishing a single market for its primary and secondary use. This paper examines the legal challenges associated with the secondary use of health data within the EHDS and offers recommendations for improvement. Key issues include the compatibility between the EHDS and the General Data Protection Regulation (GDPR), barriers to cross-border data sharing, and intellectual property concerns. Resolving these challenges is essential for realising the full potential of health data and advancing healthcare research and innovation within the EU…(More)”.

On Slicks and Satellites: An Open Source Guide to Marine Oil Spill Detection


Article by Wim Zwijnenburg: “The sheer scale of ocean oil pollution is staggering. In Europe, a suspected 3,000 major illegal oil dumps take place annually, with an estimated release of between 15,000 and 60,000 tonnes of oil ending up in the North Sea. In the Mediterranean, figures provided by the Regional Marine Pollution Emergency Response Centre estimate there are 1,500 to 2,000 oil spills every year.

The impact of any single oil spill on a marine or coastal ecosystem can be devastating and long-lasting. Animals such as birds, turtles, dolphins and otters can suffer from ingesting or inhaling oil, as well as getting stuck in the slick. The loss of water and soil quality can be toxic to both flora and fauna. Heavy metals enter the food chain, poisoning everything from plankton to shellfish, which in turn affects the livelihoods of coastal communities dependent on fishing and tourism.

However, with a wealth of open source earth observation tools at our fingertips, during such environmental disasters it’s possible for us to identify and monitor these spills, highlight at-risk areas, and even hold perpetrators accountable. …

There are several different types of remote sensing sensors we can use for collecting data about the Earth’s surface. In this article we’ll focus on two: optical and radar sensors. 

Optical imagery captures the broad light spectrum reflected from the Earth, also known as passive remote sensing. In contrast, Synthetic Aperture Radar (SAR) uses active remote sensing, sending radio waves down to the Earth’s surface and capturing them as they are reflected back. Any change in the reflection can indicate a change on the ground, which can then be investigated. For more background, see Bellingcat contributor Ollie Ballinger’s Remote Sensing for OSINT Guide…(More)”.
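As a toy illustration of how slicks show up in SAR data: oil damps the sea’s capillary waves, so a slick returns less radar energy and appears as a dark patch. A first-pass detector is just a threshold on backscatter. The scene values and the -22 dB cutoff below are invented for illustration; real pipelines add speckle filtering and discrimination of look-alikes before calling a dark patch oil.

```python
def dark_spot_mask(backscatter_db, threshold_db=-22.0):
    # Oil slicks damp capillary waves and return less radar energy,
    # so they appear as low-backscatter "dark spots" in SAR imagery.
    # Thresholding is only a first pass: look-alikes such as calm
    # water or algal blooms must be ruled out afterwards.
    return [[value < threshold_db for value in row] for row in backscatter_db]

# Tiny synthetic scene (dB): open water near -15, a slick near -25.
scene = [
    [-15.0, -14.0, -16.0],
    [-15.0, -25.0, -24.0],
    [-16.0, -26.0, -15.0],
]
mask = dark_spot_mask(scene)
print(sum(cell for row in mask for cell in row))  # 3 pixels flagged
```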

Using internet search data as part of medical research


Blog by Susan Thomas and Matthew Thompson: “…In the UK, almost 50 million health-related searches are made using Google per year. Globally there are hundreds of millions of health-related searches every day. And, of course, people are doing these searches in real-time, looking for answers to their concerns in the moment. It’s also possible that, even if people aren’t noticing and searching about changes to their health, their behaviour is changing. Maybe they are searching more at night because they are having difficulty sleeping, or maybe they are spending more (or less) time online. Maybe an individual’s search history could actually be really useful for researchers. This realisation has led medical researchers to start to explore whether individuals’ online search activity could help provide those subtle, almost unnoticeable signals that point to the beginning of a serious illness.

Our recent review found that 23 studies published so far have done exactly this. These studies suggest that online search activity among people later diagnosed with a variety of conditions, ranging from pancreatic cancer and stroke to mood disorders, differed from that of people who did not have one of these conditions.

One of these studies was published by researchers at Imperial College London, who used online search activity to identify signals of women with gynaecological malignancies. They found that women with malignant (e.g. ovarian cancer) and benign conditions had different search patterns, up to two months prior to a GP referral. 

Pause for a moment, and think about what this could mean. Ovarian cancer is one of the most devastating cancers women get. It’s desperately hard to detect early – and yet there are signals of this cancer visible in women’s internet searches months before diagnosis?…(More)”.

Data sovereignty for local governments. Considerations and enablers


Report by JRC: “Data sovereignty for local governments refers to a capacity to control and/or access data, and to foster a digital transformation aligned with societal values and EU Commission political priorities. Data sovereignty clauses are an instrument that local governments may use to compel companies to share data of public interest. Albeit promising, little is known about the peculiarities of this instrument and how it has been implemented so far. This policy brief aims at filling the gap by systematising existing knowledge and providing policy-relevant recommendations for its wider implementation…(More)”.

Community consent: neither a ceiling nor a floor


Article by Jasmine McNealy: “The 23andMe breach and the Golden State Killer case are two of the more “flashy” cases, but questions of consent, especially the consent of all of those affected by biodata collection and analysis in more mundane or routine health and medical research projects, are just as important. The communities of people affected have expectations about their privacy and the possible impacts of inferences that could be made about them in data processing systems. Researchers must, then, acquire community consent when attempting to work with networked biodata. 

Several benefits of community consent exist, especially for marginalized and vulnerable populations. These benefits include:

  • Ensuring that information about the research project spreads throughout the community,
  • Removing potential barriers that might be created by resistance from community members,
  • Alleviating the possible concerns of individuals about the perspectives of community leaders, and 
  • Allowing the recruitment of participants using methods most salient to the community.

But community consent does not replace individual consent and limits exist for both community and individual consent. Therefore, within the context of a biorepository, understanding whether community consent might be a ceiling or a floor requires examining governance and autonomy…(More)”.

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
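The Robots Exclusion Protocol the study measures is simple enough to demonstrate directly with Python’s standard-library `urllib.robotparser`. The policy below is the kind the study observed spreading: AI crawlers singled out and blocked while other agents remain allowed. The GPTBot token and the URLs are illustrative, not taken from the article.

```python
from urllib import robotparser

# A robots.txt policy that blocks a named AI crawler but allows others.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))          # False
print(parser.can_fetch("ResearchCrawler", "https://example.com/article"))  # True
```

Note that robots.txt is advisory rather than enforced: compliance depends on crawlers choosing to honor it, which is why publishers pair it with the terms-of-service restrictions the study also counted.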

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.

Exploring Digital Biomarkers for Depression Using Mobile Technology


Paper by Yuezhou Zhang et al: “With the advent of ubiquitous sensors and mobile technologies, wearables and smartphones offer a cost-effective means for monitoring mental health conditions, particularly depression. These devices enable the continuous collection of behavioral data, providing novel insights into the daily manifestations of depressive symptoms.

We found several significant links between depression severity and various behavioral biomarkers: elevated depression levels were associated with diminished sleep quality (assessed through Fitbit metrics), reduced sociability (approximated by Bluetooth), decreased levels of physical activity (quantified by step counts and GPS data), a slower cadence of daily walking (captured by smartphone accelerometers), and disturbances in circadian rhythms (analyzed across various data streams).

Leveraging digital biomarkers for assessing and continuously monitoring depression introduces a new paradigm in early detection and development of customized intervention strategies. Findings from these studies not only enhance our comprehension of depression in real-world settings but also underscore the potential of mobile technologies in the prevention and management of mental health issues…(More)”.