Anonymization: The imperfect science of using data while preserving privacy


Paper by Andrea Gadotti et al: “Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today…(More)”.
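
The differential privacy the review discusses has a simple canonical form. As a minimal illustration (not the paper's implementation, and with an invented toy dataset), the sketch below adds Laplace noise, scaled to a query's sensitivity, to a count before release:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release `true_value` with Laplace noise calibrated for
    epsilon-differential privacy on a numeric query."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Toy example: privately release a count. A count changes by at most 1
# when one record is added or removed, so its sensitivity is 1.
ages = [34, 45, 29, 61, 52]
true_count = sum(1 for a in ages if a > 40)
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(private_count)
```

Smaller epsilon means more noise and stronger privacy; that utility-privacy trade-off is precisely why the review concludes no perfect solution exists.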

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
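
Because the Robots Exclusion Protocol is machine-readable, a crawler's permissions can be checked directly. A minimal sketch using Python's standard library (the domain is illustrative; "GPTBot" is one real AI-crawler user-agent token that publishers commonly block):

```python
from urllib.robotparser import RobotFileParser

# Parse a site's robots.txt and ask whether a given crawler
# may fetch a given page.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

allowed = rp.can_fetch("GPTBot", "https://example.com/articles/some-page.html")
print("GPTBot may crawl this page:", allowed)
```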

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.

Governance of deliberative mini-publics: emerging consensus and divergent views


Paper by Lucy J. Parry, Nicole Curato, and John S. Dryzek: “Deliberative mini-publics are forums for citizen deliberation composed of randomly selected citizens convened to yield policy recommendations. These forums have proliferated in recent years but there are no generally accepted standards to govern their practice. Should there be? We answer this question by bringing the scholarly literature on citizen deliberation into dialogue with the lived experience of the people who study, design and implement mini-publics. We use Q methodology to locate five distinct perspectives on the integrity of mini-publics, and map the structure of agreement and dispute across them. We find that, across the five viewpoints, there is emerging consensus as well as divergence on integrity issues, with disagreement over what might be gained or lost by adapting common standards of practice, and possible sources of integrity risks. This article provides an empirical foundation for further discussion on integrity standards in the future…(More)”.
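
Q methodology, which the authors use to locate the five perspectives, factor-analyzes correlations between *people* (their rank-orderings of statements) rather than between variables. A toy sketch of that core step, with invented Q-sort data and an eigendecomposition standing in for the centroid or PCA extraction typical of Q studies:

```python
import numpy as np

# Toy Q-sorts: rows = participants, columns = statements,
# values = forced-distribution rankings (e.g., -3 to +3).
rng = np.random.default_rng(seed=0)
qsorts = rng.integers(-3, 4, size=(12, 20)).astype(float)

# Q methodology correlates people, not variables.
person_corr = np.corrcoef(qsorts)  # 12 x 12 matrix

# Shared viewpoints emerge as factors of the person-correlation matrix.
eigvals, eigvecs = np.linalg.eigh(person_corr)
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order[:5]] * np.sqrt(eigvals[order[:5]])
print(loadings.shape)  # (12 participants, 5 candidate viewpoints)
```

Participants who load heavily on the same factor share a viewpoint; in this study, five such factors structure the agreement and dispute over integrity standards.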

Precision public health in the era of genomics and big data


Paper by Megan C. Roberts et al: “Precision public health (PPH) considers the interplay between genetics, lifestyle and the environment to improve disease prevention, diagnosis and treatment on a population level—thereby delivering the right interventions to the right populations at the right time. In this Review, we explore the concept of PPH as the next generation of public health. We discuss the historical context of using individual-level data in public health interventions and examine recent advancements in how data from human and pathogen genomics and social, behavioral and environmental research, as well as artificial intelligence, have transformed public health. Real-world examples of PPH are discussed, emphasizing how these approaches are becoming a mainstay in public health, as well as outstanding challenges in their development, implementation and sustainability. Data sciences, ethical, legal and social implications research, capacity building, equity research and implementation science will have a crucial role in realizing the potential for ‘precision’ to enhance traditional public health approaches…(More)”.

Integrating Artificial Intelligence into Citizens’ Assemblies: Benefits, Concerns and Future Pathways


Paper by Sammy McKinney: “Interest in how Artificial Intelligence (AI) could be used within citizens’ assemblies (CAs) is emerging amongst scholars and practitioners alike. In this paper, I make four contributions at the intersection of these burgeoning fields. First, I propose an analytical framework to guide evaluations of the benefits and limitations of AI applications in CAs. Second, I map out eleven ways that AI, especially large language models (LLMs), could be used across a CA’s full lifecycle. This introduces novel ideas for AI integration into the literature and synthesises existing proposals to provide the most detailed analytical breakdown of AI applications in CAs to date. Third, drawing on relevant literature, four key informant interviews, and the Global Assembly on the Ecological and Climate crisis as a case study, I apply my analytical framework to assess the desirability of each application. This provides insight into how AI could be deployed to address existing challenges facing CAs today as well as the concerns that arise with AI integration. Fourth, bringing my analyses together, I argue that AI integration into CAs brings the potential to enhance their democratic quality and institutional capacity, but realising this requires the deliberative community to proceed cautiously, effectively navigate challenging trade-offs, and mitigate important concerns that arise with AI integration. Ultimately, this paper provides a foundation that can guide future research concerning AI integration into CAs and other forms of democratic innovation…(More)”.

The Great Scrape: The Clash Between Scraping and Privacy


Paper by Daniel J. Solove and Woodrow Hartzog: “Artificial intelligence (AI) systems depend on massive quantities of data, often gathered by “scraping” – the automated extraction of large amounts of data from the internet. A great deal of scraped data is about people. This personal data provides the grist for AI tools such as facial recognition, deep fakes, and generative AI. Although scraping enables web searching, archival, and meaningful scientific research, scraping for AI can also be objectionable or even harmful to individuals and society.

Organizations are scraping at an escalating pace and scale, even though many privacy laws are seemingly incongruous with the practice. In this Article, we contend that scraping must undergo a serious reckoning with privacy law. Scraping violates nearly all of the key principles in privacy laws, including fairness; individual rights and control; transparency; consent; purpose specification and secondary use restrictions; data minimization; onward transfer; and data security. With scraping, data protection laws built around these requirements are ignored.

Scraping has evaded a reckoning with privacy law largely because scrapers act as if all publicly available data were free for the taking. But the public availability of scraped data shouldn’t give scrapers a free pass. Privacy law regularly protects publicly available data, and privacy principles are implicated even when personal data is accessible to others.

This Article explores the fundamental tension between scraping and privacy law. With the zealous pursuit and astronomical growth of AI, we are in the midst of what we call the “great scrape.” There must now be a great reconciliation…(More)”.

(Almost) 200 Years of News-Based Economic Sentiment


Paper by Jules H. van Binsbergen, Svetlana Bryzgalova, Mayukh Mukhopadhyay & Varun Sharma: “Using text from 200 million pages of 13,000 US local newspapers and machine learning methods, we construct a 170-year-long measure of economic sentiment at the country and state levels that expands existing measures in both the time series (by more than a century) and the cross-section. Our measure predicts GDP (both nationally and locally), consumption, and employment growth, even after controlling for commonly used predictors, as well as monetary policy decisions. Our measure is distinct from the information in expert forecasts and leads its consensus value. Interestingly, news coverage has become increasingly negative across all states in the past half-century…(More)”.
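
As a schematic of how such a news-sentiment index can feed a forecasting exercise (the lexicon, figures, and single-predictor regression below are invented stand-ins; the paper itself applies machine learning methods to 200 million newspaper pages):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

POSITIVE = {"growth", "boom", "recovery", "prosperity"}
NEGATIVE = {"recession", "panic", "unemployment", "crisis"}

def article_sentiment(text: str) -> float:
    """Crude lexicon score: (positive - negative) mentions per token."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

# Toy panel: yearly average news sentiment vs. next-year GDP growth (%).
sentiment = np.array([[0.02], [-0.01], [0.03], [-0.04], [0.01]])
gdp_growth = np.array([2.1, 0.4, 2.8, -1.2, 1.5])

model = LinearRegression().fit(sentiment, gdp_growth)
print("slope on sentiment:", model.coef_[0])
```

The paper's claim is stronger than this sketch: its measure retains predictive power even after controlling for standard predictors and expert consensus forecasts.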

The Collaboverse: A Collaborative Data-Sharing and Speech Analysis Platform


Paper by Justin D. Dvorak and Frank R. Boutsen: “Collaboration in the field of speech-language pathology occurs across a variety of digital devices and can entail the use of multiple software tools, systems, file formats, and even programming languages. Unfortunately, gaps between the laboratory, clinic, and classroom can emerge in part because of siloing of data and workflows, as well as the digital divide between users. The purpose of this tutorial is to present the Collaboverse, a web-based collaborative system that unifies these domains, and describe the application of this tool to common tasks in speech-language pathology. In addition, we demonstrate its utility in machine learning (ML) applications…

This tutorial outlines key concepts in the digital divide, data management, distributed computing, and ML. It introduces the Collaboverse workspace for researchers, clinicians, and educators in speech-language pathology who wish to improve their collaborative network and leverage advanced computation abilities. It also details an ML approach to prosodic analysis….

The Collaboverse shows promise in narrowing the digital divide and is capable of generating clinically relevant data, specifically in the area of prosody, whose computational complexity has limited widespread analysis in research and clinic alike. In addition, it includes an augmentative and alternative communication app allowing visual, nontextual communication…(More)”.
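
Prosody, the computationally complex target the tutorial highlights, is typically reduced to fundamental-frequency (F0) statistics before any ML step. A hedged sketch of such feature extraction using librosa (the file path is illustrative, and this is a generic pipeline, not the Collaboverse's own implementation):

```python
import numpy as np
import librosa

# Load an utterance (path is illustrative).
y, sr = librosa.load("utterance.wav", sr=16000)

# Track fundamental frequency with probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Summary prosodic features of the kind an ML model might consume.
f0_voiced = f0[~np.isnan(f0)]
features = {
    "f0_mean_hz": float(np.mean(f0_voiced)),
    "f0_range_hz": float(np.ptp(f0_voiced)),
    "f0_std_hz": float(np.std(f0_voiced)),
    "voiced_ratio": float(np.mean(voiced_flag)),
}
print(features)
```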

Finding, distinguishing, and understanding overlooked policy entrepreneurs


Paper by Gwen Arnold, Meghan Klasic, Changtong Wu, Madeline Schomburg & Abigail York: “Scholars have spent decades arguing that policy entrepreneurs, change agents who work individually and in groups to influence the policy process, can be crucial in introducing policy innovation and spurring policy change. How to identify policy entrepreneurs empirically has received less attention. This oversight is consequential because scholars trying to understand when policy entrepreneurs emerge, and why, and what makes them more or less successful, need to be able to identify these change agents reliably and accurately. This paper explores the ways policy entrepreneurs are currently identified and highlights issues with current approaches. We introduce a new technique for eliciting and distinguishing policy entrepreneurs, coupling automated and manual analysis of local news media and a survey of policy entrepreneur candidates. We apply this technique to the empirical case of unconventional oil and gas drilling in Pennsylvania and derive some tentative results concerning factors that increase entrepreneurial efficacy…(More)”.
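
The automated half of the authors' elicitation technique, scanning local news coverage for recurring advocates, can be sketched with off-the-shelf named-entity recognition (the snippets, mention threshold, and model choice here are illustrative, not the study's actual pipeline):

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative snippets of local news coverage.
articles = [
    "Jane Doe urged the council to adopt stricter drilling setbacks.",
    "At the hearing, Jane Doe and John Smith pressed for new rules.",
]

# Count how often each person is named across the coverage.
mentions = Counter()
for doc in nlp.pipe(articles):
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            mentions[ent.text] += 1

# Repeatedly named advocates become candidates for the follow-up survey.
candidates = [name for name, n in mentions.items() if n >= 2]
print(candidates)
```

In the paper's design, this automated pass only generates candidates; manual coding and the survey then distinguish genuine policy entrepreneurs from merely prominent figures.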

Protecting Policy Space for Indigenous Data Sovereignty Under International Digital Trade Law


Paper by Andrew D. Mitchell and Theo Samlidis: “The impact of economic agreements on Indigenous peoples’ broader rights and interests has been subject to ongoing scrutiny. Technological developments and an increasing emphasis on Indigenous sovereignty within the digital domain have given rise to a global Indigenous data sovereignty movement, surfacing concerns about how international economic law impacts Indigenous peoples’ sovereignty over their data. This Article examines the policy space certain governments have reserved under international economic agreements to introduce measures for protecting Indigenous data or digital sovereignty (IDS). We argue that treaty countries have secured, under recent international digital trade chapters and agreements, the benefits of a comprehensive economic treaty and sufficient regulatory autonomy to protect Indigenous data sovereignty…(More)”.