Rejecting Public Utility Data Monopolies


Paper by Amy L. Stein: “The threat of monopoly power looms large today. Although not the telecommunications and tobacco monopolies of old, the Goliaths of Big Tech have become today’s target for potential antitrust violations. It is not only their control over the social media infrastructure and digital advertising technologies that gives people pause, but their monopolistic collection, use, and sale of customer data. But large technology companies are not the only private companies that have exclusive access to your data; that can crowd out competitors; and that can hold, use, or sell your data with little to no regulation. These other private companies are not data companies, platforms, or even brokers. They are public utilities.

Although termed “public utilities,” these entities are overwhelmingly private, shareholder-owned entities. Like private Big Tech, utilities gather incredible amounts of data from customers and use this data in various ways. And like private Big Tech, these utilities can exercise exclusionary and self-dealing anticompetitive behavior with respect to customer data. But there is one critical difference: unlike Big Tech, utilities enjoy an implied immunity from antitrust laws. This state action immunity has historically applied to utility provision of essential services like electricity and heat. As utilities find themselves in the position of unsuspecting data stewards, however, there is a real and unexplored question about whether their long-enjoyed antitrust immunity should extend to their data practices.

As the first exploration of this question, this Article tests the continuing application and rationale of the state action immunity doctrine to the evolving services that a utility provides as the grid becomes digitized. It demonstrates the importance of staunching the creep of state action immunity over utility data practices. And it recognizes the challenges of developing remedies for such data practices that do not disrupt the state-sanctioned monopoly powers of utilities over the provision of essential services. This Article analyzes both antitrust and regulatory remedies, including a new customer-focused “data duty,” as possible mechanisms to enhance consumer (ratepayer) welfare in this space. Exposing utility data practices to potential antitrust liability may be just the lever that is needed to motivate states, public utility commissions, and utilities to develop a more robust marketplace for energy data…(More)”.

Generative Discrimination: What Happens When Generative AI Exhibits Bias, and What Can Be Done About It


Paper by Philipp Hacker, Frederik Zuiderveen Borgesius, Brent Mittelstadt and Sandra Wachter: “Generative AI (genAI) technologies, while beneficial, risk increasing discrimination by producing demeaning content and subtle biases through inadequate representation of protected groups. This chapter examines these issues, categorizing problematic outputs into three legal categories: discriminatory content; harassment; and legally hard cases like harmful stereotypes. It argues for holding genAI providers and deployers liable for discriminatory outputs and highlights the inadequacy of traditional legal frameworks to address genAI-specific issues. The chapter suggests updating EU laws to mitigate biases in training and input data, mandating testing and auditing, and evolving legislation to enforce standards for bias mitigation and inclusivity as technology advances…(More)”.

The problem of ‘model collapse’: how a lack of human data limits AI progress


Article by Michael Peel: “The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology. 

Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems to then also train large language models (LLMs) — as they reach the limits of human-made material that can improve the cutting-edge technology.

Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. 

The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions of what will happen once those finite sources are exhausted. 

“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”

The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. 

The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish…(More)”.
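
The “loss of variance” described in the article can be illustrated with a toy simulation. The sketch below is not the experiment from the Nature paper. It assumes a one-dimensional Gaussian “model” that is refit, generation after generation, only to samples drawn from its predecessor. The function name and all parameters are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_generations(n_generations=100, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples and track the fitted spread.

    Toy assumption: each generation sees only synthetic data produced by
    the previous generation, never the original "human" data.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "real" distribution
    history = [sigma]
    for _ in range(n_generations):
        # Draw a small synthetic training set from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood mean and std.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spreads = simulate_generations()
    print(f"initial std: {spreads[0]:.4f}")
    print(f"final std:   {spreads[-1]:.4g}")
```

Because each refit is based on a finite sample, estimation errors compound and the fitted spread tends to shrink over generations, so the tails of the original distribution are the first thing to disappear. The exact numbers depend on the seed and sample size.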

Anonymization: The imperfect science of using data while preserving privacy


Paper by Andrea Gadotti et al: “Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today…(More)”.

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.
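
As a rough illustration of how the Robots Exclusion Protocol mentioned in the article works in practice, the sketch below parses a small robots.txt file with Python's standard-library `urllib.robotparser` and asks whether a given crawler may fetch a URL. The crawler name `ExampleAIBot`, the domain, and the rules themselves are hypothetical and not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for path in ("/articles/1", "/private/notes"):
            url = f"https://example.com{path}"
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Note that robots.txt is a request rather than an enforcement mechanism: it only has an effect when a crawler chooses to check it, as this function does, before fetching a page.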

Governance of deliberative mini-publics: emerging consensus and divergent views


Paper by Lucy J. Parry, Nicole Curato, and John S. Dryzek: “Deliberative mini-publics are forums for citizen deliberation composed of randomly selected citizens convened to yield policy recommendations. These forums have proliferated in recent years but there are no generally accepted standards to govern their practice. Should there be? We answer this question by bringing the scholarly literature on citizen deliberation into dialogue with the lived experience of the people who study, design and implement mini-publics. We use Q methodology to locate five distinct perspectives on the integrity of mini-publics, and map the structure of agreement and dispute across them. We find that, across the five viewpoints, there is emerging consensus as well as divergence on integrity issues, with disagreement over what might be gained or lost by adapting common standards of practice, and possible sources of integrity risks. This article provides an empirical foundation for further discussion on integrity standards in the future…(More)”.

Precision public health in the era of genomics and big data


Paper by Megan C. Roberts et al: “Precision public health (PPH) considers the interplay between genetics, lifestyle and the environment to improve disease prevention, diagnosis and treatment on a population level—thereby delivering the right interventions to the right populations at the right time. In this Review, we explore the concept of PPH as the next generation of public health. We discuss the historical context of using individual-level data in public health interventions and examine recent advancements in how data from human and pathogen genomics and social, behavioral and environmental research, as well as artificial intelligence, have transformed public health. Real-world examples of PPH are discussed, emphasizing how these approaches are becoming a mainstay in public health, as well as outstanding challenges in their development, implementation and sustainability. Data sciences, ethical, legal and social implications research, capacity building, equity research and implementation science will have a crucial role in realizing the potential for ‘precision’ to enhance traditional public health approaches…(More)”.

Integrating Artificial Intelligence into Citizens’ Assemblies: Benefits, Concerns and Future Pathways


Paper by Sammy McKinney: “Interest in how Artificial Intelligence (AI) could be used within citizens’ assemblies (CAs) is emerging amongst scholars and practitioners alike. In this paper, I make four contributions at the intersection of these burgeoning fields. First, I propose an analytical framework to guide evaluations of the benefits and limitations of AI applications in CAs. Second, I map out eleven ways that AI, especially large language models (LLMs), could be used across a CA’s full lifecycle. This introduces novel ideas for AI integration into the literature and synthesises existing proposals to provide the most detailed analytical breakdown of AI applications in CAs to date. Third, drawing on relevant literature, four key informant interviews, and the Global Assembly on the Ecological and Climate crisis as a case study, I apply my analytical framework to assess the desirability of each application. This provides insight into how AI could be deployed to address existing challenges facing CAs today as well as the concerns that arise with AI integration. Fourth, bringing my analyses together, I argue that AI integration into CAs brings the potential to enhance their democratic quality and institutional capacity, but realising this requires the deliberative community to proceed cautiously, effectively navigate challenging trade-offs, and mitigate important concerns that arise with AI integration. Ultimately, this paper provides a foundation that can guide future research concerning AI integration into CAs and other forms of democratic innovation…(More)”.

The Great Scrape: The Clash Between Scraping and Privacy


Paper by Daniel J. Solove and Woodrow Hartzog: “Artificial intelligence (AI) systems depend on massive quantities of data, often gathered by “scraping” – the automated extraction of large amounts of data from the internet. A great deal of scraped data is about people. This personal data provides the grist for AI tools such as facial recognition, deep fakes, and generative AI. Although scraping enables web searching, archival, and meaningful scientific research, scraping for AI can also be objectionable or even harmful to individuals and society.

Organizations are scraping at an escalating pace and scale, even though many privacy laws are seemingly incongruous with the practice. In this Article, we contend that scraping must undergo a serious reckoning with privacy law. Scraping violates nearly all of the key principles in privacy laws, including fairness; individual rights and control; transparency; consent; purpose specification and secondary use restrictions; data minimization; onward transfer; and data security. With scraping, data protection laws built around these requirements are ignored.

Scraping has evaded a reckoning with privacy law largely because scrapers act as if all publicly available data were free for the taking. But the public availability of scraped data shouldn’t give scrapers a free pass. Privacy law regularly protects publicly available data, and privacy principles are implicated even when personal data is accessible to others.

This Article explores the fundamental tension between scraping and privacy law. With the zealous pursuit and astronomical growth of AI, we are in the midst of what we call the “great scrape.” There must now be a great reconciliation…(More)”.

(Almost) 200 Years of News-Based Economic Sentiment


Paper by Jules H. van Binsbergen, Svetlana Bryzgalova, Mayukh Mukhopadhyay & Varun Sharma: “Using text from 200 million pages of 13,000 US local newspapers and machine learning methods, we construct a 170-year-long measure of economic sentiment at the country and state levels, that expands existing measures in both the time series (by more than a century) and the cross-section. Our measure predicts GDP (both nationally and locally), consumption, and employment growth, even after controlling for commonly-used predictors, as well as monetary policy decisions. Our measure is distinct from the information in expert forecasts and leads its consensus value. Interestingly, news coverage has become increasingly negative across all states in the past half-century…(More)”.