Rejecting Public Utility Data Monopolies


Paper by Amy L. Stein: “The threat of monopoly power looms large today. Although not the telecommunications and tobacco monopolies of old, the Goliaths of Big Tech have become today’s target for potential antitrust violations. It is not only their control over the social media infrastructure and digital advertising technologies that gives people pause, but their monopolistic collection, use, and sale of customer data. But large technology companies are not the only private companies that have exclusive access to your data; that can crowd out competitors; and that can hold, use, or sell your data with little to no regulation. These other private companies are not data companies, platforms, or even brokers. They are public utilities.

Although termed “public utilities,” these entities are overwhelmingly private, shareholder-owned entities. Like private Big Tech, utilities gather incredible amounts of data from customers and use this data in various ways. And like private Big Tech, these utilities can exercise exclusionary and self-dealing anticompetitive behavior with respect to customer data. But there is one critical difference—unlike Big Tech, utilities enjoy an implied immunity from antitrust laws. This state action immunity has historically applied to utility provision of essential services like electricity and heat. As utilities find themselves in the position of unsuspecting data stewards, however, there is a real and unexplored question about whether their long-enjoyed antitrust immunity should extend to their data practices.

As the first exploration of this question, this Article tests the continuing application and rationale of the state action immunity doctrine to the evolving services that a utility provides as the grid becomes digitized. It demonstrates the importance of staunching the creep of state action immunity over utility data practices. And it recognizes the challenges of developing remedies for such data practices that do not disrupt the state-sanctioned monopoly powers of utilities over the provision of essential services. This Article analyzes both antitrust and regulatory remedies, including a new customer-focused “data duty,” as possible mechanisms to enhance consumer (ratepayer) welfare in this space. Exposing utility data practices to potential antitrust liability may be just the lever that is needed to motivate states, public utility commissions, and utilities to develop a more robust marketplace for energy data…(More)”.

This is AI’s brain on AI


Article by Alison Snyder: “Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots’ knowledge gaps but also destabilize them.

The big picture: As AI models expand in size, their need for data becomes insatiable — but high-quality human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology’s developers toward machine-produced alternatives.

State of play: AI-generated data has been used for years to supplement data in some fields, including medical imaging and computer vision, that use proprietary or private data.

  • But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.

Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.

  • Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
  • Google DeepMind’s new AlphaGeometry 2 system, which can solve math Olympiad problems, is trained from scratch on synthetic data…(More)”
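
To make the mechanics concrete, here is a minimal sketch of the blending the article describes: synthetic examples produced by a model are mixed with human-written text before training. The `generate_synthetic` stand-in and the 1:1 mixing ratio are illustrative assumptions, not any company’s actual pipeline.

```python
import random

def generate_synthetic(prompt: str) -> str:
    # Stand-in for a call to a "teacher" model; in practice this would
    # query a large LLM that writes new training text for each prompt.
    return f"Synthetic answer to: {prompt}"

human_examples = [
    "The Treaty of Westphalia was signed in 1648.",
    "Water boils at 100 degrees Celsius at sea level.",
]
prompts = [
    "Explain the Treaty of Westphalia.",
    "At what temperature does water boil?",
]

# Blend model-generated examples with human-written text before training.
# The 1:1 ratio is arbitrary; labs tune this mix empirically.
synthetic_examples = [generate_synthetic(p) for p in prompts]
training_mix = human_examples + synthetic_examples
random.shuffle(training_mix)

for example in training_mix:
    print(example)
```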

Generative Discrimination: What Happens When Generative AI Exhibits Bias, and What Can Be Done About It


Paper by Philipp Hacker, Frederik Zuiderveen Borgesius, Brent Mittelstadt and Sandra Wachter: “Generative AI (genAI) technologies, while beneficial, risk increasing discrimination by producing demeaning content and subtle biases through inadequate representation of protected groups. This chapter examines these issues, categorizing problematic outputs into three legal categories: discriminatory content; harassment; and legally hard cases like harmful stereotypes. It argues for holding genAI providers and deployers liable for discriminatory outputs and highlights the inadequacy of traditional legal frameworks to address genAI-specific issues. The chapter suggests updating EU laws to mitigate biases in training and input data, mandating testing and auditing, and evolving legislation to enforce standards for bias mitigation and inclusivity as technology advances…(More)”.

A.I. May Save Us, or May Construct Viruses to Kill Us


Article by Nicholas Kristof: “Here’s a bargain of the most horrifying kind: For less than $100,000, it may now be possible to use artificial intelligence to develop a virus that could kill millions of people.

That’s the conclusion of Jason Matheny, the president of the RAND Corporation, a think tank that studies security matters and other issues.

“It wouldn’t cost more to create a pathogen that’s capable of killing hundreds of millions of people versus a pathogen that’s only capable of killing hundreds of thousands of people,” Matheny told me.

In contrast, he noted, it could cost billions of dollars to produce a new vaccine or antiviral in response…

In the early 2000s, some of us worried about smallpox being reintroduced as a bioweapon if the virus were stolen from the labs in Atlanta and in Russia’s Novosibirsk region that have retained the virus since the disease was eradicated. But with synthetic biology, now it wouldn’t have to be stolen.

Some years ago, a research team created a cousin of the smallpox virus, horsepox, in six months for $100,000, and with A.I. it could be easier and cheaper to refine the virus.

One reason biological weapons haven’t been much used is that they can boomerang. If Russia released a virus in Ukraine, it could spread to Russia. But a retired Chinese general has raised the possibility of biological warfare that targets particular races or ethnicities (probably imperfectly), which would make bioweapons much more useful. Alternatively, it might be possible to develop a virus that would kill or incapacitate a particular person, such as a troublesome president or ambassador, if one had obtained that person’s DNA at a dinner or reception.

Assessments of ethnic-targeting research by China are classified, but they may be why the U.S. Defense Department has said that the most important long-term threat of biowarfare comes from China.

A.I. has a more hopeful side as well, of course. It holds the promise of improving education, reducing auto accidents, curing cancers and developing miraculous new pharmaceuticals.

One of the best-known benefits is in protein folding, which can lead to revolutionary advances in medical care. Scientists used to spend years or decades figuring out the shapes of individual proteins, and then a Google initiative called AlphaFold was introduced that could predict the shapes within minutes. “It’s Google Maps for biology,” Kent Walker, president of global affairs at Google, told me.

Scientists have since used updated versions of AlphaFold to work on pharmaceuticals including a vaccine against malaria, one of the greatest killers of humans throughout history.

So it’s unclear whether A.I. will save us or kill us first…(More)”.

Future-proofing government data


Article by Amy Jones: “Vast amounts of data are fueling innovation and decision-making, and agencies representing the United States government are custodians of some of the largest repositories of data in the world. As one of the world’s largest data creators and consumers, the federal government has made substantial investments in sourcing, curating, and leveraging data across many domains. However, the increasing reliance on artificial intelligence to extract insights and drive efficiencies necessitates a strategic pivot: agencies must evolve data management practices to identify synthetic data and distinguish it from organic sources, safeguarding the integrity and utility of data assets.

AI’s transformative potential is contingent on the availability of high-quality data. Data readiness includes attention to quality, accuracy, completeness, consistency, timeliness and relevance, at a minimum, and agencies are adopting robust data governance frameworks that enforce data quality standards at every stage of the data lifecycle. This includes implementing advanced data validation techniques, fostering a culture of data stewardship, and leveraging state-of-the-art tools for continuous data quality monitoring…(More)”.
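
As a hedged illustration of what such validation might look like, the sketch below checks completeness, timeliness, and provenance for toy records. The record schema, the one-year staleness threshold, and the `model_generated` provenance flag are assumptions for demonstration, not any agency’s actual standard.

```python
from datetime import datetime, timedelta

# Illustrative record schema; real agency data would be far richer.
records = [
    {"id": 1, "value": 42.0, "source": "survey", "updated": datetime(2024, 6, 1)},
    {"id": 2, "value": None, "source": "model_generated", "updated": datetime(2020, 1, 1)},
]

MAX_AGE = timedelta(days=365)

def validate(record: dict, now: datetime) -> list:
    """Return a list of data-quality issues found in one record."""
    issues = []
    if record["value"] is None:                 # completeness check
        issues.append("missing value")
    if now - record["updated"] > MAX_AGE:       # timeliness check
        issues.append("stale record")
    if record["source"] == "model_generated":   # provenance check
        issues.append("synthetic origin; review before AI use")
    return issues

now = datetime(2024, 7, 1)
for rec in records:
    print(rec["id"], validate(rec, now) or "ok")
```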

The problem of ‘model collapse’: how a lack of human data limits AI progress


Article by Michael Peel: “The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology. 

Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems that is then used to train large language models (LLMs) — as they reach the limits of human-made material that can improve the cutting-edge technology.

Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. 

The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions of what will happen once those finite sources are exhausted. 

“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”

The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. 

The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish…(More)”.
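
The “loss of variance” dynamic is easy to reproduce in a toy simulation: fit a simple statistical model to data, sample the next generation’s training set only from that model, and repeat. The sketch below illustrates the general mechanism under those assumptions; it is not the Nature paper’s actual experiment.

```python
import random
import statistics

# Generation 0: "human" data drawn from a wide distribution.
data = [random.gauss(0.0, 1.0) for _ in range(50)]

for generation in range(1, 21):
    # Fit a simple model (a Gaussian: mean and standard deviation).
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    # Train the next generation only on the model's own output:
    # pure synthetic data, with no fresh human data mixed in.
    data = [random.gauss(mu, sigma) for _ in range(50)]
    if generation % 5 == 0:
        print(f"generation {generation:2d}: stdev = {sigma:.3f}")

# The standard deviation tends to drift toward zero across generations:
# rare tail values stop being re-sampled, so diversity is gradually lost.
```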

When A.I. Fails the Language Test, Who Is Left Out of the Conversation?


Article by Sara Ruberg: “While the use of A.I. has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. A.I. experts worry that the language gap could exacerbate technological inequities, and that it could leave many regions and cultures behind.

A delay of even a few years in access to good technology “can potentially lead to a few decades of economic delay,” said Sang Truong, a Ph.D. candidate at the Stanford Artificial Intelligence Laboratory who is on the team that built and tested a Vietnamese language model against others.

The tests his team ran found that A.I. tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the A.I. model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because A.I. tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up over 60 percent of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5 percent of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grass-roots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.

Lelapa AI, based in Johannesburg, is one company leading such efforts on the African continent. The start-up is developing multilingual A.I. products for people and businesses in Africa…(More)”.

Feeding the Machine: The Hidden Human Labor Powering A.I.


Book by Mark Graham, Callum Cant, and James Muldoon: “Silicon Valley has sold us the illusion that artificial intelligence is a frictionless technology that will bring wealth and prosperity to humanity. But hidden beneath this smooth surface lies the grim reality of a precarious global workforce of millions laboring under often appalling conditions to make A.I. possible. This book presents an urgent, riveting investigation of the intricate network that maintains this exploitative system, revealing the untold truth of A.I.

Based on hundreds of interviews and thousands of hours of fieldwork over more than a decade, Feeding the Machine describes the lives of the workers deliberately concealed from view, and the power structures that determine their future. It gives voice to the people whom A.I. exploits, from accomplished writers and artists to the armies of data annotators, content moderators and warehouse workers, revealing how their dangerous, low-paid labor is connected to longer histories of gendered, racialized, and colonial exploitation.

A.I. is an extraction machine that feeds off humanity’s collective effort and intelligence, churning through ever-larger datasets to power its algorithms. This book is a call to arms that details what we need to do to fight for a more just digital future…(More)”.

AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the ordering in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.
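
As a hedged sketch of this filtering-and-sequencing idea, the toy pipeline below drops low-quality documents and interleaves topics rather than lumping one subject at the end. The documents, quality scores, and threshold are invented for illustration; real labs use learned quality classifiers and far subtler curricula.

```python
from collections import defaultdict
from itertools import zip_longest

# Toy corpus: each document has text, a topic and a crude quality score.
corpus = [
    {"text": "A textbook proof of the Pythagorean theorem.", "topic": "math", "quality": 0.9},
    {"text": "buy cheap pills now!!!", "topic": "spam", "quality": 0.1},
    {"text": "A news report on a trade agreement.", "topic": "news", "quality": 0.7},
    {"text": "Step-by-step calculus exercises.", "topic": "math", "quality": 0.8},
]

# 1. Filter: keep only documents above a quality threshold, rather than
#    training on the entire crawl.
filtered = [doc for doc in corpus if doc["quality"] >= 0.5]

# 2. Sequence: interleave topics instead of lumping one subject at the
#    end, so the model does not over-specialise on what it saw last.
by_topic = defaultdict(list)
for doc in filtered:
    by_topic[doc["topic"]].append(doc)

interleaved = [doc for group in zip_longest(*by_topic.values())
               for doc in group if doc is not None]

for doc in interleaved:
    print(f'{doc["topic"]:5s} | {doc["text"]}')
```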

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.

Anonymization: The imperfect science of using data while preserving privacy


Paper by Andrea Gadotti et al.: “Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today…(More)”.
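
One of the modern approaches the review surveys, differential privacy, can be shown in a minimal sketch: a count query is released with Laplace noise scaled to the query’s sensitivity (1 for a count) divided by the privacy parameter epsilon. The survey question and the epsilon value below are illustrative assumptions, not the paper’s example.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values: list, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    return sum(values) + laplace_noise(1.0 / epsilon)

# 1,000 simulated survey respondents; True = reports a given condition.
responses = [random.random() < 0.3 for _ in range(1000)]
print("true count:   ", sum(responses))
print("private count:", round(private_count(responses, epsilon=0.5), 1))
```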