Article by E. Glen Weyl and Raul Castro Fernandez: “The fight over the data that trains artificial intelligence has become one of the defining economic conflicts of the decade. Publishers, authors, and visual artists argue that their work was taken without permission or payment. AI companies counter that training on available data constitutes fair use and that even if a market in data were desirable, compensating millions of creators is technically impossible: the cost of figuring out what any given piece of data is worth, researchers have argued, would swallow most of the value that data creates in the first place.
Both sides stand to benefit from a fair solution to this impasse and the creation of a sustainable market for content. Any resolution must take both positions seriously while seeing past their literal inconsistency.
On the compensation issue, while content creators are justified in defending their livelihoods, a market that ensures fair compensation going forward will arguably serve them better than the payouts for past infractions that existing lawsuits have focused on. And for AI companies, a continued supply of the high-quality data the sector needs for future models, together with legal certainty, is worth more than whatever they save by not paying creators now. As to technical feasibility, while the data-valuation techniques proposed in research so far are impractical, industry leaders have known since at least 2021 (per documents from Anthropic’s Chris Olah and Dario Amodei that surfaced in legal discovery for one of the lawsuits against the company) that low-cost methods exist that could create a thriving market.
This article, which is based on our research, describes how a sustainable market for compensating content creators could work, and why it addresses an important part of the social and economic concerns about an AI future. The fact is that AI companies already produce the two data sets required for pricing content, as a matter of course, every time a model is trained.
The first is the data mixture: the proportions in which a model builder blends different kinds of data, which reveal the relative value of each source. For example, quality journalism may be weighted more heavily than comments on social media, indicating that it is more valuable, and putting a specific number on how much more valuable. The second is scaling laws: the empirical regularities that AI researchers estimate to predict how model performance will respond to additional data and compute. Such estimates, together with economic theory, reveal what share of a model’s total value can be attributed to its training data. Together, the two show how to slice the pie and how big it is…(More)”.
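To make the "slice the pie" idea concrete, here is a minimal sketch of how the two inputs might combine. It is an illustration under stated assumptions, not the authors' method: the excerpt does not give a formula. The sketch assumes (1) that the data share of value is the data exponent divided by the sum of the data and compute exponents, as in a Cobb-Douglas-style factor-share argument, and (2) that the data pool is split across sources in proportion to their mixture weights. All function names, parameter names, and numbers are hypothetical.

```python
from typing import Dict


def data_share(data_exponent: float, compute_exponent: float) -> float:
    """Fraction of model value attributed to data.

    ASSUMPTION: a Cobb-Douglas-style factor share, where each input's
    share equals its exponent divided by the sum of exponents. The
    source article does not specify this formula.
    """
    if data_exponent < 0 or compute_exponent < 0:
        raise ValueError("exponents must be non-negative")
    total = data_exponent + compute_exponent
    if total == 0:
        raise ValueError("at least one exponent must be positive")
    return data_exponent / total


def split_payments(
    total_value: float,
    mixture: Dict[str, float],
    data_exponent: float,
    compute_exponent: float,
) -> Dict[str, float]:
    """Split the data pool across sources by mixture weight.

    ASSUMPTION: a source's payment is proportional to its weight in
    the training mixture (weights are normalized here).
    """
    weight_sum = sum(mixture.values())
    if weight_sum <= 0:
        raise ValueError("mixture weights must sum to a positive number")
    pool = total_value * data_share(data_exponent, compute_exponent)
    return {source: pool * w / weight_sum for source, w in mixture.items()}


# Hypothetical numbers, for illustration only.
example_mixture = {"journalism": 0.3, "social_media": 0.1, "code": 0.6}
payments = split_payments(
    total_value=1_000_000.0,
    mixture=example_mixture,
    data_exponent=0.3,
    compute_exponent=0.3,
)
```

With these made-up inputs, half of the total value goes to the data pool, and journalism receives three times what social media does, mirroring its threefold weight in the mixture. A real scheme would have to decide whether weights apply per token, per document, or per source, which this sketch leaves open.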