
AI Scraping and the Open Web

Article by James Grimmelmann: “…In response to these tussles, various groups have started trying to create new versions of robots.txt for the AI age. Many of these proposals focus on making REP more granular. Instead of just a binary decision—allow or disallow access—they add mechanisms for websites to place conditions on the use of the content scraped from them. This is not the first such attempt—a group of publishers proposed a system called the Automated Content Access Protocol in 2006 that was never widely adopted—but these new ones have more industry support and momentum.
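For context, a classic robots.txt under REP can express only that binary choice. A minimal sketch (the crawler name and paths are illustrative):

    # Block one named crawler entirely; leave the site open to everyone else.
    User-agent: ExampleBot
    Disallow: /

    User-agent: *
    Disallow: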

Cloudflare’s Content Signals Policy (CSP) extends robots.txt with new syntax to differentiate using scraped content for search engines, AI model training, and AI inference. A group of publishers and content platforms has backed a more complicated set of extensions called Really Simple Licensing (RSL) that also includes restrictions on allowed users (for example, personal versus commercial versus educational) and countries or regions (for example, the U.S. but not the EU). And Creative Commons (disclosure: I am a member of its Board of Directors) is exploring a set of “preference signals” that would allow reuse of scraped content under certain conditions (for example, that any AI-generated outputs provide appropriate attribution of the source of data).
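To make the contrast concrete, here is a sketch of how a Content Signals Policy file might look, modeled on Cloudflare’s published examples (the exact directive names and values are Cloudflare’s proposal and may evolve):

    # Express how fetched content may be used, separately from access itself.
    Content-Signal: search=yes, ai-input=no, ai-train=no

    User-agent: *
    Disallow: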

At the same time, some of these same groups are trying to extend REP into something more ambitious: a framework for websites and scrapers to negotiate payment and content licensing terms. Cloudflare is experimenting with using the HTTP response code 402 Payment Required to direct scrapers into a “pay per crawl” system. RSL, for its part, includes detailed provisions for publishers to specify commercial licensing terms; for example, they might require scrapers to pay a specified fee per AI output generated from the content.
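In outline, the negotiation might look like the following Python sketch. The header names and price format here are hypothetical illustrations of the pay-per-crawl idea, not Cloudflare’s actual interface:

    import requests

    URL = "https://example.com/article"  # hypothetical paywalled resource

    # First attempt: declare the most we are willing to pay per fetch
    # (the header name is hypothetical).
    resp = requests.get(URL, headers={"crawler-max-price": "0.01 USD"})

    if resp.status_code == 402:
        # The server names its price in a response header (also hypothetical).
        price = resp.headers.get("crawler-price", "")
        # Retry, explicitly accepting that exact price.
        resp = requests.get(URL, headers={"crawler-exact-price": price})

    if resp.ok:
        print(resp.text)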

Going even further, other extensions to RSL include protocols for crawlers to authenticate themselves, and for sites to provide trusted crawlers with access to encrypted content. This is a full-fledged copyright licensing scheme built on the foundation—or perhaps on the ruins—of REP.
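The authentication half of that scheme generally amounts to a crawler signing its requests with a key the site can verify against some trusted registry. A generic Python sketch of the idea, assuming Ed25519 signatures and a hypothetical header name (an illustration of the concept, not the RSL wire format):

    import base64

    import requests
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # In practice the key pair is long-lived and the public half is
    # registered somewhere the site operator trusts.
    private_key = Ed25519PrivateKey.generate()

    url = "https://example.com/article"  # hypothetical protected resource
    message = f"GET {url}".encode()      # the bytes the signature covers

    signature = base64.b64encode(private_key.sign(message)).decode()

    # The site verifies the signature before serving (or decrypting) content.
    resp = requests.get(url, headers={"Crawler-Signature": signature})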

Preserving the Best of the Open Web

CSP, RSL, and similar proposals are a meaningful improvement over the current struggle between websites and AI companies. They could greatly reduce the technical burdens of rampant scraping, and they could resolve many disputes through licensing rather than litigation. A future where AI companies and authors agree on payment for training data is better than a future where the AI companies just grab everything they can and the authors respond only by suing.

But at the same time, RSL and similar proposals move away from something beautiful about REP: its commitment to the open web. The world of robots.txt was one where it was simply expected, as a matter of course, that people would put content on webpages and share it freely with the world. The legal system protected websites against egregious abuses—like denial-of-service attacks, or wholesale piracy—but it treated ordinary scraping as mostly harmless…(More)”

