Cloudflare Introduces Default Blocking of A.I. Data Scrapers

Article by Natallie Rocha: “Data for A.I. systems has become an increasingly contentious issue. OpenAI, Anthropic, Google and other companies building A.I. systems have amassed reams of information from across the internet to train their A.I. models. High-quality data is particularly prized because it helps A.I. models become more proficient in generating accurate answers, videos and images.

But website publishers, authors, news organizations and other content creators have accused A.I. companies of using their material without permission and payment. Last month, Reddit sued Anthropic, saying the start-up had unlawfully used the data of its more than 100 million daily users to train its A.I. systems. In 2023, The New York Times sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to A.I. systems. OpenAI and Microsoft have denied those claims.

Some publishers have struck licensing deals with A.I. companies to receive compensation for their content. In May, The Times agreed to license its editorial content to Amazon for use in the tech giant’s A.I. platforms. Axel Springer, Condé Nast and News Corp have also entered into agreements with A.I. companies to receive revenue for the use of their material.

Mark Howard, the chief operating officer of Time, said he welcomed Cloudflare’s move. Data scraping by A.I. companies threatens anyone who creates content, he said, adding that news publishers like Time deserved fair compensation for what they published…”