AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. ai labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an ai firm, says that this is the “main differentiator” between ai models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the ordering in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like Openai’s gpt-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training llms is often copyrighted and used without consent from, or payment to, the rights holders. Some ai models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. ai models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in ai. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.