Article by Alison Snyder: “Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots’ knowledge gaps but also destabilize them.
The big picture: As AI models expand in size, their need for data becomes insatiable. But high-quality human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology’s developers toward machine-produced alternatives.
State of play: AI-generated data has been used for years to supplement data in some fields, including medical imaging and computer vision, that use proprietary or private data.
- But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.
Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.
- Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
- Google DeepMind’s new AlphaGeometry 2 system, which can solve math Olympiad problems, is trained from scratch on synthetic data…(More)”