Article by Rita Matulionyte: “As artificial intelligence (AI) reaches the peak of its popularity, researchers have warned the industry might be running out of training data – the fuel that runs powerful AI systems. This could slow down the growth of AI models, especially large language models, and may even alter the trajectory of the AI revolution.
But why is a potential lack of data an issue, considering how much there are on the web? And is there a way to address the risk?…
We need a lot of data to train powerful, accurate and high-quality AI algorithms. For instance, ChatGPT was trained on 570 gigabytes of text data, or about 300 billion words.
Similarly, the stable diffusion algorithm (which is behind many AI image-generating apps such as DALL-E, Lensa and Midjourney) was trained on the LIAON-5B dataset comprising of 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.
The quality of the training data is also important…This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.
The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT or DALL-E 3. At the same time, research shows online data stocks are growing much slower than datasets used to train AI.
In a paper published last year, a group of researchers predicted we will run out of high-quality text data before 2026 if the current AI training trends continue. They also estimated low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060.
AI could contribute up to US$15.7 trillion (A$24.1 trillion) to the world economy by 2030, according to accounting and consulting group PwC. But running out of usable data could slow down its development…(More)”.