Article by Aatish Bhatia: “The internet is becoming awash in words and images generated by artificial intelligence.
Sam Altman, OpenAI’s chief executive, wrote in February that the company generated about 100 billion words per day — a million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.
A.I.-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a group that tracks online misinformation, recently identified over a thousand websites that churn out error-prone A.I.-generated news articles.
In reality, with no foolproof methods to detect this kind of content, much of it will simply go undetected.
All this A.I.-generated information can make it harder for us to know what’s real. And it also poses a problem for A.I. companies. As they trawl the web for new data to train their next models on — an increasingly challenging task — they’re likely to ingest some of their own A.I.-generated content, creating an unintentional feedback loop in which what was once the output from one A.I. becomes the input for another.
In the long run, this cycle may pose a threat to A.I. itself. Research has shown that when generative A.I. is trained on a lot of its own output, it can get a lot worse.
Here’s a simple illustration of what happens when an A.I. system is trained on its own output, over and over again:
This is part of a data set of 60,000 handwritten digits.
When we trained an A.I. to mimic those digits, its output looked like this.
This new set was made by an A.I. trained on the previous A.I.-generated digits. What happens if this process continues?
After 20 generations of training new A.I.s on their predecessors’ output, the digits blur and start to erode.
After 30 generations, they converge into a single shape.
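To make the loop concrete, here is a minimal sketch of that generation-by-generation retraining process. It is not the article's actual demonstration: the generative model (PCA plus a Gaussian mixture), the scikit-learn loader for the standard 60,000-digit handwritten training set, and the pixel-spread diagnostic are all stand-in assumptions chosen to keep the example short and runnable; the real experiment may have used a different model and pipeline.

```python
# Sketch of recursive self-training: each generation is fit only on the
# previous generation's synthetic output. Model choice and data loader
# are assumptions, not the article's actual setup.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def load_digits_flat():
    """Load the 60,000 handwritten training digits as flat pixel vectors."""
    images, _ = fetch_openml(
        "mnist_784", version=1, return_X_y=True, as_frame=False
    )
    return images[:60000] / 255.0  # scale pixels to [0, 1]


def train_generation(samples, n_components=30, seed=0):
    """Fit a simple generative model to `samples` and draw a fresh set."""
    pca = PCA(n_components=n_components, random_state=seed)
    reduced = pca.fit_transform(samples)
    model = GaussianMixture(n_components=10, random_state=seed).fit(reduced)
    new_reduced, _ = model.sample(len(samples))
    return np.clip(pca.inverse_transform(new_reduced), 0.0, 1.0)


def run_generations(n_generations=30):
    """Repeatedly retrain on the previous generation's output."""
    data = load_digits_flat()
    for gen in range(n_generations):
        data = train_generation(data, seed=gen)
        # A crude diagnostic: as the generations converge toward a single
        # blurry shape, the spread across samples shrinks.
        print(f"generation {gen + 1}: pixel std = {data.std():.4f}")
    return data


if __name__ == "__main__":
    run_generations()
```

Under these assumptions, each pass discards a little of the original data's variety, which is why the printed spread tends to shrink and the sampled digits drift toward a single averaged shape.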
While this is a simplified example, it illustrates a problem on the horizon.
Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction…(More)”.