Thanks to a boom in generative artificial intelligence, programs that can produce text, computer code, images and music are readily available to the average person. And we’re already using them: AI content is taking over the Internet, and text generated by “large language models” is filling hundreds of websites, including CNET and Gizmodo. But as AI developers scrape the Internet, AI-generated content may soon enter the data sets used to train new models to respond like humans. Some experts say that will inadvertently introduce errors that build up with each succeeding generation of models.
A growing body of evidence suggests that a training diet of AI-generated text, even in small quantities, eventually becomes “poisonous” to the model being trained, and currently there are few obvious antidotes. “While it may not be an issue right now or in, let’s say, a few months, I believe it will become a consideration in a few years,” says Rik Sarkar, a computer scientist at the School of Informatics at the University of Edinburgh in Scotland.
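To see how such errors compound, consider a toy sketch (our illustration, not the researchers’ method): treat a “model” as nothing more than a Gaussian fitted to its training data, then train each generation only on samples from the previous one. Real language models are vastly more complex, but the feedback loop is analogous.

```python
# Toy illustration of recursive training on synthetic data.
# Each generation "learns" only from the previous generation's output,
# so every round's estimation error is baked into the next round's data.
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "human" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=1000)

for generation in range(10):
    # "Train" a model: estimate the mean and standard deviation of the data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")
    # The next generation sees only synthetic samples from this fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=1000)
```

Run this and the fitted parameters drift away from the true values of 0 and 1, with the spread of the data tending to shrink: rare, tail-of-the-distribution examples stop being sampled, so later generations never see them. That narrowing is the flavor of degradation researchers worry about when AI output feeds back into training sets.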