AI Models Risk "Collapse" When Trained on AI-Generated Data, Study Warns
A new study published in Nature has raised alarm bells about the future of artificial intelligence (AI) development, warning that training AI models on AI-generated text can lead to rapid deterioration in output quality. This phenomenon, dubbed "model collapse," could potentially halt progress in large language models (LLMs) as they exhaust human-derived training data and increasingly encounter AI-generated content online.
Researchers from the University of Cambridge and the University of Oxford conducted the study, which demonstrates how successive generations of AI models trained on synthetic data quickly devolve into producing nonsensical output. The findings have significant implications for the AI industry, which has largely relied on ever-increasing amounts of data to improve model performance.
"The message is, we have to be very careful about what ends up in our training data," warns co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge. "Otherwise, things will always, provably, go wrong."
The study's methodology involved using an initial LLM to create Wikipedia-style entries, then training new iterations of the model on text produced by its predecessor. As AI-generated information, or "synthetic data," contaminated the training set, the model's outputs became increasingly incoherent. By the ninth iteration, the model was producing gibberish: an article about English church towers, for example, veered into a treatise on jackrabbit tail colours.
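To make the recursive setup concrete, the toy simulation below (illustrative only, not the study's code or data) stands in for an LLM with a simple unigram word model: each generation is fitted to text sampled from its predecessor, and the surviving vocabulary shrinks as the loop repeats.

```python
import random
from collections import Counter

random.seed(0)

# Stand-in for human-written text: 2,000 word types with a heavy-tailed
# frequency profile (a few common words, many rare ones). Purely illustrative.
human_text = [f"word{i}" for i in range(2000) for _ in range(2000 // (i + 1))]

def train(corpus):
    """Stand-in for LLM training: fit unigram word frequencies to the corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(model, n_tokens=30_000):
    """Stand-in for LLM sampling: draw tokens from the fitted frequencies."""
    words, weights = zip(*model.items())
    return random.choices(words, weights=weights, k=n_tokens)

corpus = human_text
for generation in range(10):
    model = train(corpus)      # train this generation on the previous corpus
    corpus = generate(model)   # its output becomes the next generation's training data
    print(f"generation {generation}: {len(set(corpus))} word types survive")
```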
More subtly, the researchers observed that even before complete collapse, models trained on AI-derived texts began to lose information about less frequently mentioned topics. This raises concerns about fairness and representation in AI systems, as co-author Ilia Shumailov explains: "Low-probability events often relate to marginalized groups."
The study's findings have sent ripples through the AI community. Julia Kempe, a computer scientist at New York University, describes it as "a fantastic paper" and "a call to arms" for researchers to find solutions to the problem.
The collapse occurs because each model iteration samples only from its training data, amplifying errors and biases with each generation. Common words become more prevalent, while rarer terms are increasingly omitted. Hany Farid, a computer scientist at the University of California, Berkeley, likens the phenomenon to genetic inbreeding: "If a species inbreeds with their own offspring and doesn't diversify their gene pool, it can lead to a collapse of the species."
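A rough back-of-the-envelope calculation shows why the tail disappears. If a word occurs with probability p in the true distribution and each generation's training corpus contains N sampled tokens, the word is absent from that corpus with probability roughly (1 − p)^N, and once absent it can never be regenerated. The numbers below are illustrative assumptions, not figures from the study.

```python
import math

# Illustrative assumptions (not figures from the study): a rare word with
# probability p in human text, and N sampled tokens per synthetic corpus.
p = 1e-6
N = 1_000_000

# Probability the word never appears in one generation's training corpus.
miss_once = (1 - p) ** N
print(f"chance of missing in one generation: {miss_once:.3f} "
      f"(approx. exp(-p*N) = {math.exp(-p * N):.3f})")

# Once the word is missing, a model that samples only from its training data
# cannot reintroduce it, so losses compound across generations. Holding p fixed
# is a simplification; a surviving word's frequency can also drift downward.
for g in range(1, 10):
    survives = (1 - miss_once) ** g
    print(f"after {g} generation(s): survival probability ~ {survives:.3f}")
```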
While model collapse doesn't mean LLMs will cease functioning entirely, it does suggest that the cost and difficulty of improving them will increase. The study challenges the long-held assumption that more data invariably leads to better AI performance.
"As synthetic data build up in the web, the scaling laws that state that models should get better the more data they train on are likely to break," Kempe notes. This is because training data will lose the richness and variety inherent in human-generated content.
The research team found that including a small percentage of real data alongside synthetic data slowed the collapse but did not prevent it entirely. A separate study by Stanford University researchers suggested that when synthetic data accumulates alongside real data rather than replacing it, catastrophic model collapse is less likely.
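The difference between replacing training data with synthetic text and accumulating synthetic text on top of the original data can be seen in the same toy setting. The sketch below (again purely illustrative, with assumed corpus sizes) compares the two regimes.

```python
import random
from collections import Counter

random.seed(0)

# Same toy setup as the earlier sketch: a heavy-tailed "human" vocabulary.
human = [f"word{i}" for i in range(2000) for _ in range(2000 // (i + 1))]

def train(corpus):
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(model, n_tokens=30_000):
    words, weights = zip(*model.items())
    return random.choices(words, weights=weights, k=n_tokens)

def run(accumulate, generations=10):
    corpus = list(human)
    for _ in range(generations):
        synthetic = generate(train(corpus))
        # "replace": each generation trains only on its predecessor's output.
        # "accumulate": synthetic text is added on top of everything seen so far.
        corpus = (corpus + synthetic) if accumulate else synthetic
    return len(set(generate(train(corpus))))

for accumulate in (False, True):
    label = "accumulate" if accumulate else "replace"
    print(f"{label}: {run(accumulate)} word types in the final generation's output")
```

The difference between the two regimes comes down to whether the original human text stays in the training mix; real LLM training dynamics are, of course, far more complex than unigram resampling.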
However, the long-term implications remain concerning. As AI-generated content proliferates online, distinguishing between human-created and AI-produced text will become increasingly challenging. This could lead to a feedback loop in which AI models are inadvertently trained on more and more synthetic data.
To address this issue, the study's authors suggest several potential solutions. These include developing methods to watermark AI-generated content, creating incentives for humans to continue producing original content, and implementing rigorous filtering and curation processes for training data.
"Our work shows that if you can prune it properly, the phenomenon can be partly or maybe fully avoided," Kempe states, referring to the careful selection and filtering of training data.