Why Some Experts Believe Synthetic Data Will Degrade Future Models

The proliferation of AI-generated junk web pages poses a significant challenge to future AI development, because training on increasingly synthetic data degrades output quality and can culminate in “model collapse.”

Key takeaways from the research: A study published in Nature demonstrates that an AI model’s output quality gradually deteriorates when the model is trained on data generated by other AI models:

  • The effect compounds as each model’s output becomes training data for its successors, a feedback loop the researchers liken to photographing a photograph until only a dark square remains; they call the end state “model collapse.” (A toy simulation of this loop appears after this list.)
  • Large AI models, such as GPT-3, which rely on internet data for training, are particularly susceptible to this issue as the number of AI-generated junk websites continues to grow.
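
The feedback loop is easy to reproduce in miniature. The toy simulation below is our illustration, not code from the Nature study: it repeatedly fits a one-dimensional Gaussian “model” to samples drawn from the previous generation’s fit. Each small sample slightly misestimates the spread, later generations never see the original data, and the errors compound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # deliberately small training set per generation

# Generation 0: "human" data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=n)

for gen in range(1, 101):
    # "Train" this generation's model: fit a mean and spread to its data.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on the current model's output.
    data = rng.normal(loc=mu, scale=sigma, size=n)
    if gen % 20 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.4f}")
```

The fitted sigma wanders from run to run but trends downward, eventually collapsing toward zero: the statistical analogue of photographing a photograph until only a dark square remains.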

Implications for AI model performance: While current AI models may not face immediate collapse, the research suggests that there could be significant effects on their performance and development:

  • Improvements in AI models may slow down, and their performance might suffer as a result of training on increasingly synthetic data.
  • Information about minority groups and underrepresented languages could be heavily distorted, because models overfocus on the most prevalent samples in their training data (the sketch after this list shows how rare categories can vanish entirely).
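
A second toy sketch, again illustrative rather than from the paper, shows why rare material is hit hardest. Here each generation “re-learns” a category mix, say one dominant language and two minority ones, from a finite sample of the previous generation’s output; the proportions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corpus mix: one dominant language/topic and two minority ones.
probs = np.array([0.90, 0.08, 0.02])
n = 50  # documents sampled per generation

for gen in range(1, 21):
    counts = rng.multinomial(n, probs)
    probs = counts / n  # the next model's learned mix = observed frequencies
    print(f"generation {gen:2d}: mix = {np.round(probs, 3)}")
```

Minority shares drift at random, and zero is an absorbing state: once a rare category draws no samples in some generation, the next model assigns it zero probability and can never produce it again.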

Potential solutions and challenges: Researchers propose several ideas to mitigate the negative effects of training on AI-generated data, but these solutions come with their own challenges:

  • Giving more weight to the original human-generated data during training could slow the degradation, but this requires a reliable way to distinguish human-generated from AI-generated content on the internet (a minimal sketch of such weighting appears after this list).
  • Creating a trail from the original human-generated data through later generations, known as data provenance, is another candidate, but accurately determining whether a given text is AI-generated remains an open problem.
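
As a concrete illustration of the first idea, the hypothetical sketch below upweights human-origin records when sampling training batches. It assumes every record already carries a trustworthy origin tag, which is exactly the provenance problem the researchers flag as unsolved; the field names and weights are invented for illustration.

```python
import random

# Hypothetical corpus records tagged with provenance; producing a
# reliable "origin" tag at web scale is the unsolved part.
corpus = [
    {"text": "a human-written document", "origin": "human"},
    {"text": "a model-generated document", "origin": "synthetic"},
    # ... many more records ...
]

# Illustrative weights: favor human data, discount synthetic data.
ORIGIN_WEIGHTS = {"human": 1.0, "synthetic": 0.2}

def sample_batch(records, batch_size):
    """Draw a training batch (with replacement), upweighting human-origin records."""
    weights = [ORIGIN_WEIGHTS[r["origin"]] for r in records]
    return random.choices(records, weights=weights, k=batch_size)

batch = sample_batch(corpus, batch_size=8)
```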

Broader implications for the future of AI: As AI models continue to rely on large-scale data for optimal performance, the increasing presence of AI-generated content on the internet raises important questions about the quality and diversity of training data:

  • The diminishing returns of crawling more web data may lead to a growing reliance on synthetic data, which could exacerbate the issues highlighted in the research.
  • Ensuring the trustworthiness and representativeness of training data will be crucial for the future development of AI models, but the path forward remains uncertain, with more questions than answers at present.
Source: “AI trained on AI garbage spits out AI garbage” (MIT Technology Review)
