New research shows Big Tech still isn’t fairly compensating news agencies for AI training data

AI training relies heavily on news content: Recent research by Ziff Davis reveals that major AI companies are increasingly prioritizing content from reputable news sources when training their large language models.

  • Google, OpenAI, and Meta are among the tech giants placing greater emphasis on high-quality news content for AI training purposes.
  • The study examined open-source replicas of datasets commonly used by AI companies, including Common Crawl, C4, OpenWebText, and OpenWebText2.
  • OpenAI, in particular, gives more weight to high-quality datasets, including news media, copyrighted books, and popular Reddit posts when training its models.

Quantifying media’s importance in AI development: The study provides concrete figures demonstrating the significant role that news media plays in the creation of AI chatbots and language models.

  • Nearly 13.5% of URLs in WebText2, a dataset used for AI training, come from just 15 top media publishers.
  • This quantification highlights the extent to which AI companies rely on news content to improve their models’ performance and knowledge base.
  • Despite this reliance, there is currently no legal obligation for AI companies to compensate publishers for the use of their content in training datasets.
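
A share like the one cited above is a ratio: the number of URLs whose host belongs to a chosen set of publishers, divided by the total number of URLs in the dataset. The following is a minimal sketch of that arithmetic, not the study's actual method. The function name `publisher_share`, the sample URLs, and the domain `example-news.com` are all hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def publisher_share(urls, publisher_domains):
    """Return the fraction of URLs hosted on one of the given domains.

    A URL counts as a match when its host equals a publisher domain
    or is a subdomain of it. Hypothetical helper for illustration only.
    """
    domains = {d.lower() for d in publisher_domains}
    if not urls:
        return 0.0
    hits = 0
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            hits += 1
    return hits / len(urls)


# Made-up sample data; a real dataset would hold millions of URLs.
sample_urls = [
    "https://www.example-news.com/story/1",
    "https://blog.example-news.com/post",
    "https://example-forum.org/thread/9",
    "http://another-site.net/page",
]
share = publisher_share(sample_urls, ["example-news.com"])
print(f"{share:.1%}")
```

Matching on the host rather than on the full URL string avoids counting pages that merely mention a publisher's domain in their path or query string.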

Potential impact on the publishing industry: The use of news content for AI training without compensation raises concerns about the future of journalism and media organizations.

  • Publishers may face lost licensing revenue as their content is used to train AI models without payment.
  • There are fears that this practice could put some media outlets out of business if they are not adequately compensated for their contributions to AI development.
  • The situation highlights the growing tension between the tech and media industries as AI continues to advance and rely on high-quality content.

Legal landscape and ongoing disputes: Recent legal actions have brought the issue of content usage in AI training to the forefront, with mixed results so far.

  • A federal judge recently dismissed a lawsuit against OpenAI filed by Raw Story and AlterNet, which alleged unauthorized use of their content in AI training.
  • However, a similar case brought by The New York Times against OpenAI is still ongoing, indicating that the legal questions surrounding this issue are far from settled.
  • These legal battles underscore the complex interplay between copyright law, fair use, and the rapidly evolving field of AI technology.

Industry responses and adaptations: Some AI companies are taking steps to address concerns and establish more transparent practices regarding content usage.

  • OpenAI has begun signing licensing deals with certain media companies, potentially setting a precedent for future collaborations between AI developers and content creators.
  • The company’s ChatGPT search feature now includes citations for some sources when summarizing content, increasing transparency and potentially providing a model for attributing information used in AI responses.
  • These actions suggest a growing recognition within the AI industry of the need to address concerns about content usage and attribution.

Implications for future negotiations: The findings of the Ziff Davis study could significantly impact the ongoing debate over content usage in AI training.

  • Media companies may leverage this data to strengthen their arguments for copyright protection or compensation for their content used in AI training.
  • The quantification of news media’s importance in AI development could provide publishers with more bargaining power in negotiations with tech companies.
  • This research may contribute to shaping future policies and industry standards regarding the use of copyrighted content in AI training datasets.

Broader context and potential outcomes: The reliance on news content for AI training highlights the intricate relationship between technology and journalism in the digital age.

  • As AI continues to advance, the value of high-quality, factual content becomes increasingly apparent, potentially leading to a renewed appreciation for professional journalism.
  • The outcome of current legal disputes and industry negotiations could set important precedents for how intellectual property is handled in the age of AI.
  • Balancing the needs of AI development with the sustainability of quality journalism will likely remain a critical challenge for both industries in the coming years.
