AI training relies heavily on news content: Recent research by Ziff Davis reveals that major AI companies are increasingly prioritizing content from reputable news sources when training their large language models.

  • Google, OpenAI, and Meta are among the tech giants placing greater emphasis on high-quality news content for AI training purposes.
  • The study examined open-source replicas of datasets commonly used by AI companies, including Common Crawl, C4, OpenWebText, and OpenWebText2.
  • OpenAI, in particular, gives more weight to high-quality datasets, including news media, copyrighted books, and popular Reddit posts, when training its models.

Quantifying media’s importance in AI development: The study provides concrete figures demonstrating the significant role that news media plays in the creation of AI chatbots and language models.

  • Nearly 13.5% of URLs in WebText2, a dataset used for AI training, come from just 15 top media publishers.
  • The figure indicates how heavily AI companies rely on news content to improve their models’ performance and knowledge base.
  • Despite this reliance, there is currently no legal obligation for AI companies to compensate publishers for the use of their content in training datasets.
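
A share like the one above is, in principle, a ratio: URLs in a dataset that belong to a chosen set of publishers, divided by all URLs in the dataset. The sketch below shows one way such a figure could be computed. It is an illustration only, not the study's methodology, which is not described here. The function name `publisher_share`, the sample URLs, and the publisher domain are all hypothetical placeholders.

```python
from urllib.parse import urlparse


def publisher_share(urls, publisher_domains):
    """Return the fraction of URLs hosted on a publisher domain or one of its subdomains."""
    if not urls:
        return 0.0

    def host(url):
        # hostname is None for malformed URLs, so fall back to an empty string
        return (urlparse(url).hostname or "").lower()

    def is_publisher(h):
        # Match the bare domain or any subdomain such as "www."
        return any(h == d or h.endswith("." + d) for d in publisher_domains)

    hits = sum(1 for u in urls if is_publisher(host(u)))
    return hits / len(urls)


# Hypothetical sample; a real dataset would contain millions of URLs.
sample_urls = [
    "https://www.example-news.com/story/1",
    "https://example-news.com/story/2",
    "https://blog.example.org/post",
    "https://forum.example.net/thread/9",
]
publishers = {"example-news.com"}  # placeholder for a list of top publishers

share = publisher_share(sample_urls, publishers)
print(f"{share:.1%}")  # 50.0% for this toy sample
```

Matching on the hostname rather than the full URL string avoids counting pages that merely mention a publisher's name in their path.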

Potential impact on the publishing industry: The use of news content for AI training without compensation raises concerns about the future of journalism and media organizations.

  • Publishers may face lost licensing revenue as their content is used to train AI models without payment.
  • There are fears that this practice could put some media outlets out of business if they are not adequately compensated for their contributions to AI development.
  • The situation highlights the growing tension between the tech and media industries as AI continues to advance and rely on high-quality content.

Legal landscape and ongoing disputes: Recent legal actions have brought the issue of content usage in AI training to the forefront, with mixed results so far.

  • A federal judge recently dismissed a lawsuit against OpenAI filed by Raw Story and AlterNet, which alleged unauthorized use of their content in AI training.
  • However, a similar case brought by The New York Times against OpenAI is still ongoing, indicating that the legal questions surrounding this issue are far from settled.
  • These legal battles underscore the complex interplay between copyright law, fair use, and the rapidly evolving field of AI technology.

Industry responses and adaptations: Some AI companies are taking steps to address concerns and establish more transparent practices regarding content usage.

  • OpenAI has begun signing licensing deals with certain media companies, potentially setting a precedent for future collaborations between AI developers and content creators.
  • The company’s ChatGPT search feature now includes citations for some sources when summarizing content, increasing transparency and potentially providing a model for attributing information used in AI responses.
  • These actions suggest a growing recognition within the AI industry of the need to address concerns about content usage and attribution.

Implications for future negotiations: The findings of the Ziff Davis study could significantly impact the ongoing debate over content usage in AI training.

  • Media companies may leverage this data to strengthen their arguments for copyright protection or compensation for their content used in AI training.
  • The quantification of news media’s importance in AI development could provide publishers with more bargaining power in negotiations with tech companies.
  • This research may contribute to shaping future policies and industry standards regarding the use of copyrighted content in AI training datasets.

Broader context and potential outcomes: The reliance on news content for AI training highlights the intricate relationship between technology and journalism in the digital age.

  • As AI continues to advance, the value of high-quality, factual content becomes increasingly apparent, potentially leading to a renewed appreciation for professional journalism.
  • The outcome of current legal disputes and industry negotiations could set important precedents for how intellectual property is handled in the age of AI.
  • Balancing the needs of AI development with the sustainability of quality journalism will likely remain a critical challenge for both industries in the coming years.
