AI Data Scraping Landscape Shifts: OpenAI’s recent licensing agreements with publishers have led to a significant change in how news outlets approach web crawler access, particularly for AI training data collection.
- The initial surge in blocking OpenAI’s GPTBot through robots.txt files has reversed: the share of high-ranking media websites disallowing access has dropped from a peak of over one-third to around a quarter.
- Among the most prominent news outlets, the block rate remains above 50%, but this is down from nearly 90% earlier in the year.
- The trend towards increased blocking appears to have ended, at least temporarily, as more publishers enter into partnerships with OpenAI.
Impact of Licensing Deals: OpenAI’s strategy of forming partnerships with publishers is yielding immediate results in terms of data access.
- Publishers who have struck deals with OpenAI are updating their robots.txt files to permit crawling, often within days or weeks of announcing their partnerships.
- Notable examples include The Atlantic, which unblocked OpenAI’s crawlers on the same day as its deal announcement, and Vox, which permitted access about a month after its partnership news.
- OpenAI has secured agreements with 12 publishers so far, though some, like Time magazine, continue to block GPTBot despite having a deal in place.
Changing Dynamics of Web Crawling: The shift in blocking practices highlights the evolving relationship between AI companies and content creators.
- Robots.txt, while not legally binding, has long been the de facto standard governing web crawler behavior, and most companies honor its instructions as a matter of good practice.
- OpenAI now uses “direct feeds” for data from partner publishers, making robots.txt permissions less critical for these sources.
- Some outlets, like Infowars and The Onion, have unblocked OpenAI’s crawler without announcing partnerships, raising questions about their motivations or potential undisclosed agreements.
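For readers who want to see what "blocking" or "unblocking" a crawler means in practice, the following is a minimal sketch of how a robots.txt file could be checked for a GPTBot block and how a block rate across a set of sites could be tallied. It uses Python's standard `urllib.robotparser`. The function names, the sample domains, and the sample robots.txt contents are hypothetical and purely illustrative; this is not the method used by any tracker mentioned in this article.

```python
from urllib.robotparser import RobotFileParser


def gptbot_blocked(robots_txt: str, agent: str = "GPTBot") -> bool:
    """Return True if the given robots.txt text disallows the site root for `agent`.

    Assumption: only the root path "/" is tested, so a site that disallows
    just a few sections is counted as not blocked.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The host in this URL is a placeholder; only the path is matched against the rules.
    return not parser.can_fetch(agent, "https://example.com/")


def block_rate(robots_by_site: dict[str, str]) -> float:
    """Fraction of sites whose robots.txt blocks GPTBot at the root."""
    if not robots_by_site:
        return 0.0
    blocked = sum(gptbot_blocked(text) for text in robots_by_site.values())
    return blocked / len(robots_by_site)


# Hypothetical sample data, not real publisher files.
SAMPLE = {
    "blocking-outlet.example": "User-agent: GPTBot\nDisallow: /\n",
    "open-outlet.example": "User-agent: *\nAllow: /\n",
    "empty-outlet.example": "",
    "partial-outlet.example": "User-agent: Googlebot\nDisallow: /private/\n",
}

if __name__ == "__main__":
    for site, text in SAMPLE.items():
        print(site, "blocks GPTBot:", gptbot_blocked(text))
    print("block rate:", block_rate(SAMPLE))
```

Under these assumptions, a publisher "unblocking" OpenAI amounts to removing or relaxing the `Disallow` rule that applies to the `GPTBot` user agent. A blanket `User-agent: *` rule with `Disallow: /` would also register as a block in this sketch, since it applies to every crawler that has no more specific entry.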
Industry Perspectives: The changing landscape has prompted various reactions from industry players and experts.
- Originality AI CEO Jon Gillham suggests that OpenAI views being blocked as a threat to its future ambitions, which is driving its push for partnerships.
- The Onion’s CEO Ben Collins strongly denied any business relationship with OpenAI, attributing their unblocking to a recent website migration.
- Data journalist Ben Welsh has been tracking these changes, noting the slight decline in block rates across news outlets.
Future Implications: The current trend raises questions about the long-term strategies of both AI companies and publishers.
- There’s speculation that publishers might use blocking as a bargaining tactic in future negotiations with AI companies.
- The success of OpenAI’s partnership approach could influence how other AI companies pursue data access and training strategies.
- This shift marks a significant change in the industry’s response to AI scraping, moving from widespread blocking to a more nuanced, partnership-driven model.
Broader Context: The evolving situation reflects the complex relationship between AI development and content creation in the digital age.
- The initial rush to block AI crawlers stemmed from concerns about unauthorized use of content for AI training.
- OpenAI’s success in negotiating partnerships demonstrates the potential for collaborative approaches between AI companies and content creators.
- As the AI industry continues to grow, the balance between data access, content rights, and technological advancement remains a critical issue for all stakeholders.