The race to block OpenAI’s web crawlers is slowing

AI Data Scraping Landscape Shifts: OpenAI’s recent licensing agreements with publishers have led to a significant change in how news outlets approach web crawler access, particularly for AI training data collection.

  • The initial surge in blocking OpenAI’s GPTBot through robots.txt files has reversed, with the share of high-ranking media websites disallowing access dropping from a peak of over one-third to around a quarter (the blocking rule itself is shown in the snippet after this list).
  • Among the most prominent news outlets, the block rate remains above 50%, but this is down from nearly 90% earlier in the year.
  • The trend towards increased blocking appears to have ended, at least temporarily, as more publishers enter into partnerships with OpenAI.
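
For concreteness, the blocking described above comes down to a short rule in a site’s robots.txt file. The snippet below is a minimal illustration: GPTBot is the user agent OpenAI documents for its training crawler, and the exact rules on any given publisher’s site may differ.

```
User-agent: GPTBot
Disallow: /
```

Removing or narrowing that Disallow rule is, in practice, what “unblocking” a crawler means.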

Impact of Licensing Deals: OpenAI’s strategy of forming partnerships with publishers is yielding immediate results in terms of data access.

  • Publishers who have struck deals with OpenAI are updating their robots.txt files to permit crawling, often within days or weeks of announcing their partnerships.
  • Notable examples include The Atlantic, which unblocked OpenAI’s crawlers on the same day as its deal announcement, and Vox, which permitted access about a month after its partnership news.
  • OpenAI has secured agreements with 12 publishers so far, though some, like Time magazine, continue to block GPTBot despite having a deal in place.

Changing Dynamics of Web Crawling: The shift in blocking practices highlights the evolving relationship between AI companies and content creators.

  • Robots.txt, while not legally binding, has long been the accepted standard governing web crawler behavior, and most companies honor its instructions as a matter of good practice (a compliance check is sketched after this list).
  • OpenAI now uses “direct feeds” for data from partner publishers, making robots.txt permissions less critical for these sources.
  • Some outlets, like Infowars and The Onion, have unblocked OpenAI’s crawler without announcing partnerships, raising questions about their motivations or potential undisclosed agreements.
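
As a rough illustration of that voluntary compliance, the sketch below uses Python’s standard-library robots.txt parser to decide whether a crawler identifying as GPTBot may fetch a page. The domain and path are hypothetical placeholders; this is not OpenAI’s actual crawler code.

```python
# Minimal sketch of how a well-behaved crawler consults robots.txt before
# fetching a page. The publisher domain and article path are placeholders.
from urllib import robotparser

site = "https://www.example-publisher.com"   # hypothetical publisher domain
page = f"{site}/2024/06/some-article"        # hypothetical article URL

rp = robotparser.RobotFileParser()
rp.set_url(f"{site}/robots.txt")
rp.read()                                    # download and parse the rules

if rp.can_fetch("GPTBot", page):
    print("robots.txt permits GPTBot to fetch this page")
else:
    print("robots.txt disallows GPTBot; a compliant crawler skips the page")
```

Nothing in the protocol enforces the result of that check, which is part of why licensing deals and direct feeds matter to both sides.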

Industry Perspectives: The changing landscape has prompted various reactions from industry players and experts.

  • Originality AI CEO Jon Gillham suggests that OpenAI views being blocked as a threat to its future ambitions, which is driving its push for partnerships.
  • The Onion’s CEO Ben Collins strongly denied any business relationship with OpenAI, attributing their unblocking to a recent website migration.
  • Data journalist Ben Welsh has been tracking these changes and notes a slight decline in block rates across news outlets (a simplified version of that kind of survey is sketched below).
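
The tracking mentioned in the last item boils down to repeating the robots.txt check across many outlets and tallying the results. The sketch below is a simplified, hypothetical version: the domains are placeholders rather than any tracker’s actual sample, and real surveys handle redirects, timeouts, caching, and sampling far more carefully.

```python
# Rough sketch of a block-rate survey: for each outlet, read robots.txt and
# count how many disallow GPTBot from the site root. Domains are placeholders.
from urllib import robotparser

outlets = [
    "https://www.example-news.com",
    "https://www.example-magazine.com",
    "https://www.example-daily.com",
]

blocked = 0
checked = 0
for site in outlets:
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    try:
        rp.read()                        # fetch and parse the live rules
    except OSError:
        continue                         # unreachable robots.txt: skip it
    checked += 1
    if not rp.can_fetch("GPTBot", f"{site}/"):
        blocked += 1

if checked:
    print(f"{blocked}/{checked} reachable outlets block GPTBot "
          f"({blocked / checked:.0%})")
```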

Future Implications: The current trend raises questions about the long-term strategies of both AI companies and publishers.

  • There’s speculation that publishers might use blocking as a bargaining tactic in future negotiations with AI companies.
  • The success of OpenAI’s partnership approach could influence how other AI companies pursue data access and training strategies.
  • This shift marks a significant change in the industry-wide response to AI scraping, moving from broad-based blocking toward a more nuanced, partnership-driven model.

Broader Context: The evolving situation reflects the complex relationship between AI development and content creation in the digital age.

  • The initial rush to block AI crawlers stemmed from concerns about unauthorized use of content for AI training.
  • OpenAI’s success in negotiating partnerships demonstrates the potential for collaborative approaches between AI companies and content creators.
  • As the AI industry continues to grow, the balance between data access, content rights, and technological advancement remains a critical issue for all stakeholders.

Source: The Race to Block OpenAI’s Scraping Bots Is Slowing Down