The race to block OpenAI’s web crawlers is slowing

AI Data Scraping Landscape Shifts: OpenAI’s recent licensing agreements with publishers have led to a significant change in how news outlets approach web crawler access, particularly for AI training data collection.

  • The initial surge in blocking OpenAI’s GPTBot through robots.txt files has reversed: the share of high-ranking media websites disallowing access has dropped from a peak of over one-third to around a quarter.
  • Among the most prominent news outlets, the block rate remains above 50%, but this is down from nearly 90% earlier in the year.
  • The trend towards increased blocking appears to have ended, at least temporarily, as more publishers enter into partnerships with OpenAI.

Impact of Licensing Deals: OpenAI’s strategy of forming partnerships with publishers is yielding immediate results in terms of data access.

  • Publishers who have struck deals with OpenAI are updating their robots.txt files to permit crawling, often within days or weeks of announcing their partnerships.
  • Notable examples include The Atlantic, which unblocked OpenAI’s crawlers on the same day as its deal announcement, and Vox, which permitted access about a month after its partnership was announced.
  • OpenAI has secured agreements with 12 publishers so far, though some, like Time magazine, continue to block GPTBot despite having a deal in place.

Changing Dynamics of Web Crawling: The shift in blocking practices highlights the evolving relationship between AI companies and content creators.

  • Robots.txt, while not legally binding, has long been the standard governing web crawler behavior, and most companies honor its instructions as a matter of good practice.
  • OpenAI now uses “direct feeds” for data from partner publishers, making robots.txt permissions less critical for these sources.
  • Some outlets, like Infowars and The Onion, have unblocked OpenAI’s crawler without announcing partnerships, raising questions about their motivations or potential undisclosed agreements.
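
For readers who want to see how a block is detected in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It checks whether a robots.txt body disallows a given crawler and computes the share of sites in a sample that block it. The sample domains and robots.txt contents are hypothetical and invented for illustration, the helper names `blocks_agent` and `block_rate` are not from the article, and this is not necessarily the method used by the trackers cited here.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str = "GPTBot", path: str = "/") -> bool:
    """Return True if this robots.txt text disallows `agent` from fetching `path`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# Hypothetical robots.txt bodies keyed by made-up domains (illustration only).
SAMPLES = {
    "blocks-gptbot.example": "User-agent: GPTBot\nDisallow: /\n",
    "allows-everyone.example": "User-agent: *\nDisallow:\n",
    "blocks-other-bot.example": "User-agent: CCBot\nDisallow: /\n",
    # A wildcard rule also applies to GPTBot, so this counts as a block here.
    "blocks-all.example": "User-agent: *\nDisallow: /\n",
}


def block_rate(samples: dict, agent: str = "GPTBot") -> float:
    """Fraction of sampled sites whose robots.txt disallows `agent` at the root."""
    blocked = sum(blocks_agent(text, agent) for text in samples.values())
    return blocked / len(samples)


if __name__ == "__main__":
    for domain, text in SAMPLES.items():
        print(domain, "blocks GPTBot:", blocks_agent(text))
    print("block rate:", block_rate(SAMPLES))
```

Whether a site-wide wildcard `Disallow` should count as "blocking GPTBot" is a design choice; a tracker interested only in rules that name the crawler explicitly would need to inspect the user-agent lines directly rather than rely on `can_fetch`.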

Industry Perspectives: The changing landscape has prompted various reactions from industry players and experts.

  • Originality AI CEO Jon Gillham suggests that OpenAI views being blocked as a threat to its future ambitions, which is driving its push for partnerships.
  • The Onion’s CEO Ben Collins strongly denied any business relationship with OpenAI, attributing the unblocking to a recent website migration.
  • Data journalist Ben Welsh has been tracking these changes, noting the slight decline in block rates across news outlets.

Future Implications: The current trend raises questions about the long-term strategies of both AI companies and publishers.

  • There’s speculation that publishers might use blocking as a bargaining tactic in future negotiations with AI companies.
  • The success of OpenAI’s partnership approach could influence how other AI companies pursue data access and training strategies.
  • This shift marks a significant change in the industry-wide response to AI scraping, moving from a unified blocking approach to a more nuanced, partnership-driven model.

Broader Context: The evolving situation reflects the complex relationship between AI development and content creation in the digital age.

  • The initial rush to block AI crawlers stemmed from concerns about unauthorized use of content for AI training.
  • OpenAI’s success in negotiating partnerships demonstrates the potential for collaborative approaches between AI companies and content creators.
  • As the AI industry continues to grow, the balance between data access, content rights, and technological advancement remains a critical issue for all stakeholders.
The Race to Block OpenAI’s Scraping Bots Is Slowing Down
