
AI Data Scraping Landscape Shifts: OpenAI’s recent licensing agreements with publishers have led to a significant change in how news outlets approach web crawler access, particularly for AI training data collection.

  • The initial surge in blocking OpenAI’s GPTBot through robots.txt files has reversed, with the number of high-ranking media websites disallowing access dropping from a peak of over one-third to around a quarter.
  • Among the most prominent news outlets, the block rate remains above 50%, but this is down from nearly 90% earlier in the year.
  • The trend towards increased blocking appears to have ended, at least temporarily, as more publishers enter into partnerships with OpenAI.
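The blocking described above is implemented with a short stanza in each site's robots.txt file. GPTBot is OpenAI's documented crawler user agent; a typical entry that disallows it site-wide looks like this (the blanket Disallow is illustrative — a publisher could also block only specific paths):

```
# Blocks OpenAI's training crawler across the whole site; other bots are unaffected.
User-agent: GPTBot
Disallow: /
```

Unblocking is equally simple: a partner publisher deletes the stanza or empties the rule (`Disallow:`), which re-permits crawling.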

Impact of Licensing Deals: OpenAI’s strategy of forming partnerships with publishers is quickly translating into restored access to their content.

  • Publishers who have struck deals with OpenAI are updating their robots.txt files to permit crawling, often within days or weeks of announcing their partnerships.
  • Notable examples include The Atlantic, which unblocked OpenAI’s crawlers the same day its deal was announced, and Vox, which restored access about a month after announcing its partnership.
  • OpenAI has secured agreements with 12 publishers so far, though some, like Time magazine, continue to block GPTBot despite having a deal in place.

Changing Dynamics of Web Crawling: The shift in blocking practices highlights the evolving relationship between AI companies and content creators.

  • Robots.txt, while not legally binding, has long been the de facto standard governing web crawler behavior, and most companies honor its directives as a matter of good practice.
  • OpenAI now uses “direct feeds” for data from partner publishers, making robots.txt permissions less critical for these sources.
  • Some outlets, like Infowars and The Onion, have unblocked OpenAI’s crawler without announcing partnerships, raising questions about their motivations or potential undisclosed agreements.
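For readers who want to check a site’s stance themselves, Python’s standard library can parse a robots.txt file and answer whether GPTBot may fetch a given URL. A minimal sketch — the robots.txt contents and URLs here are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher that blocks GPTBot but no one else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is disallowed everywhere; other crawlers are not.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

In practice you would point the parser at a live site with `set_url(...)` and `read()` instead of parsing a string, but the matching logic is the same.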

Industry Perspectives: The changing landscape has prompted various reactions from industry players and experts.

  • Originality AI CEO Jon Gillham suggests that OpenAI views being blocked as a threat to its future ambitions, driving its push for partnerships.
  • The Onion’s CEO Ben Collins strongly denied any business relationship with OpenAI, attributing their unblocking to a recent website migration.
  • Data journalist Ben Welsh has been tracking these changes, noting the slight decline in block rates across news outlets.
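Welsh’s exact methodology isn’t reproduced here, but a block-rate survey like the one he maintains can be sketched in a few lines: fetch each outlet’s robots.txt and tally how many disallow GPTBot. The outlet names and file contents below are stand-ins:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample: domain -> robots.txt contents.
SAMPLE_ROBOTS = {
    "outlet-a.example": "User-agent: GPTBot\nDisallow: /",
    "outlet-b.example": "User-agent: *\nAllow: /",
    "outlet-c.example": "User-agent: GPTBot\nDisallow: /",
    "outlet-d.example": "User-agent: *\nDisallow:",  # empty Disallow permits all
}

def blocks_gptbot(robots_txt: str, domain: str) -> bool:
    """True if this robots.txt disallows GPTBot from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("GPTBot", f"https://{domain}/")

blocked = sum(blocks_gptbot(txt, d) for d, txt in SAMPLE_ROBOTS.items())
rate = blocked / len(SAMPLE_ROBOTS)
print(f"{blocked}/{len(SAMPLE_ROBOTS)} outlets block GPTBot ({rate:.0%})")
# → 2/4 outlets block GPTBot (50%)
```

A real survey would download each file over HTTP and run against hundreds of domains; only the tallying logic is shown here.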

Future Implications: The current trend raises questions about the long-term strategies of both AI companies and publishers.

  • There’s speculation that publishers might use blocking as a bargaining tactic in future negotiations with AI companies.
  • The success of OpenAI’s partnership approach could influence how other AI companies pursue data access and training strategies.
  • This shift moves the industry from a broadly unified blocking posture to a more nuanced, partnership-driven model of responding to AI scraping.

Broader Context: The evolving situation reflects the complex relationship between AI development and content creation in the digital age.

  • The initial rush to block AI crawlers stemmed from concerns about unauthorized use of content for AI training.
  • OpenAI’s success in negotiating partnerships demonstrates the potential for collaborative approaches between AI companies and content creators.
  • As the AI industry continues to grow, the balance between data access, content rights, and technological advancement remains a critical issue for all stakeholders.
The Race to Block OpenAI’s Scraping Bots Is Slowing Down