×
Written by
Published on
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in their API has opened up exciting possibilities for AI-assisted web scraping, as demonstrated by a recent experiment using GPT-4o.

Initial approach and model selection:

  • The experiment utilized Pydantic models to define the structure for parsed columns and tables
  • A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
  • GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation

Performance on complex tables:

  • GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
  • The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
  • Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

  • Tables from Wikipedia with merged rows posed difficulties for the model
  • Attempts to modify the system prompt to handle collapsed rows were unsuccessful
  • Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

  • To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data
  • This method proved less reliable, often resulting in invalid or incorrect XPaths

Combined extraction and XPath generation:

  • A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
  • This combined method yielded better results than directly asking for XPaths
  • Retry logic was implemented to handle cases where generated XPaths returned no results
  • Issues arose when the model converted image data to text, causing mismatches in the second step

Cost considerations and optimizations:

  • GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
  • A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
  • Further optimizations, such as generating multiple XPaths per API call, are being considered

Demo and future improvements:

  • A Streamlit demo was created to showcase the AI-assisted web scraping tool
  • Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs

Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.

Using GPT-4o for web scraping

Recent News

AI Tutors Double Student Learning in Harvard Study

Students using an AI tutor demonstrated twice the learning gains in half the time compared to traditional lectures, suggesting potential for more efficient and personalized education.

Lionsgate Teams Up With Runway On Custom AI Video Generation Model

The studio aims to develop AI tools for filmmakers using its vast library, raising questions about content creation and creative rights.

How to Successfully Integrate AI into Project Management Practices

AI-powered tools automate routine tasks, analyze data for insights, and enhance decision-making, promising to boost productivity and streamline project management across industries.