×
How to Use GPT-4o for Web Scraping
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in their API has opened up exciting possibilities for AI-assisted web scraping, as demonstrated by a recent experiment using GPT-4o.

Initial approach and model selection:

  • The experiment utilized Pydantic models to define the structure for parsed columns and tables
  • A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
  • GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation

Performance on complex tables:

  • GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
  • The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
  • Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

  • Tables from Wikipedia with merged rows posed difficulties for the model
  • Attempts to modify the system prompt to handle collapsed rows were unsuccessful
  • Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

  • To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data
  • This method proved less reliable, often resulting in invalid or incorrect XPaths

Combined extraction and XPath generation:

  • A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
  • This combined method yielded better results than directly asking for XPaths
  • Retry logic was implemented to handle cases where generated XPaths returned no results
  • Issues arose when the model converted image data to text, causing mismatches in the second step

Cost considerations and optimizations:

  • GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
  • A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
  • Further optimizations, such as generating multiple XPaths per API call, are being considered

Demo and future improvements:

  • A Streamlit demo was created to showcase the AI-assisted web scraping tool
  • Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs

Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.

Using GPT-4o for web scraping

Recent News

Tim Cook tells Apple staff AI is “as big as the internet”

The rare all-hands meeting signals mounting pressure as talent flees to competitors.

Google adds 4 new AI search features including image analysis

Desktop users can now upload PDFs and images for instant AI analysis.

Take that, Oppenheimer: Meta offers AI researcher $250M over 4 years in talent war

Young researchers now hire agents and share negotiation strategies in private chat groups.