How to Use GPT-4o for Web Scraping

AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in its API has opened up exciting possibilities for automated data extraction, as demonstrated by a recent web scraping experiment with GPT-4o.

Initial approach and model selection:

  • The experiment utilized Pydantic models to define the structure of the parsed columns and tables (see the sketch after this list)
  • A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
  • GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation
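
A minimal sketch of what such a setup might look like, assuming the openai Python SDK’s structured-output parse helper; the schema fields, prompt wording, and model name here are illustrative assumptions, not the experiment’s actual code:

```python
from typing import List
from pydantic import BaseModel
from openai import OpenAI

# Hypothetical schema: one object per parsed column, plus the table as a whole.
class ParsedColumn(BaseModel):
    name: str          # column header
    values: List[str]  # cell values, top to bottom

class ParsedTable(BaseModel):
    name: str
    columns: List[ParsedColumn]

client = OpenAI()

def parse_table(page_html: str) -> ParsedTable:
    # Structured outputs: the SDK turns the Pydantic model into a JSON schema
    # and constrains the response to conform to it.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "You are an expert web scraper. "
                                          "Extract the main table from the HTML the user provides."},
            {"role": "user", "content": page_html},
        ],
        response_format=ParsedTable,
    )
    return completion.choices[0].message.parsed
```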

Performance on complex tables:

  • GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
  • The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
  • Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

  • Tables from Wikipedia with merged rows posed difficulties for the model
  • Attempts to modify the system prompt to handle merged rows were unsuccessful
  • Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

  • To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data (a schema sketch follows this list)
  • This method proved less reliable, often resulting in invalid or incorrect XPaths
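
As an illustration, the XPath-only approach could use a response schema like the one below, with lxml used to check what the generated expressions actually select; the field names are assumptions:

```python
from typing import List
from lxml import html as lxml_html
from pydantic import BaseModel

# Hypothetical response schema: one XPath per column instead of the cell values.
class ColumnXPath(BaseModel):
    name: str
    xpath: str

class TableXPaths(BaseModel):
    columns: List[ColumnXPath]

def run_xpaths(page_html: str, xpaths: TableXPaths) -> dict:
    # Invalid XPaths raise an XPathEvalError; incorrect ones simply match nothing,
    # which is how the unreliability described above shows up in practice.
    tree = lxml_html.fromstring(page_html)
    return {column.name: tree.xpath(column.xpath) for column in xpaths.columns}
```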

Combined extraction and XPath generation:

  • A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths (sketched after this list)
  • This combined method yielded better results than directly asking for XPaths
  • Retry logic was implemented to handle cases where generated XPaths returned no results
  • Issues arose when the model converted image data to text, causing mismatches in the second step
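
Building on the sketches above (parse_table, ParsedTable, and TableXPaths), the two-step flow with retries might be wired together roughly as follows; the prompt wording and retry budget are assumptions:

```python
from lxml import etree, html as lxml_html

MAX_RETRIES = 3  # assumption: a small fixed retry budget

def xpaths_from_extracted(page_html: str, table: ParsedTable) -> TableXPaths:
    # Step 2: feed the already-extracted values back as a reference, so the model
    # only has to locate data it has already produced.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "For each column of previously extracted data, "
                                          "return an XPath that selects those cells in the HTML."},
            {"role": "user", "content": f"HTML:\n{page_html}\n\nExtracted data:\n{table.model_dump_json()}"},
        ],
        response_format=TableXPaths,
    )
    return completion.choices[0].message.parsed

def extract_then_locate(page_html: str) -> dict:
    table = parse_table(page_html)  # step 1: extract the data itself
    tree = lxml_html.fromstring(page_html)
    results = {}
    for _ in range(MAX_RETRIES):
        xpaths = xpaths_from_extracted(page_html, table)
        try:
            results = {c.name: tree.xpath(c.xpath) for c in xpaths.columns}
        except etree.XPathEvalError:  # treat invalid XPaths like empty results
            results = {}
        if results and all(results.values()):  # retry when any XPath matched nothing
            break
    return results
```

The image-to-text mismatch noted above would surface in a flow like this when extracted values never appear verbatim in the DOM, so no XPath can select them and the retries are exhausted.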

Cost considerations and optimizations:

  • GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
  • A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance (an example cleanup follows this list)
  • Further optimizations, such as generating multiple XPaths per API call, are being considered
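
A sketch of what such a cleanup could look like with lxml; which attributes count as “unnecessary” is an assumption here, since the experiment does not list them:

```python
from lxml import html as lxml_html

# Assumption: only structural attributes are kept; class, style, data-*, etc. are dropped.
KEEP_ATTRIBUTES = {"colspan", "rowspan"}

def clean_html(page_html: str) -> str:
    """Strip attributes that inflate the character (and token) count without
    helping the model understand the table structure."""
    tree = lxml_html.fromstring(page_html)
    for element in tree.iter():
        if not isinstance(element.tag, str):  # skip comments and processing instructions
            continue
        for attribute in list(element.attrib):
            if attribute not in KEEP_ATTRIBUTES:
                del element.attrib[attribute]
    return lxml_html.tostring(tree, encoding="unicode")
```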

Demo and future improvements:

  • A Streamlit demo was created to showcase the AI-assisted web scraping tool (a minimal front-end sketch follows this list)
  • Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs
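
The demo itself is not described in detail; a minimal Streamlit front end over the helpers sketched above might look like this, with the layout and widget choices being assumptions:

```python
import streamlit as st

st.title("AI-assisted web scraping with GPT-4o")

page_html = st.text_area("Paste the HTML of a page containing a table")

if st.button("Extract") and page_html:
    cleaned = clean_html(page_html)   # cut character count before the API call
    table = parse_table(cleaned)      # GPT-4o structured extraction (sketch above)
    st.subheader(table.name)
    st.dataframe({column.name: column.values for column in table.columns})
```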

Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.
