×
How to Use GPT-4o for Web Scraping
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in their API has opened up exciting possibilities for AI-assisted web scraping, as demonstrated by a recent experiment using GPT-4o.

Initial approach and model selection:

  • The experiment utilized Pydantic models to define the structure for parsed columns and tables
  • A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
  • GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation

Performance on complex tables:

  • GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
  • The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
  • Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

  • Tables from Wikipedia with merged rows posed difficulties for the model
  • Attempts to modify the system prompt to handle collapsed rows were unsuccessful
  • Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

  • To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data
  • This method proved less reliable, often resulting in invalid or incorrect XPaths

Combined extraction and XPath generation:

  • A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
  • This combined method yielded better results than directly asking for XPaths
  • Retry logic was implemented to handle cases where generated XPaths returned no results
  • Issues arose when the model converted image data to text, causing mismatches in the second step

Cost considerations and optimizations:

  • GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
  • A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
  • Further optimizations, such as generating multiple XPaths per API call, are being considered

Demo and future improvements:

  • A Streamlit demo was created to showcase the AI-assisted web scraping tool
  • Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs

Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.

Using GPT-4o for web scraping

Recent News

Study reveals 4 ways AI is transforming sexual wellness

AI-powered tools offer relationship advice rated more empathetic than human responses.

In the Money: Google tests interactive finance charts in AI Mode for stock comparisons

Finance queries serve as Google's testing ground for broader data visualization across other subjects.

30 mathematicians met in secret to stump OpenAI. They (mostly) failed.

Mathematicians may soon shift from solving problems to collaborating with AI reasoning bots.