How to Use GPT-4o for Web Scraping

AI-assisted web scraping with GPT-4o: The structured outputs feature in OpenAI’s API lets a model return data that conforms to a predefined schema, which makes it a natural fit for AI-assisted web scraping, as a recent experiment with GPT-4o demonstrates.

Initial approach and model selection:

The experiment utilized Pydantic models to define the structure for parsed columns and tables
A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation
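
To make the column and table structure concrete, here is a minimal sketch of what such a schema might look like. The experiment used Pydantic models together with the API's structured outputs. The sketch uses standard-library dataclasses only to stay dependency-free, and the class names, field names, prompt wording, and sample reply are all assumptions rather than the author's actual code or data.

```python
import json
from dataclasses import dataclass

# Hypothetical wording; the prompt used in the experiment is not given here.
SYSTEM_PROMPT = (
    "You are an expert web scraper. Extract the table from the given HTML "
    "and return it as JSON that matches the schema."
)


@dataclass
class ParsedColumn:
    name: str
    values: list[str]


@dataclass
class ParsedTable:
    name: str
    columns: list[ParsedColumn]


def parse_table_reply(reply_json: str) -> ParsedTable:
    """Check a JSON reply against the schema and build a ParsedTable."""
    data = json.loads(reply_json)
    columns = []
    for col in data["columns"]:
        values = col["values"]
        if not all(isinstance(v, str) for v in values):
            raise ValueError(f"column {col['name']!r} has non-string values")
        columns.append(ParsedColumn(name=col["name"], values=list(values)))
    return ParsedTable(name=data["name"], columns=columns)


# Made-up reply standing in for what the model might return.
reply = json.dumps({
    "name": "10-day forecast",
    "columns": [
        {"name": "Day", "values": ["Mon", "Tue"]},
        {"name": "High", "values": ["71", "68"]},
    ],
})
table = parse_table_reply(reply)
```

Storing the table column by column keeps the schema fixed no matter how many columns a page has, which is one plausible reason to model columns and tables separately.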

Performance on complex tables:

GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

Tables from Wikipedia with merged rows posed difficulties for the model
Attempts to modify the system prompt to handle collapsed rows were unsuccessful
Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

To reduce API call costs, GPT-4o was asked to generate XPaths instead of parsed data, since an XPath can be reused on later scrapes without another model call
This method proved less reliable, often resulting in invalid or incorrect XPaths

Combined extraction and XPath generation:

A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
This combined method yielded better results than directly asking for XPaths
Retry logic was implemented to handle cases where generated XPaths returned no results
Issues arose when the model converted images to text during extraction: those values did not appear verbatim in the HTML, so the XPath step had nothing to match against
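
The two-step approach with retry logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the experiment's implementation: `find_xpath_with_retry` and `fake_model` are hypothetical names, the model call is replaced by a stand-in that returns canned answers, the sample table is made up, and the standard library's `xml.etree.ElementTree` is used for evaluation even though it supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def find_xpath_with_retry(root, expected, propose_xpath, max_attempts=3):
    """Return an XPath whose matches equal `expected`, or None if attempts run out."""
    feedback = None  # describes what went wrong on the previous attempt
    for _ in range(max_attempts):
        xpath = propose_xpath(expected, feedback)
        try:
            values = [(el.text or "").strip() for el in root.findall(xpath)]
        except SyntaxError:
            # ElementTree raises SyntaxError for paths it rejects.
            feedback = f"{xpath} is not a valid path"
            continue
        if not values:
            feedback = f"{xpath} returned no results"
        elif values != expected:
            feedback = f"{xpath} returned {values}, expected {expected}"
        else:
            return xpath
    return None


# Made-up table standing in for a scraped page.
root = ET.fromstring(
    "<table>"
    "<tr><th>Day</th><th>High</th></tr>"
    "<tr><td>Mon</td><td>71</td></tr>"
    "<tr><td>Tue</td><td>68</td></tr>"
    "</table>"
)

# Stand-in for the model: a wrong guess first, then a correct one.
canned = iter([".//caption", ".//tr/td[1]"])
seen_feedback = []


def fake_model(expected, feedback):
    seen_feedback.append(feedback)
    return next(canned)


# Step 1 (extraction) is assumed to have produced ["Mon", "Tue"] already.
xpath = find_xpath_with_retry(root, ["Mon", "Tue"], fake_model)
```

Passing the previous failure back as `feedback` gives the model something to correct on the next attempt. Comparing against the extracted values also shows why image-derived text causes trouble: a value that never appears in the HTML can never be matched.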

Cost considerations and optimizations:

GPT-4o usage for web scraping can become expensive because the API is billed per token and HTML tables contain a large amount of markup
A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
Further optimizations, such as generating multiple XPaths per API call, are being considered
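
A cleanup step of the kind described above might look like the following sketch, built on the standard library's `html.parser`. The names `AttributeStripper` and `strip_attributes` are hypothetical, and the choice to keep `colspan` and `rowspan` is an assumption (they carry table structure); the summary does not say which attributes the original function kept. HTML comments are dropped as a side effect.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: span attributes describe table structure, so they are kept.
KEEP_ATTRS = {"colspan", "rowspan"}


class AttributeStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except those in KEEP_ATTRS."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_open(self, tag, attrs, closing=""):
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, closing="/")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_attributes(html: str) -> str:
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up snippet; real pages carry far more attribute noise.
raw = (
    '<table class="wx" data-id="7"><tr style="color:red">'
    '<td colspan="2" class="c">Mon &amp; Tue</td></tr></table>'
)
cleaned = strip_attributes(raw)
```

Comparing `len(cleaned)` with `len(raw)` gives a quick estimate of the savings on a given page before any text is sent to the API.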

Demo and future improvements:

A Streamlit demo was created to showcase the AI-assisted web scraping tool
Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs

Broader implications: GPT-4o shows promise for AI-assisted web scraping, but its cost remains a significant obstacle. The experiment highlights both the potential and the limits of using large language models for data extraction, and suggests that efficiency and cost-effectiveness need to improve before the approach sees widespread adoption.

Using GPT-4o for web scraping

Eduardo Blancas
