AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in their API has opened up exciting possibilities for AI-assisted web scraping, as demonstrated by a recent experiment using GPT-4o.
Initial approach and model selection:
- The experiment utilized Pydantic models to define the structure for parsed columns and tables
- A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
- GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation
Performance on complex tables:
- GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
- The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
- Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website
Challenges with merged rows:
- Tables from Wikipedia with merged rows posed difficulties for the model
- Attempts to modify the system prompt to handle collapsed rows were unsuccessful
- Further experimentation is needed to improve the model’s performance on tables with merged cells
XPath generation approach:
- To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data
- This method proved less reliable, often resulting in invalid or incorrect XPaths
Combined extraction and XPath generation:
- A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
- This combined method yielded better results than directly asking for XPaths
- Retry logic was implemented to handle cases where generated XPaths returned no results
- Issues arose when the model converted image data to text, causing mismatches in the second step
Cost considerations and optimizations:
- GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
- A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
- Further optimizations, such as generating multiple XPaths per API call, are being considered
Demo and future improvements:
- A Streamlit demo was created to showcase the AI-assisted web scraping tool
- Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs
Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.
Using GPT-4o for web scraping