×
Written by
Published on
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

AI-assisted web scraping with GPT-4o: OpenAI’s new structured outputs feature in their API has opened up exciting possibilities for AI-assisted web scraping, as demonstrated by a recent experiment using GPT-4o.

Initial approach and model selection:

  • The experiment utilized Pydantic models to define the structure for parsed columns and tables
  • A system prompt was crafted to instruct GPT-4o on its role as an expert web scraper
  • GPT-4o outperformed GPT-4o mini in parsing accuracy, leading to its selection for further experimentation

Performance on complex tables:

  • GPT-4o successfully parsed a 10-day weather forecast table from Weather.com, correctly handling varying row sizes and hidden data
  • The model accurately extracted day and night forecasts, demonstrating its ability to interpret complex table structures
  • Interestingly, GPT-4o parsed “Condition” data that was present in the source code but not visible on the website

Challenges with merged rows:

  • Tables from Wikipedia with merged rows posed difficulties for the model
  • Attempts to modify the system prompt to handle collapsed rows were unsuccessful
  • Further experimentation is needed to improve the model’s performance on tables with merged cells

XPath generation approach:

  • To reduce API call costs, an attempt was made to have GPT-4o generate XPaths instead of parsed data
  • This method proved less reliable, often resulting in invalid or incorrect XPaths

Combined extraction and XPath generation:

  • A two-step approach was developed: first extracting data, then using it as a reference to generate XPaths
  • This combined method yielded better results than directly asking for XPaths
  • Retry logic was implemented to handle cases where generated XPaths returned no results
  • Issues arose when the model converted image data to text, causing mismatches in the second step

Cost considerations and optimizations:

  • GPT-4o usage for web scraping can become expensive due to the large amount of text in HTML tables
  • A cleanup function was implemented to remove unnecessary HTML attributes, reducing character count by half without degrading performance
  • Further optimizations, such as generating multiple XPaths per API call, are being considered

Demo and future improvements:

  • A Streamlit demo was created to showcase the AI-assisted web scraping tool
  • Potential enhancements include capturing browser events for better user experience, developing more complex extraction methods for intricate tables, and further optimizing HTML cleanup to reduce API costs

Broader implications: While GPT-4o shows promise in AI-assisted web scraping, the high costs associated with its use present a significant challenge. This experiment highlights both the potential and limitations of using large language models for data extraction tasks, suggesting that further refinements in efficiency and cost-effectiveness are necessary for widespread adoption in web scraping applications.

Using GPT-4o for web scraping

Recent News

Google’s AI Tool ‘Food Mood’ Will Help You Create Mouth-Watering Meals

Google's new AI tool blends cuisines from different countries to create unique recipes for adventurous home cooks.

How AI is Reshaping Holiday Retail Shopping

Retailers embrace AI and social media to attract Gen Z shoppers, while addressing economic concerns and staffing challenges for the upcoming holiday season.

How AI Could Widen — or Bridge — the Global Inequality Gap

The technology's impact on social disparity hinges on ethical development, equitable access, and proactive regulation.