Beyond the benchmarks: How DeepSeek-R1 and OpenAI’s o1 stack up on real-world challenges

DeepSeek-R1 and OpenAI’s o1 were tested on real-world data analysis and market research tasks using Perplexity Pro Search to evaluate their practical capabilities beyond standard benchmarks.

Core findings: Side-by-side testing revealed both models have significant capabilities but also notable limitations when handling complex data analysis tasks.

  • R1 demonstrated superior transparency in its reasoning process, making it easier to identify and correct errors
  • o1 showed slightly better reasoning capabilities but provided less insight into how it reached its conclusions
  • Both models struggled with tasks requiring specific data retrieval and multi-step calculations

Investment analysis performance: The models were tasked with calculating returns on investment for the Magnificent Seven stocks across 2024, revealing significant limitations.

  • Both models failed to accurately calculate the ROI of monthly $140 investments spread across seven major tech stocks (the intended calculation is sketched after this list)
  • o1 provided incomplete calculations and incorrect conclusions about returns
  • R1’s transparency helped identify that the failure stemmed from inadequate data retrieval rather than reasoning capabilities
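
For context, the intended calculation is simple to express in code. The sketch below is a minimal, hypothetical version of it: the ticker list, the even $20-per-stock split, and the `monthly_prices` figures are all illustrative assumptions rather than real 2024 data, and fractional shares are assumed to be allowed.

```python
# Minimal sketch of the dollar-cost-averaging ROI calculation the models were
# asked to perform. All prices below are placeholders, not real 2024 data.

TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "META", "NVDA", "TSLA"]
MONTHLY_CONTRIBUTION = 140.0                     # total invested each month
PER_STOCK = MONTHLY_CONTRIBUTION / len(TICKERS)  # $20 per stock, assuming an even split

# Month-end closing prices per ticker (placeholder values; one 12-entry list each).
monthly_prices = {
    "AAPL": [180, 182, 178, 185, 190, 195, 205, 210, 215, 220, 228, 235],
    # ... remaining six tickers omitted for brevity ...
}

def dca_roi(prices: list[float]) -> float:
    """Fractional ROI of equal monthly purchases, valued at the last observed price."""
    shares = sum(PER_STOCK / p for p in prices)  # fractional shares bought each month
    invested = PER_STOCK * len(prices)           # total cash put in
    final_value = shares * prices[-1]            # value at the final month's price
    return (final_value - invested) / invested

for ticker, prices in monthly_prices.items():
    print(f"{ticker}: {dca_roi(prices):+.1%}")
```

A model answering the original prompt also has to retrieve twelve accurate month-end prices per ticker, which is where the testing suggests the failure actually occurred.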

Data processing capabilities: When provided with direct file input containing stock data, the models showed different approaches to handling structured information.

  • o1 suggested manual calculations in Excel rather than performing the analysis
  • R1 successfully parsed HTML data and performed calculations but failed to present the final results clearly
  • A stock split in Nvidia’s data caused calculation errors, highlighting the models’ sensitivity to unexpected data variations (a sketch of the split adjustment follows this list)
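
The split issue is largely a preprocessing problem. Below is a minimal sketch of the kind of adjustment that keeps per-share prices comparable across a split; the file name, column names, split ratio, and effective date are assumptions used for illustration (Nvidia’s 2024 split was 10-for-1, but verify the details against the data you actually load).

```python
# Sketch: load an HTML price table and put the whole series on a post-split
# basis so returns are comparable across the split date. The file name, column
# names, and split parameters below are illustrative assumptions.
import pandas as pd

SPLIT_RATIO = 10            # e.g. a 10-for-1 split
SPLIT_DATE = "2024-06-10"   # assumed effective date; verify against the data

tables = pd.read_html("nvda_2024_prices.html")  # read_html returns a list of DataFrames
prices = tables[0]
prices["Date"] = pd.to_datetime(prices["Date"])

# Divide pre-split closes by the ratio so every row is expressed in post-split shares.
pre_split = prices["Date"] < pd.Timestamp(SPLIT_DATE)
prices.loc[pre_split, "Close"] = prices.loc[pre_split, "Close"] / SPLIT_RATIO

print(prices.sort_values("Date").tail())
```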

Sports statistics analysis: The models performed better when analyzing NBA player statistics, though still showed room for improvement.

  • Both models correctly identified Giannis Antetokounmpo as having the largest field goal percentage improvement (the underlying comparison is sketched after this list)
  • Initial prompts led to the inclusion of irrelevant data for rookie Victor Wembanyama
  • R1 provided more comprehensive results with source attribution and comparison tables
  • More specific prompting improved accuracy for both models
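
For reference, the comparison itself is trivial once the right numbers are in hand; the hard part the testing exposed is retrieving the correct season statistics in the first place. The sketch below assumes field goal percentage is available for two consecutive seasons per player, and its figures are placeholders rather than actual NBA statistics.

```python
# Sketch: rank players by season-over-season field goal percentage improvement.
# The figures are placeholders, not actual NBA statistics.

fg_pct = {
    # player: (previous-season FG%, current-season FG%)
    "Giannis Antetokounmpo": (0.55, 0.61),
    "Player B": (0.48, 0.50),
    "Player C": (0.47, 0.46),
}

improvement = {name: curr - prev for name, (prev, curr) in fg_pct.items()}

for name, delta in sorted(improvement.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1%}")
```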

Looking ahead: While both models show promise in handling real-world tasks, significant development is still needed for reliable autonomous operation.

  • Precise prompting remains critical for achieving accurate results
  • R1’s transparent reasoning process provides valuable feedback for prompt optimization
  • Future iterations, including OpenAI’s upcoming o3 series, may address current limitations in transparency and reliability
  • The success of these models often depends on the quality of their data retrieval systems rather than just reasoning capabilities

Practical implications: The testing reveals that while these models are powerful tools, they require careful human oversight and clear, specific instructions to produce reliable results, underscoring the continuing importance of human expertise in artificial intelligence applications.

