DeepSeek-R1 and OpenAI’s o1 models were tested in real-world data analysis and market research tasks using Perplexity Pro Search to evaluate their practical capabilities beyond standard benchmarks.
Core findings: Side-by-side testing revealed both models have significant capabilities but also notable limitations when handling complex data analysis tasks.
- R1 demonstrated superior transparency in its reasoning process, making it easier to identify and correct errors
- o1 showed slightly better reasoning capabilities but provided less insight into how it reached its conclusions
- Both models struggled with tasks requiring specific data retrieval and multi-step calculations
Investment analysis performance: The models were tasked with calculating returns on investment for the Magnificent Seven stocks across 2024, revealing significant limitations.
- Both models failed to accurately calculate ROI for monthly $140 investments spread across seven major tech stocks
- o1 provided incomplete calculations and incorrect conclusions about returns
- R1’s transparency helped identify that the failure stemmed from inadequate data retrieval rather than reasoning capabilities
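The ROI task described above is simple arithmetic once the prices are in hand, which is why missing data, not reasoning, was the bottleneck. The following is a minimal sketch of one way to compute it, assuming the monthly budget is split evenly across the tickers and that one purchase price per ticker per month is available. The function name `dca_roi`, the tickers, and all prices are hypothetical and for illustration only; they are not taken from the original tests.

```python
def dca_roi(monthly_prices, final_prices, monthly_budget=140.0):
    """Return (total_invested, final_value, roi) for even monthly purchases.

    monthly_prices: dict of ticker -> list of purchase prices, one per month
    final_prices:   dict of ticker -> price used to value the holdings at the end
    """
    tickers = list(monthly_prices)
    per_stock = monthly_budget / len(tickers)  # even split across tickers (assumption)
    months = len(next(iter(monthly_prices.values())))

    shares = {t: 0.0 for t in tickers}
    invested = 0.0
    for m in range(months):
        for t in tickers:
            # Buy fractional shares at that month's price.
            shares[t] += per_stock / monthly_prices[t][m]
            invested += per_stock

    final_value = sum(shares[t] * final_prices[t] for t in tickers)
    roi = (final_value - invested) / invested
    return invested, final_value, roi


# Made-up prices for two hypothetical tickers over three months.
prices = {"AAA": [10.0, 20.0, 40.0], "BBB": [50.0, 50.0, 50.0]}
final = {"AAA": 40.0, "BBB": 50.0}
invested, value, roi = dca_roi(prices, final, monthly_budget=140.0)
print(f"invested={invested:.2f} value={value:.2f} roi={roi:.2%}")
```

Every input here is a price lookup, so a model that retrieves even one wrong or missing monthly price will produce a wrong ROI no matter how sound its reasoning is.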
Data processing capabilities: When provided with direct file input containing stock data, the models showed different approaches to handling structured information.
- o1 suggested manual calculations in Excel rather than performing the analysis
- R1 successfully parsed HTML data and performed calculations but failed to present the final results clearly
- A stock split in Nvidia’s data caused calculation errors, highlighting the models’ sensitivity to unexpected data variations
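To illustrate why an unadjusted split distorts the result, here is a small sketch that rescales pre-split prices before computing a return. It assumes the split date and ratio are known in advance; the helper `adjust_for_split` and the price series are hypothetical, and the 10-for-1 ratio is used only as an example value.

```python
def adjust_for_split(prices, split_index, ratio):
    """Divide every price before split_index by ratio so the series is comparable."""
    return [p / ratio if i < split_index else p for i, p in enumerate(prices)]


# Made-up price series; the split takes effect between index 2 and index 3.
raw = [500.0, 800.0, 1000.0, 120.0, 130.0]
adjusted = adjust_for_split(raw, split_index=3, ratio=10)

# Without adjustment the split looks like a crash.
naive_return = raw[-1] / raw[0] - 1
# With adjustment the series reflects the actual change in value per share held.
adjusted_return = adjusted[-1] / adjusted[0] - 1

print(f"naive={naive_return:.2%} adjusted={adjusted_return:.2%}")
```

A pipeline that checks for sudden price drops matching common split ratios, or that uses split-adjusted prices from the data provider, would avoid this class of error.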
Sports statistics analysis: The models performed better when analyzing NBA player statistics, though still showed room for improvement.
- Both models correctly identified Giannis Antetokounmpo as having the best field goal percentage improvement
- Initial prompts led to the inclusion of irrelevant data for Victor Wembanyama, who as a rookie had no prior NBA season to compare against
- R1 provided more comprehensive results with source attribution and comparison tables
- More specific prompting improved accuracy for both models
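The comparison the models were asked to make can be sketched as follows, assuming field goal percentage is made shots divided by attempts and that a player with no prior-season record should be excluded from an improvement ranking. The function names, player labels, and numbers are all hypothetical, not real NBA statistics.

```python
def fg_pct(made, attempted):
    """Field goal percentage as a fraction (made / attempted)."""
    return made / attempted


def best_fg_improvement(prev, curr):
    """Return (best_player, deltas) for season-over-season FG% change.

    prev, curr: dict of player -> (made, attempted).
    Players absent from prev (e.g. rookies) are skipped, since they
    have no baseline to improve on.
    """
    deltas = {}
    for player, (made, att) in curr.items():
        if player not in prev:
            continue
        p_made, p_att = prev[player]
        deltas[player] = fg_pct(made, att) - fg_pct(p_made, p_att)
    best = max(deltas, key=deltas.get)
    return best, deltas


# Made-up shooting totals for illustration only.
prev_season = {"Player A": (500, 1000), "Player B": (450, 1000)}
curr_season = {
    "Player A": (560, 1000),
    "Player B": (470, 1000),
    "Rookie C": (480, 1000),  # no previous season, so excluded
}
best, deltas = best_fg_improvement(prev_season, curr_season)
print(best, {p: round(d, 3) for p, d in deltas.items()})
```

Stating the exclusion rule explicitly in code mirrors what the more specific prompts did for the models: it removes the ambiguity about who counts as eligible for the comparison.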
Looking ahead: While both models show promise in handling real-world tasks, significant development is still needed for reliable autonomous operation.
- The need for precise prompting remains critical for achieving accurate results
- R1’s transparent reasoning process provides valuable feedback for prompt optimization
- Future iterations, including OpenAI’s upcoming o3 series, may address current limitations in transparency and reliability
- The success of these models often depends on the quality of their data retrieval systems rather than just reasoning capabilities
Practical implications: The testing shows that while these models are powerful tools, they require careful human oversight and clear, specific instructions to produce reliable results, which underscores the continuing importance of human expertise in artificial intelligence applications.
Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks