LLM benchmark compares Phi-4, Qwen2 VL 72B and Aya Expanse 32B, with some unexpected results

A new round of language model benchmarking reveals updated performance metrics for several AI models including Phi-4 variants, Qwen2 VL 72B Instruct, and Aya Expanse 32B using the MMLU-Pro Computer Science benchmark.

Benchmark methodology and scope

The MMLU-Pro Computer Science benchmark evaluates AI models through 410 multiple-choice questions with 10 options each, focusing on complex reasoning rather than just factual recall.

  • Testing was conducted over 103 hours with multiple runs per model to ensure consistency and measure performance variability
  • Results are displayed with error bars showing the standard deviation across test runs (an aggregation sketch follows this list)
  • The benchmark was limited to computer science topics to maintain practical testing timeframes while ensuring relevance for real-world applications
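
The aggregation behind those error bars can be sketched in a few lines. The Python snippet below is a minimal illustration, not the benchmark's actual code; the result format and the placeholder runs are assumptions.

```python
# Minimal sketch: aggregate per-run accuracy into mean +/- standard deviation,
# the statistic behind the error bars described above. The result format and
# the example runs below are placeholders, not actual benchmark data.
import statistics

def run_accuracy(results: list[dict]) -> float:
    """Fraction of questions where the model's answer matches the gold label."""
    correct = sum(1 for r in results if r["predicted"] == r["answer"])
    return correct / len(results)

def summarize_runs(runs: list[list[dict]]) -> tuple[float, float]:
    """Mean and sample standard deviation of accuracy over repeated runs."""
    accuracies = [run_accuracy(r) for r in runs]
    mean = statistics.mean(accuracies)
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, spread

# Two placeholder runs of three questions each.
runs = [
    [{"predicted": "A", "answer": "A"}, {"predicted": "C", "answer": "B"}, {"predicted": "D", "answer": "D"}],
    [{"predicted": "A", "answer": "A"}, {"predicted": "B", "answer": "B"}, {"predicted": "D", "answer": "D"}],
]
mean, spread = summarize_runs(runs)
print(f"accuracy: {mean:.3f} +/- {spread:.3f}")
```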

Key findings on new models

Recent testing revealed varied performance across several new AI models, with some showing unexpected results.

  • Microsoft’s Phi-4 and its variants demonstrated comparable performance, with the GGUF version showing slightly higher accuracy
  • Temperature settings significantly impacted Phi-4’s performance, with optimal results at moderate settings (a temperature-sweep sketch follows this list)
  • Qwen2 VL 72B Instruct showed lower than expected scores, suggesting room for improvement in future versions
  • Aya Expanse 32B, while scoring above 50%, ranked lowest among included models but offers valuable multilingual capabilities
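
The temperature sensitivity noted above can be probed with a simple sweep. The sketch below assumes a locally served model behind an OpenAI-compatible endpoint; the base URL, model name, and sample question are illustrative assumptions, not details from the original test setup.

```python
# Sketch of a temperature sweep against a locally served model through an
# OpenAI-compatible endpoint. The base_url, model name, and question are
# assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

QUESTION = (
    "Which data structure offers average O(1) lookup by key?\n"
    "A) Linked list  B) Hash table  C) Binary heap  D) Stack\n"
    "Answer with a single letter."
)

for temperature in (0.0, 0.3, 0.7, 1.0):
    response = client.chat.completions.create(
        model="phi-4",  # assumed model identifier on the local server
        messages=[{"role": "user", "content": QUESTION}],
        temperature=temperature,
        max_tokens=5,
    )
    answer = response.choices[0].message.content.strip()
    print(f"temperature={temperature}: {answer}")
```

Repeating such a sweep over a full question set and feeding the results into the aggregation sketch above would reproduce the kind of per-temperature comparison the article describes.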

Technical implementation details

The benchmark presentation included innovative visualization techniques to better represent model characteristics.

  • Models were visualized using 3D bars showing MMLU scores, parameter counts, and memory efficiency (a simplified plotting sketch follows this list)
  • For quantized models, bar sections were color-coded to show memory savings compared to full-precision models
  • Multiple evaluation runs were conducted for key models like Claude, Gemini-1.5-pro-002, and Athene-V2-Chat
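
A chart along those lines can be approximated with matplotlib's 3D bar plot. Everything below, the model names and all numbers, is placeholder data meant only to show the shape of the visualization, not the article's results.

```python
# Minimal sketch of a 3D bar chart where bar height encodes benchmark score
# and bar depth encodes memory footprint. All names and numbers are
# placeholders, not actual benchmark results.
import matplotlib.pyplot as plt
import numpy as np

models = ["Model A", "Model B", "Model C"]
scores = [0.60, 0.70, 0.80]   # placeholder accuracy values
params_b = [14, 32, 72]       # placeholder parameter counts (billions)
memory_gb = [10, 20, 45]      # placeholder memory footprints (GB)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

x = np.arange(len(models), dtype=float)   # one bar position per model
y = np.zeros(len(models))
z = np.zeros(len(models))
dx = np.full(len(models), 0.5)            # bar width
dy = np.asarray(memory_gb) / 10.0         # bar depth encodes memory footprint
dz = np.asarray(scores)                   # bar height encodes benchmark score

ax.bar3d(x, y, z, dx, dy, dz, shade=True)
ax.set_xticks(x)
ax.set_xticklabels([f"{m}\n{p}B" for m, p in zip(models, params_b)])
ax.set_zlabel("accuracy")
ax.set_title("Score vs. memory footprint (placeholder data)")
plt.show()
```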

Performance nuances

Testing revealed interesting characteristics specific to certain models.

  • Phi-4 showed improved German language capabilities despite its smaller size
  • Basic prompt engineering could bypass censorship restrictions in the tested models
  • Model consistency varied significantly across temperature settings (a small agreement-rate sketch follows this list)
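
That consistency can be quantified as the share of questions answered identically across repeated runs at a given setting. A tiny sketch, assuming per-run answers are stored as equal-length lists of option letters:

```python
# Sketch: per-question agreement between two runs of the same model, e.g. at
# the same temperature setting. The answer lists are placeholders.
def agreement(run_a: list[str], run_b: list[str]) -> float:
    """Fraction of questions answered identically in both runs."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same questions")
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / len(run_a)

print(agreement(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 0.75
```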

Looking ahead

The benchmark results suggest several trends in AI model development and evaluation methodology that warrant attention in future testing iterations.

  • Multiple test runs are increasingly necessary to establish reliable performance metrics
  • Balancing comprehensive testing with practical time constraints remains a key challenge
  • Future releases, particularly in the Qwen series, may significantly alter the current performance landscape

Source: 🐺🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark
