A new round of language model benchmarking on the MMLU-Pro Computer Science benchmark provides updated performance results for several AI models, including Phi-4 and its variants, Qwen2 VL 72B Instruct, and Aya Expanse 32B.
Benchmark methodology and scope: The MMLU-Pro Computer Science benchmark evaluates AI models through 410 multiple-choice questions with 10 options each, focusing on complex reasoning rather than just factual recall.
- Testing was conducted over 103 hours with multiple runs per model to ensure consistency and measure performance variability
- Results are displayed with error bars showing the standard deviation across test runs (a minimal scoring and aggregation sketch follows this list)
- The benchmark was limited to computer science topics to maintain practical testing timeframes while ensuring relevance for real-world applications
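A minimal sketch of how a run of this kind can be scored and aggregated, assuming model replies have already been collected; the answer-extraction regex, record format, and the accuracy numbers in the example are illustrative placeholders, not the author's actual harness:

```python
import re
import statistics

# MMLU-Pro CS uses 10 answer options per question, so valid letters run A through J.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)

def score_run(replies, gold_answers):
    """Accuracy for one benchmark run: fraction of questions answered correctly."""
    correct = 0
    for reply, gold in zip(replies, gold_answers):
        match = ANSWER_RE.search(reply)
        predicted = match.group(1).upper() if match else None  # unparseable replies count as wrong
        if predicted == gold:
            correct += 1
    return correct / len(gold_answers)

def aggregate_runs(accuracies):
    """Mean accuracy and standard deviation across repeated runs (the error bars)."""
    mean = statistics.mean(accuracies)
    stdev = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, stdev

# Placeholder numbers: three repeated runs of the same model over the 410 questions.
mean, stdev = aggregate_runs([0.712, 0.705, 0.718])
print(f"{mean:.1%} ± {stdev:.1%}")
```

Because each run yields a single accuracy over the 410 questions, repeating the run and reporting mean ± standard deviation is what produces the error bars mentioned above.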
Key findings on new models: Recent testing revealed varied performance across several new AI models, with some showing unexpected results.
- Microsoft’s Phi-4 and its variants performed comparably to one another, with the GGUF version showing slightly higher accuracy
- Temperature settings significantly impacted Phi-4’s performance, with optimal results at moderate settings (see the temperature-sweep sketch after this list)
- Qwen2 VL 72B Instruct showed lower than expected scores, suggesting room for improvement in future versions
- Aya Expanse 32B, while scoring above 50%, ranked lowest among the models included, but it offers valuable multilingual capabilities
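The temperature sensitivity noted above can be probed with a simple sweep against any OpenAI-compatible endpoint; the endpoint URL, model name, and sample question below are placeholders rather than the setup used in the benchmark:

```python
from openai import OpenAI

# Any OpenAI-compatible server works (e.g. a local llama.cpp or vLLM instance); URL and key are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

QUESTION = (
    "Which data structure offers O(1) average-case lookup by key?\n"
    "(A) Linked list (B) Hash table (C) Binary search tree (D) Heap\n"
    "Answer with the letter only."
)

# Ask the same question several times at each temperature and compare answer consistency.
for temperature in (0.0, 0.3, 0.7, 1.0):
    answers = []
    for _ in range(5):
        reply = client.chat.completions.create(
            model="phi-4",  # placeholder: whatever name the local server exposes
            messages=[{"role": "user", "content": QUESTION}],
            temperature=temperature,
            max_tokens=5,
        )
        answers.append(reply.choices[0].message.content.strip())
    print(f"T={temperature}: {answers}")
```

Repeating the same question several times per temperature makes it easy to see where answers start to drift, which is the kind of consistency effect reported for Phi-4.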
Technical implementation details: The benchmark presentation included innovative visualization techniques to better represent model characteristics.
- Models were visualized using 3D bars showing MMLU scores, parameter counts, and memory efficiency
- For quantized models, bar sections were color-coded to show memory savings compared to full-precision models (an illustrative matplotlib sketch follows this list)
- Multiple evaluation runs were conducted for key models like Claude, Gemini-1.5-pro-002, and Athene-V2-Chat
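A loose approximation of that kind of chart using matplotlib's 3D bars; the models, scores, parameter counts, and memory shares below are made up for illustration, and the mapping of quantities to bar dimensions is an assumption rather than the original author's exact design:

```python
import matplotlib.pyplot as plt

# Placeholder data: (label, benchmark score in %, parameters in billions, fraction of FP16 memory used).
models = [
    ("Model A FP16", 72.0, 70, 1.00),  # full precision: nothing saved
    ("Model B Q8",   70.5, 72, 0.50),  # 8-bit quant: roughly half of FP16 memory
    ("Model C Q4",   68.0, 32, 0.25),  # 4-bit quant: roughly a quarter of FP16 memory
]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

for i, (label, score, params, mem_share) in enumerate(models):
    depth = params / 100.0                 # bar depth loosely encodes parameter count
    used_part = score * mem_share          # lower segment of the score bar, sized by memory used
    saved_part = score * (1 - mem_share)   # upper segment, sized by memory saved vs. FP16
    ax.bar3d(i, 0, 0, 0.6, depth, used_part, color="tab:blue")
    if saved_part > 0:
        ax.bar3d(i, 0, used_part, 0.6, depth, saved_part, color="lightgray", alpha=0.5)

ax.set_xticks(range(len(models)))
ax.set_xticklabels([m[0] for m in models])
ax.set_zlabel("MMLU-Pro CS score (%)")
plt.show()
```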
Performance nuances: Testing revealed interesting characteristics specific to certain models.
- Phi-4 showed improved German language capabilities despite its smaller size
- Basic prompt engineering could bypass censorship restrictions in tested models
- Model consistency varied significantly across different temperature settings
Looking ahead: The benchmark results suggest several trends in AI model development and evaluation methodologies that warrant attention in future testing iterations.
- The need for multiple test runs to establish reliable performance metrics is becoming increasingly important (a back-of-the-envelope noise estimate follows this list)
- Balancing comprehensive testing with practical time constraints remains a key challenge
- Future releases, particularly in the Qwen series, may significantly alter the current performance landscape
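A back-of-the-envelope estimate (not from the original post) of why repeated runs matter: with only 410 questions, a single run's measured accuracy carries noticeable binomial sampling noise.

```python
import math

# With n questions answered independently at true accuracy p, the standard error
# of the measured accuracy is sqrt(p * (1 - p) / n).
n = 410
for p in (0.50, 0.60, 0.70, 0.80):
    se = math.sqrt(p * (1 - p) / n)
    print(f"true accuracy {p:.0%}: ±{1.96 * se:.1%} (approx. 95% interval from a single run)")
```

At these accuracies the single-run uncertainty is roughly ±4 to 5 percentage points, which is why closely scoring models can swap ranks between runs and why averaging over multiple runs gives more reliable rankings.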
🐺🐦⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark