There’s a new open leaderboard just for Japanese LLMs

A new comprehensive evaluation system for Japanese large language models marks a significant step forward in assessing AI capabilities in one of the world’s major languages.

Project overview: The Open Japanese LLM Leaderboard, a collaborative effort between Hugging Face and LLM-jp, introduces a pioneering evaluation framework for Japanese language models.

  • The initiative addresses a critical gap in LLM assessment by focusing specifically on Japanese language processing capabilities
  • The evaluation system encompasses more than 20 diverse datasets, testing models across multiple Natural Language Processing (NLP) tasks
  • All evaluations utilize a 4-shot prompt format, providing consistent testing conditions across different models
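The 4-shot format prepends four worked examples to each test question before asking the model to answer. A minimal sketch in Python, assuming a hypothetical QA-style dataset (the demonstration pairs and the prompt template below are illustrative, not the leaderboard's actual format):

```python
# Sketch of building a 4-shot prompt. The "質問/回答" template and the
# demonstration pairs are hypothetical examples, not the leaderboard's
# actual prompt format.

def build_few_shot_prompt(examples, question, n_shots=4):
    """Concatenate n_shots demonstration pairs before the target question."""
    lines = []
    for ex in examples[:n_shots]:
        lines.append(f"質問: {ex['question']}")
        lines.append(f"回答: {ex['answer']}")
    lines.append(f"質問: {question}")
    lines.append("回答:")  # the model completes from here
    return "\n".join(lines)

demos = [
    {"question": "日本の首都はどこですか？", "answer": "東京"},
    {"question": "富士山の高さは？", "answer": "3776メートル"},
    {"question": "日本で一番長い川は？", "answer": "信濃川"},
    {"question": "日本の通貨は？", "answer": "円"},
]

prompt = build_few_shot_prompt(demos, "日本最大の湖は？")
print(prompt)
```

Keeping the shot count fixed at four for every model is what makes the scores comparable across the leaderboard.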

Technical infrastructure: The leaderboard’s robust technical foundation combines several cutting-edge tools and platforms to ensure reliable evaluation results.

  • The system leverages Hugging Face’s Inference endpoints for model testing
  • Implementation relies on the llm-jp-eval library and vLLM for efficient processing
  • Japan’s mdx computing platform provides the necessary computational resources
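In outline, the evaluation loop sends each prompt to a hosted model and scores the response. A simplified sketch with the endpoint call stubbed out; in the real pipeline, llm-jp-eval dispatches requests to vLLM-backed Hugging Face Inference endpoints, and `query_model` and the data here are placeholders:

```python
# Illustrative evaluation loop. `query_model` is a stand-in for a call to a
# Hugging Face Inference endpoint; the actual leaderboard runs through the
# llm-jp-eval library with vLLM as the serving backend.

def query_model(prompt):
    # Stub: a real implementation would POST the prompt to an endpoint.
    return "東京"

def exact_match_score(dataset):
    """Fraction of examples whose prediction exactly matches the gold answer."""
    correct = 0
    for example in dataset:
        prediction = query_model(example["prompt"]).strip()
        if prediction == example["answer"]:
            correct += 1
    return correct / len(dataset)

data = [
    {"prompt": "日本の首都はどこですか？", "answer": "東京"},
    {"prompt": "フランスの首都は？", "answer": "パリ"},
]
print(exact_match_score(data))  # 0.5 with this stub
```

Separating the scoring logic from the model call is what lets the same harness evaluate any model exposed behind an endpoint.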

Dataset composition: The evaluation framework incorporates a diverse range of specialized datasets designed to test various aspects of language understanding and generation.

  • Jamp tests temporal inference abilities in a Japanese context
  • JEMHopQA challenges models with multi-hop question answering
  • JMMLU evaluates knowledge across different academic and professional subjects
  • Specialized datasets like chABSA focus on domain-specific tasks such as financial report analysis
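Per-dataset scores are then combined into an overall ranking. A toy aggregation using the dataset names above (the scores and the plain averaging are invented for illustration; the leaderboard computes its metrics via llm-jp-eval):

```python
# Hypothetical per-dataset scores for one model. The numbers are made up
# for illustration; only the dataset names come from the leaderboard.

scores = {
    "Jamp": 0.62,      # temporal inference
    "JEMHopQA": 0.48,  # multi-hop QA
    "JMMLU": 0.55,     # multi-subject knowledge
    "chABSA": 0.41,    # financial report analysis
}

# A simple unweighted mean across datasets.
average = sum(scores.values()) / len(scores)
print(f"Average score: {average:.3f}")
```

Breaking results out per dataset, rather than reporting only the mean, is what surfaces the domain-specific weaknesses noted below.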

Current performance insights: Early results reveal interesting trends in the capabilities of different Japanese language models.

  • Open-source Japanese LLMs are showing competitive performance against closed-source alternatives in general language tasks
  • Domain-specific applications continue to present significant challenges for current models
  • The performance gap between open and closed-source models varies significantly across different types of tasks

Future developments: The leaderboard project has outlined several planned enhancements to expand its evaluation capabilities.

  • Additional datasets will be incorporated to broaden the assessment scope
  • Chain-of-thought evaluation support is under development
  • New metrics will be introduced to provide more comprehensive performance analysis
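Chain-of-thought evaluation asks a model to reason step by step before committing to a final answer. A hypothetical sketch of such a prompt (the instruction wording and format are assumptions; the leaderboard's chain-of-thought support is still under development):

```python
# Hypothetical chain-of-thought prompt variant. The instruction text is an
# assumption, not the leaderboard's actual (still in-development) format.

def build_cot_prompt(question):
    """Ask the model to reason step by step before answering."""
    return (
        f"質問: {question}\n"
        "ステップごとに考えてから、最後に答えを出してください。\n"
        "思考:"  # the model's reasoning is generated from here
    )

prompt = build_cot_prompt("2の10乗はいくつですか？")
print(prompt)
```

Scoring such outputs requires extracting the final answer from the generated reasoning, which is one reason chain-of-thought support needs dedicated evaluation machinery.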

Looking ahead: The leaderboard represents a crucial step in advancing Japanese NLP, though the uneven results across tasks suggest that significant work remains before models achieve consistently high performance in every domain.

