Cohere launches new AI models to bridge global language divide

Cohere expands multilingual AI capabilities: Cohere has launched two new open-weight models, Aya Expanse 8B and 35B, as part of its Aya project aimed at bridging the global language divide in foundation models.
Key advancements in multilingual AI:
- The Aya Expanse models deliver performance improvements across 23 languages, building on the success of the previously released Aya 101 large language model.
- The 8B parameter model makes breakthroughs more accessible to researchers worldwide, while the 35B parameter model provides state-of-the-art multilingual capabilities.
- Both models are now available on Hugging Face, a popular platform for sharing and accessing AI models.
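For readers who want to try the checkpoints, below is a minimal loading sketch using Hugging Face's transformers library. The repository ID CohereForAI/aya-expanse-8b and the chat-template call are assumptions based on Cohere's typical release conventions, not details confirmed in this article.

```python
# Minimal sketch: loading and prompting Aya Expanse 8B from Hugging Face.
# Assumption (not confirmed here): the repo ID "CohereForAI/aya-expanse-8b"
# and chat-template support in the tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a single GPU
    device_map="auto",
)

# Ask a question in one of the 23 supported languages (here, Spanish).
messages = [{"role": "user", "content": "¿Cuál es la capital de Perú?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```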
Technical innovations driving performance:
- Cohere used data arbitrage, a sampling method that draws synthetic training data from a pool of teacher models rather than a single one, avoiding the gibberish that can result when a model relies on low-quality synthetic data (an illustrative sketch follows this list).
- The company focused on guiding the models toward “global preferences” and accounting for different cultural and linguistic perspectives.
- Preference training and safety measures were extended to a massively multilingual setting, addressing the limitations of Western-centric datasets.
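Cohere has not published the exact pipeline in this article, so the sketch below only illustrates the general idea of arbitrage-style sampling: generate candidate completions from a pool of teacher models and keep the highest-scoring one per prompt. Every function and variable name is hypothetical.

```python
# Illustrative sketch of arbitrage-style sampling. All names are
# hypothetical; this is not Cohere's published pipeline.
from typing import Callable, List, Tuple

Teacher = Callable[[str], str]        # hypothetical: prompt -> completion
Scorer = Callable[[str, str], float]  # hypothetical: (prompt, completion) -> quality


def arbitrage_sample(prompt: str, teachers: List[Teacher], score: Scorer) -> str:
    """Return the best completion across the teacher pool."""
    candidates = [teacher(prompt) for teacher in teachers]
    # Keeping only the max-scoring candidate filters out the gibberish
    # a single weak teacher might produce for a low-resource language.
    return max(candidates, key=lambda c: score(prompt, c))


def build_sft_pairs(
    prompts: List[str], teachers: List[Teacher], score: Scorer
) -> List[Tuple[str, str]]:
    """One curated (prompt, completion) pair per prompt."""
    return [(p, arbitrage_sample(p, teachers, score)) for p in prompts]
```

Selecting a per-prompt winner across teachers is what distinguishes arbitrage-style sampling from distilling a single teacher, which can propagate that one teacher's weaknesses in underrepresented languages.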
Benchmark performance:
- Cohere claims that Aya Expanse models consistently outperformed similar-sized AI models from competitors such as Google, Mistral, and Meta.
- The Aya Expanse 35B model reportedly outperformed Gemma 2 27B, Mixtral 8x22B, and even the much larger Llama 3.1 70B on multilingual benchmarks.
- The smaller 8B model likewise outperformed Gemma 2 9B, Llama 3.1 8B, and Ministral 8B.
Addressing the language gap in AI:
- The Aya project aims to expand access to foundation models in more global languages beyond English.
- Cohere for AI, the company’s research arm, launched the Aya initiative last year and released the Aya dataset to help expand access to other languages for model training.
- The initiative supports research into large language models (LLMs) that perform well in languages other than English, addressing the difficulty of finding diverse language data for model training.
Industry context and challenges:
- Many LLMs eventually become available in widely spoken languages, but training data for less common languages is hard to find.
- English dominance in official, financial, and digital communication makes training data far easier to find in English than in other languages.
- Accurately benchmarking model performance across languages is also difficult, because benchmark translations vary in quality.
Broader industry efforts:
- Other developers, such as OpenAI, have also released multilingual datasets to further research into non-English LLMs.
- OpenAI recently made its Multilingual Massive Multitask Language Understanding (MMMLU) dataset available on Hugging Face, aiming to better test LLM performance across 14 languages.
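The benchmark can be pulled with the Hugging Face datasets library, as sketched below; the repository ID openai/MMMLU, the FR_FR subset name, and the column names are assumptions about how the upload is organized.

```python
# Sketch: loading OpenAI's MMMLU benchmark from Hugging Face.
# The repo ID "openai/MMMLU", the "FR_FR" subset, and the field
# names below are assumptions about how the upload is organized.
from datasets import load_dataset

mmmlu_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
example = mmmlu_fr[0]
print(example["Question"])         # translated question text (assumed field)
print(example["A"], example["B"])  # answer options (assumed fields)
```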
Cohere’s recent developments:
- The company has been actively enhancing its AI offerings, recently adding image search capabilities to Embed 3, its enterprise embedding product used in retrieval-augmented generation (RAG) systems (see the sketch after this list).
- Cohere also improved fine-tuning for its Command R 08-2024 model this month, demonstrating ongoing commitment to advancing AI technology.
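As a rough illustration of how Embed 3's image search slots into a RAG pipeline, the sketch below embeds an image and a text query into the same vector space and compares them by cosine similarity. The model name, the images parameter, and the base64 data-URL format follow Cohere's public embed API documentation, but treat the exact signatures as assumptions to verify against the current SDK.

```python
# Sketch: multimodal retrieval with Embed 3 for a RAG pipeline.
# Parameter names follow Cohere's public embed API around the Embed 3
# image launch; verify against the current SDK before relying on them.
import base64

import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Images are passed as base64 data URLs ("product_photo.jpg" is a
# hypothetical local file).
with open("product_photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

image_emb = co.embed(
    model="embed-english-v3.0",
    input_type="image",
    images=[data_url],
).embeddings[0]

query_emb = co.embed(
    model="embed-english-v3.0",
    input_type="search_query",
    texts=["red running shoes"],
).embeddings[0]

# Cosine similarity between the text query and the stored image vector;
# in a full RAG system, the top-scoring items would be retrieved as context.
similarity = np.dot(query_emb, image_emb) / (
    np.linalg.norm(query_emb) * np.linalg.norm(image_emb)
)
print(f"query-image similarity: {similarity:.3f}")
```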
Implications for global AI accessibility: The release of Aya Expanse models represents a significant step towards democratizing AI across languages, potentially enabling more diverse and inclusive AI applications worldwide. However, challenges remain in data collection and accurate performance evaluation for less-resourced languages, highlighting the need for continued research and collaboration in multilingual AI development.