Sarvam AI has developed India’s first open-source audio language model, Shuka v1, by integrating Meta’s Llama model to process voice queries across multiple Indian languages.
Project overview: Shuka v1 pairs Llama’s language processing capabilities with a custom audio encoder to handle voice interactions in ten Indian languages.
- The system uses Llama as a decoder to process audio tokens generated by Sarvam’s in-house audio encoder, Saaras v1
- Shuka v1 can accurately interpret and respond to voice queries in languages including Gujarati, Hindi, Kannada, and Marathi
- The open-source nature of the model allows government departments and regulated industries to deploy it on-premises, ensuring data privacy
Technical architecture: Shuka v1 chains three components: an audio encoder, a projector layer, and Llama as the decoder.
- The team selected the 8B-Instruct version of Llama 3 for its balance of computational efficiency and accuracy
- A custom 60M-parameter projector layer was developed to map the encoder’s audio representations into Llama’s embedding space
- Both Llama and the Saaras v1 encoder stay frozen; only the projector layer is fine-tuned, which keeps training costs low
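
The frozen-encoder, trainable-projector, frozen-decoder arrangement described above can be sketched in PyTorch. This is a minimal illustration, not Sarvam’s code: `ToyAudioEncoder` and `ToyDecoder` are hypothetical stand-ins for Saaras v1 and Llama 3 8B-Instruct, all dimensions are made up so the example runs in seconds, and the per-frame cross-entropy loss is a simplification of real answer-token training.

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions); the article does not give the real dimensions.
MEL_DIM, AUDIO_DIM, LLM_DIM, VOCAB = 80, 32, 64, 100


class ToyAudioEncoder(nn.Module):
    """Hypothetical stand-in for the Saaras v1 audio encoder."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(MEL_DIM, AUDIO_DIM)

    def forward(self, frames):
        return self.net(frames)


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for Llama, consuming embeddings directly."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(LLM_DIM, LLM_DIM)
        self.lm_head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, embeds):
        return self.lm_head(torch.tanh(self.layer(embeds)))


def freeze(module):
    """Disable gradients for every parameter and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


torch.manual_seed(0)
encoder = freeze(ToyAudioEncoder())
decoder = freeze(ToyDecoder())

# The only trainable part: maps audio representations into the decoder's
# embedding space.
projector = nn.Sequential(
    nn.Linear(AUDIO_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# One toy training step on random data: (batch, time, features).
frames = torch.randn(2, 10, MEL_DIM)
targets = torch.randint(0, VOCAB, (2, 10))

with torch.no_grad():  # the encoder is frozen, so no graph is needed here
    audio_repr = encoder(frames)

logits = decoder(projector(audio_repr))  # gradients flow through the decoder
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Frozen parameters never receive gradients; projector parameters do.
frozen_with_grad = [
    p for m in (encoder, decoder) for p in m.parameters() if p.grad is not None
]
print("frozen params with gradients:", len(frozen_with_grad))
```

Because only `projector.parameters()` is handed to the optimizer and the other two modules have `requires_grad` disabled, the backward pass still runs through the decoder but updates nothing outside the projector.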
Implementation approach: Training was designed to keep compute requirements low while adapting the model to Indian languages.
- Fine-tuning focused exclusively on the projector layer to minimize computational requirements
- Training used carefully curated question-answer pairs drawn from Indian-language datasets
- At inference, Llama turns the projected audio tokens into contextually relevant and linguistically accurate responses
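
To put the compute savings in perspective, the short calculation below estimates what share of the parameters is actually trained. It is a rough sketch that treats "8B" as exactly 8 billion and uses the 60M projector figure quoted above. The encoder’s size is not stated, so it is left out; including it would only make the trainable share smaller.

```python
# Nominal figures: "8B" taken as 8 billion, 60M as quoted for the projector.
LLAMA_PARAMS = 8_000_000_000   # frozen
PROJECTOR_PARAMS = 60_000_000  # fine-tuned


def trainable_fraction(trainable: int, frozen: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


frac = trainable_fraction(PROJECTOR_PARAMS, LLAMA_PARAMS)
print(f"Trainable share of parameters: {frac:.2%}")  # prints 0.74%
```

Under these assumptions, well under 1% of the weights are updated during fine-tuning.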
Market impact: The development addresses a clear need in the Indian market, where voice-based interactions are often preferred over text.
- Businesses can now communicate more effectively with customers in multiple Indian languages
- The system enables new applications in education and customer support
- The open-source nature of Shuka v1 makes it accessible for widespread adoption across various sectors
Future trajectory: Sarvam treats Shuka v1 as a starting point for broader multilingual voice AI.
- Sarvam plans to leverage future versions of Llama to enhance Shuka’s capabilities
- The team aims to support additional languages and incorporate larger training datasets
- The project demonstrates the potential for creating sophisticated AI solutions tailored to specific regional needs and linguistic requirements
Enlisting Llama in India’s first open source audio language model