Whisper
What does it do?
- Speech Recognition
- Speech Transcription
- Speech Translation
- Multilingual Support
- Noise Robustness
How is it used?
- Open-source ASR tool converting audio to text via encoder-decoder.
- 1. Split audio
- 2. Convert to spectrograms
- 3. Pass through encoder
- 4. Decode to text
Who is it good for?
- Developers
- Researchers
- Accessibility Advocates
- Journalists
- Language Learners
What does it cost?
- Pricing model: Unknown
Details & Features
Made By
OpenAI
Released On
2022-09-21
Whisper is an automatic speech recognition (ASR) system that transcribes and translates spoken language with high accuracy across multiple languages. This open-source tool, developed by OpenAI, is designed to handle diverse accents, background noise, and technical language, making it suitable for a wide range of real-world applications.
Key features:
- Multilingual Support: Transcribes speech in multiple languages and translates from those languages into English.
- Robustness to Accents and Noise: Trained to handle diverse accents and background noise for improved accuracy in real-world scenarios.
- Technical Language Support: Recognizes technical terms and jargon, enhancing performance in specialized domains.
- Zero-Shot Performance: Demonstrates strong performance across various datasets without specific fine-tuning.
- Ease of Use: Open-source models and inference code allow for simple integration into applications.
How it works:
1. Audio input is split into 30-second chunks.
2. Audio chunks are converted into log-Mel spectrograms.
3. Spectrograms are passed through an encoder.
4. A decoder predicts the corresponding text caption, including special tokens for language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
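The fixed-window chunking in step 1 can be sketched as follows. This is a minimal illustration, not Whisper's actual preprocessing code: the function name and the plain-list audio representation are ours, but the constants match Whisper's real convention of 16 kHz audio split into 30-second windows, with the last window zero-padded to full length.

```python
SAMPLE_RATE = 16_000            # Whisper resamples all input audio to 16 kHz
CHUNK_SECONDS = 30              # each chunk covers a fixed 30-second window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_audio(samples):
    """Split a 1-D sequence of audio samples into fixed 30-second chunks,
    zero-padding the final chunk so every chunk has the same length."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad with silence
        chunks.append(chunk)
    return chunks
```

Each padded chunk is then converted to a log-Mel spectrogram (step 2) before entering the encoder.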
Use of AI:
Whisper uses a Transformer-based encoder-decoder architecture, a form of generative artificial intelligence. This architecture lets the model learn complex patterns in speech data and generate accurate transcriptions.
AI foundation model:
Whisper is built on a custom encoder-decoder Transformer, the same sequence-to-sequence architecture family that underlies large language models (LLMs). This design allows the model to process sequential data such as audio and text effectively.
Target users:
- Developers working on speech recognition applications
- Researchers in natural language processing
How to access:
Whisper is available as open-source models and inference code, allowing developers to integrate it into their applications.
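For a concrete starting point, installation and basic command-line usage look roughly like this (a sketch assuming Python and ffmpeg are already installed; `base` is one of the released model checkpoints):

```shell
# Install the open-source package from PyPI
pip install -U openai-whisper

# Transcribe an audio file in its original language
whisper audio.mp3 --model base

# Translate non-English speech into English
whisper audio.mp3 --model base --task translate
```

The same functionality is exposed as a Python API for embedding in applications.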
Supported ecosystems: Google Colab, GitHub, OpenAI, Apple (iOS), Google (Android)