What does it do?

  • Speech Recognition
  • Speech Transcription
  • Speech Translation
  • Multilingual Support
  • Noise Robustness

How is it used?

  • Open-source ASR tool converting audio to text via encoder-decoder.
  • Pipeline:
    1. Split audio into 30-second chunks
    2. Convert chunks to log-Mel spectrograms
    3. Pass spectrograms through the encoder
    4. Run the trained decoder
    5. Predict the text caption

Who is it good for?

  • Developers
  • Researchers
  • Accessibility Advocates
  • Journalists
  • Language Learners

What does it cost?

  • Pricing model: Unknown

Details & Features

  • Made By

    OpenAI
  • Released On

    2022-09-21

Whisper is an automatic speech recognition (ASR) system developed by OpenAI that approaches human-level robustness and accuracy in English speech recognition. It leverages a large and diverse dataset of 680,000 hours of multilingual and multitask supervised data collected from the web to achieve strong performance.

Key features:
- Multilingual support for transcribing speech in multiple languages and translating into English
- Robustness to diverse accents and background noise for improved real-world accuracy
- Recognition of technical terms and jargon for enhanced performance in specialized domains
- Strong zero-shot performance across various datasets without fine-tuning
- Open-sourced models and inference code for easy developer integration

How it works:
Whisper uses an end-to-end approach implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into log-Mel spectrograms, and passed through an encoder; a decoder is then trained to predict the corresponding text caption. The decoder is also trained on related tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and speech translation to English.
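The chunking step above can be sketched in a few lines of Python. This is a minimal illustration of step 1 only, assuming Whisper's 16 kHz sample rate and fixed 30-second window; the function name `split_into_chunks` is hypothetical and not part of the Whisper API (the real package handles this internally via `whisper.pad_or_trim`).

```python
import numpy as np

SAMPLE_RATE = 16_000                          # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30                            # fixed 30-second analysis window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk

def split_into_chunks(audio: np.ndarray) -> list:
    """Split a mono waveform into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:        # pad the final partial chunk
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks

# 70 seconds of audio -> three 30-second chunks, the last one zero-padded
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))           # 3 480000
```

Each padded chunk would then be converted to a log-Mel spectrogram and fed to the encoder, as described above.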

Integrations:
As an open-source system, Whisper can be integrated by developers into various applications. No specific product integrations are mentioned.

Use of AI:
Whisper leverages generative AI through its Transformer-based architecture: the decoder generates transcription text token by token, drawing on patterns in speech learned from large-scale supervised training.

AI foundation model:
Whisper is built on a custom encoder-decoder Transformer architecture, the same model family that underlies large language models (LLMs), applied here to sequential speech data rather than text alone.

How to access:
The Whisper models and inference code are open-sourced and publicly available, allowing developers and researchers to use the system in their own speech recognition and natural language processing applications. Whisper was launched on September 21, 2022 by OpenAI, an artificial intelligence research company founded in 2015.
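As a sketch of typical usage (assuming the open-source `openai-whisper` PyPI package; ffmpeg must also be installed, and `audio.mp3` is a placeholder file name):

```shell
# Install the open-source package
pip install -U openai-whisper

# Transcribe a file with the base model
whisper audio.mp3 --model base

# Translate non-English speech into English text
whisper audio.mp3 --model medium --task translate
```

The same functionality is available from Python via `whisper.load_model(...)` and `model.transcribe(...)`.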

  • Supported ecosystems
    Google Colab, GitHub, OpenAI, iOS, Apple, Android, Google


Alternatives

Vocode is an open-source platform that enables users to build, deploy, and scale hyperrealistic voice AI agents.
Otter.ai transcribes meetings, interviews, and lectures in real-time, offering collaboration tools.
AssemblyAI is an AI-powered speech recognition and natural language processing platform that transcribes and analyzes audio data.
Deepgram provides APIs for speech-to-text, text-to-speech, and language understanding for developers.
Notta is an AI notetaker that transcribes, translates, and summarizes meetings in multiple languages.
Cloudmersive's scalable cloud APIs convert audio to text and vice versa using advanced AI and NLP techniques.
Create realistic voiceovers in 1,000+ voices across 142 languages with emotion fine-tuning and voice cloning.