Overview
Whisper is OpenAI’s automatic speech recognition (ASR) system, trained for speech recognition and speech translation tasks. It can transcribe speech audio into text in the language it is spoken (ASR) as well as translate it into English (speech translation). For more information on how these models were trained and evaluated, see the paper. This model card follows the Model Cards for Model Reporting framework.
Model Details
Available Models
There are six model sizes, four of which also have English-only versions:

| Size | Parameters | English-only | Multilingual |
|---|---|---|---|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | | ✓ |
| turbo | 798 M | | ✓ |
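The size and variant options can be sketched as a small lookup (a minimal sketch: `checkpoint_name` is a hypothetical helper, while the `.en` suffix for English-only checkpoints follows the naming convention of the openai-whisper package):

```python
# Model sizes from the table above; English-only variants exist for the
# four smaller sizes, while large and turbo ship multilingual only.
SIZES = {
    "tiny":   {"params_m": 39,   "has_english_only": True},
    "base":   {"params_m": 74,   "has_english_only": True},
    "small":  {"params_m": 244,  "has_english_only": True},
    "medium": {"params_m": 769,  "has_english_only": True},
    "large":  {"params_m": 1550, "has_english_only": False},
    "turbo":  {"params_m": 798,  "has_english_only": False},
}

def checkpoint_name(size: str, english_only: bool = False) -> str:
    """Return the checkpoint name for a size, e.g. "small.en" or "turbo"."""
    info = SIZES[size]
    if english_only:
        if not info["has_english_only"]:
            raise ValueError(f"{size} has no English-only variant")
        return f"{size}.en"
    return size
```

English-only checkpoints tend to perform better when the audio is known to be English, which is why the smaller sizes offer both variants.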
Release Dates
- September 2022: Original model series
- December 2022: `large-v2` released with improvements
- November 2023: `large-v3` released
- September 2024: `large-v3-turbo` (turbo model) optimized for inference speed
Model Type
Sequence-to-sequence Transformer model for:
- Automatic speech recognition (ASR)
- Speech translation
- Spoken language identification
- Voice activity detection
Intended Use
Evaluated Use Cases
The primary intended users of these models are AI researchers studying:
- Robustness and generalization of speech processing systems
- Model capabilities, biases, and constraints
- Large-scale weak supervision approaches
Primary Tasks
The models are primarily trained and evaluated on:
- ASR: Transcribing speech in ~10 languages
- Speech translation to English: Translating non-English speech into English text
Additional Capabilities
Whisper may exhibit additional capabilities, particularly if fine-tuned:
- Voice activity detection
- Speaker classification
- Speaker diarization
Appropriate Use
Recommended uses:
- Transcribing and translating speech
- Improving accessibility tools
- Research on speech processing robustness
- Development of near-real-time speech recognition applications
Training Data
Whisper models are trained on 680,000 hours of audio with corresponding transcripts collected from the internet.

Data Distribution
- 65% (438,000 hours): English audio + English transcripts
- 18% (126,000 hours): Non-English audio + English transcripts (for translation)
- 17% (117,000 hours): Non-English audio + native transcripts in 98 different languages
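As a quick arithmetic check (a sketch; the variable names are illustrative), the three splits sum back to roughly the stated 680,000-hour total, since the percentages and hour counts in the card are rounded:

```python
# Sanity-check the reported training-data splits against the stated total.
total_hours = 680_000
splits = {
    "english_audio_english_transcripts": 438_000,      # ~65%
    "non_english_audio_english_transcripts": 126_000,  # ~18%
    "non_english_audio_native_transcripts": 117_000,   # ~17%
}
combined = sum(splits.values())  # 681,000 hours, within rounding of 680,000
share = {name: hours / combined for name, hours in splits.items()}
```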
Training Methodology
Models are trained using large-scale weak supervision on diverse audio data from the internet. Performance on transcription in a given language is directly correlated with the amount of training data in that language.
Performance and Limitations
Strengths
- Improved robustness to accents, background noise, and technical language
- Zero-shot translation from multiple languages into English
- Near state-of-the-art accuracy on speech recognition and translation
Known Limitations
Hallucinations
Predictions may include text not actually spoken in the audio input. This occurs because the models combine:
- Predicting the next word based on language knowledge
- Transcribing the actual audio
Uneven Language Performance
- Lower accuracy on low-resource languages with less training data
- Disparate performance across different accents and dialects
- Potentially higher word error rates across speakers of different genders, races, ages, or other demographic groups
Repetitive Text Generation
The sequence-to-sequence architecture can generate repetitive text. Mitigation strategies:
- Beam search
- Temperature scheduling
These mitigations help but do not fully eliminate repetition.
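The temperature-scheduling idea can be sketched as a loop that retries decoding at increasing temperature until the output stops looking degenerate (a minimal sketch: `decode_once` is a hypothetical stand-in for a real decoder, and the 2.4 compression-ratio threshold is assumed to mirror the default in the openai-whisper codebase):

```python
import zlib

def looks_repetitive(text: str, max_ratio: float = 2.4) -> bool:
    """Flag text whose compression ratio suggests looping n-grams."""
    raw = text.encode("utf-8")
    if not raw:
        return False
    # Highly repetitive text compresses far better than normal speech.
    return len(raw) / len(zlib.compress(raw)) > max_ratio

def decode_with_fallback(decode_once, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Retry decoding at increasing temperature until output stops degenerating."""
    result = ""
    for t in temperatures:
        result = decode_once(temperature=t)
        if not looks_repetitive(result):
            return result
    return result  # best effort: return the last attempt
```

In a real decoder this repetition heuristic is combined with other checks (e.g. an average log-probability threshold, with beam search at temperature zero); the sketch keeps only the compression-ratio test.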
Full Evaluation
Complete evaluation results are available in the accompanying paper.

Broader Implications
Beneficial Applications
- Accessibility: Improving tools for users with hearing impairments
- Near-real-time applications: Building speech recognition and translation applications
- Democratized ASR: Making speech recognition more accessible to developers