
Model Sizes and Speed

Whisper offers six model sizes with different speed and accuracy tradeoffs. Below are the available models and their approximate memory requirements and inference speed relative to the large model.
The relative speeds are measured by transcribing English speech on an A100 GPU. Real-world speed may vary significantly depending on factors including the language, speaking speed, and available hardware.
Size     Parameters   English-only   Multilingual   Required VRAM   Relative Speed
tiny     39 M         tiny.en        tiny           ~1 GB           ~10x
base     74 M         base.en        base           ~1 GB           ~7x
small    244 M        small.en       small          ~2 GB           ~4x
medium   769 M        medium.en      medium         ~5 GB           ~2x
large    1550 M       N/A            large          ~10 GB          1x
turbo    809 M        N/A            turbo          ~6 GB           ~8x

Model Selection Guidance

English-only models (.en suffix) tend to perform better on English audio, especially tiny.en and base.en; the gap narrows for small.en and medium.en. The turbo model is an optimized version of large-v3 that transcribes significantly faster with only minimal degradation in accuracy. Note that the turbo model is not trained for translation tasks.
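The guidance above can be sketched as a small helper: prefer the .en variant for English transcription where one exists, and avoid turbo for translation. The function name and shape are hypothetical, not part of the Whisper API.

```python
def choose_model(size: str, language: str, task: str = "transcribe") -> str:
    """Map a model size plus language/task to a Whisper model name.

    Illustrative helper: English-only .en variants exist for
    tiny/base/small/medium, and turbo is not trained for translation.
    """
    if task == "translate" and size == "turbo":
        raise ValueError("turbo is not trained for translation")
    english_only_sizes = {"tiny", "base", "small", "medium"}
    if language == "en" and task == "transcribe" and size in english_only_sizes:
        return f"{size}.en"
    return size
```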

Accuracy by Language

Whisper’s performance varies widely by language. The chart below breaks down the performance of the large-v3 and large-v2 models by language, using word error rate (WER) or character error rate (CER, shown in italics).
Metrics are evaluated on the Common Voice 15 and Fleurs datasets. Lower WER/CER values indicate better performance.
[Figure: WER/CER breakdown by language for the large-v3 and large-v2 models]

Additional Metrics

For comprehensive performance data:
  • WER/CER metrics for all models and datasets: See Appendix D.1, D.2, and D.4 of the paper
  • Translation BLEU scores: See Appendix D.3 of the paper

Performance Tradeoffs

Speed vs Accuracy

  • Fastest: tiny and turbo models provide the quickest transcription but with lower accuracy
  • Balanced: small and medium models offer good performance for most use cases
  • Most Accurate: large models provide the best accuracy but require more VRAM and processing time
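The relative speeds from the table above support a back-of-the-envelope runtime estimate. This sketch assumes you have measured (or estimated) how long the large model takes on your audio; it is not a benchmark.

```python
# Speeds relative to the large model (1x), taken from the table above.
REL_SPEED = {"tiny": 10, "base": 7, "small": 4, "medium": 2, "large": 1, "turbo": 8}

def estimated_runtime_s(model: str, large_runtime_s: float) -> float:
    """Rough transcription-time estimate for `model`, given the time
    the large model takes on the same audio."""
    return large_runtime_s / REL_SPEED[model]
```

For instance, if large takes 80 seconds on a clip, turbo should take roughly 10 seconds (~8x faster).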

VRAM Requirements

Minimum VRAM needed for each model:
  • 1 GB: Sufficient for tiny and base models
  • 2 GB: Required for small model
  • 5 GB: Required for medium model
  • 6 GB: Required for turbo model
  • 10 GB: Required for large model
Ensure your GPU has sufficient VRAM before selecting a model. Out-of-memory errors will occur if VRAM is insufficient.
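One way to apply the VRAM list above is to filter models against the memory you actually have. A minimal sketch (the function name is hypothetical; in practice you would query available VRAM from your GPU tooling):

```python
# Approximate VRAM requirements (GB) from the list above.
VRAM_REQUIRED_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large": 10}

def models_that_fit(available_vram_gb: float) -> list[str]:
    """Return models whose approximate VRAM requirement fits the
    available budget, smallest requirement first."""
    return sorted(
        (m for m, gb in VRAM_REQUIRED_GB.items() if gb <= available_vram_gb),
        key=lambda m: VRAM_REQUIRED_GB[m],
    )
```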

Language-Specific Performance

Performance correlates strongly with the amount of training data available for each language:
  • High-resource languages (e.g., English, Spanish, French): Best performance with lower WER
  • Medium-resource languages: Moderate performance
  • Low-resource languages: Higher WER and potential accuracy issues

Training Data Distribution

The models are trained on 680,000 hours of audio:
  • 65% (438,000 hours): English audio with English transcripts
  • 18% (126,000 hours): Non-English audio with English transcripts (translation)
  • 17% (117,000 hours): Non-English audio with native transcripts (98 languages)
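As a quick sanity check, the three components listed above sum to 681,000 hours, consistent with the ~680,000-hour total (the percentages are rounded):

```python
# Training-data split from the text above, in hours.
hours = {
    "english_transcription": 438_000,  # English audio, English transcripts
    "translation":           126_000,  # non-English audio, English transcripts
    "multilingual":          117_000,  # non-English audio, native transcripts
}
total = sum(hours.values())  # 681,000 hours, i.e. ~680,000
```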

Robustness

Whisper models exhibit improved robustness compared to many existing ASR systems:
  • Accents: Better handling of diverse accents
  • Background noise: Improved performance in noisy environments
  • Technical language: Better recognition of domain-specific terminology
  • Zero-shot translation: Can translate from multiple languages into English without specific fine-tuning
