Whisper offers six model sizes with different tradeoffs between speed, accuracy, and resource requirements. Four models have English-only versions optimized for English speech.

Available Models

Relative speeds are measured by transcribing English speech on an A100 GPU. Real-world performance varies based on language, speaking speed, and hardware.

Model Specifications

Size     Parameters   English-only   Multilingual   Required VRAM   Relative Speed
tiny     39 M         tiny.en        tiny           ~1 GB           ~10x
base     74 M         base.en        base           ~1 GB           ~7x
small    244 M        small.en       small          ~2 GB           ~4x
medium   769 M        medium.en      medium         ~5 GB           ~2x
large    1550 M       N/A            large          ~10 GB          1x
turbo    809 M        N/A            turbo          ~6 GB           ~8x
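The specifications above can be captured as a small lookup table in plain Python. This is an illustrative sketch: the `MODEL_SPECS` name and dictionary layout are not part of the Whisper library, only the values come from the table.

```python
# Model specifications from the table above (illustrative structure).
# VRAM is in GB; speed is relative to the large model (1x baseline).
MODEL_SPECS = {
    "tiny":   {"params_m": 39,   "vram_gb": 1,  "speed": 10, "english_only": "tiny.en"},
    "base":   {"params_m": 74,   "vram_gb": 1,  "speed": 7,  "english_only": "base.en"},
    "small":  {"params_m": 244,  "vram_gb": 2,  "speed": 4,  "english_only": "small.en"},
    "medium": {"params_m": 769,  "vram_gb": 5,  "speed": 2,  "english_only": "medium.en"},
    "large":  {"params_m": 1550, "vram_gb": 10, "speed": 1,  "english_only": None},
    "turbo":  {"params_m": 809,  "vram_gb": 6,  "speed": 8,  "english_only": None},
}

def fits_in_vram(name: str, available_gb: float) -> bool:
    """Check whether a model's approximate VRAM requirement fits a budget."""
    return MODEL_SPECS[name]["vram_gb"] <= available_gb

print([m for m in MODEL_SPECS if fits_in_vram(m, 6)])
# → ['tiny', 'base', 'small', 'medium', 'turbo']
```

A lookup like this is handy for picking the largest model that fits a GPU at startup instead of hard-coding one name.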

Model Variants

Tiny (39M parameters)

Best for: Real-time applications with limited resources
  • VRAM: ~1 GB
  • Speed: 10x faster than large
  • Models: tiny.en (English), tiny (Multilingual)
model = whisper.load_model("tiny")
# or for English-only
model = whisper.load_model("tiny.en")

Base (74M parameters)

Best for: Fast transcription with acceptable accuracy
  • VRAM: ~1 GB
  • Speed: 7x faster than large
  • Models: base.en (English), base (Multilingual)
model = whisper.load_model("base")
# or for English-only
model = whisper.load_model("base.en")

Small (244M parameters)

Best for: Balanced performance and resource usage
  • VRAM: ~2 GB
  • Speed: 4x faster than large
  • Models: small.en (English), small (Multilingual)
model = whisper.load_model("small")
# or for English-only
model = whisper.load_model("small.en")

Medium (769M parameters)

Best for: High accuracy with moderate speed
  • VRAM: ~5 GB
  • Speed: 2x faster than large
  • Models: medium.en (English), medium (Multilingual)
model = whisper.load_model("medium")
# or for English-only
model = whisper.load_model("medium.en")

Large (1550M parameters)

Best for: Maximum accuracy, translation tasks
  • VRAM: ~10 GB
  • Speed: Baseline (1x)
  • Models: large (Multilingual only)
  • Versions: large-v1, large-v2, large-v3
model = whisper.load_model("large")
# or specific version
model = whisper.load_model("large-v3")
The large model alias points to large-v3, the latest version.

Turbo (809M parameters)

Best for: Fast, accurate transcription (default model)
  • VRAM: ~6 GB
  • Speed: 8x faster than large
  • Models: turbo (Multilingual only)
  • Based on: Optimized large-v3
model = whisper.load_model("turbo")
The turbo model is not trained for translation tasks. Use medium or large models for translating speech to English.

English-only vs Multilingual

When to Use English-only Models

The .en models perform better on English audio, especially for tiny.en and base.en. The difference becomes less significant for larger models.
English-only models are available for: tiny, base, small, and medium sizes.

When to Use Multilingual Models

Required for:
  • Non-English transcription
  • Translation to English
  • Language identification
  • Multilingual applications

Model Selection Guide
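The guidance in the sections above can be condensed into a rule of thumb. The helper below is an illustrative sketch, not part of the Whisper library; the function name and VRAM thresholds are assumptions drawn from the tables on this page.

```python
def pick_model(english_only: bool, need_translation: bool, vram_gb: float) -> str:
    """Suggest a Whisper model name based on this page's guidance.

    Illustrative rules: translation needs medium or large (turbo is not
    trained for it); turbo is the fast multilingual default; .en variants
    help most at the smaller sizes.
    """
    if need_translation:
        # turbo is not trained for translation; prefer large if VRAM allows
        return "large" if vram_gb >= 10 else "medium"
    if vram_gb >= 6:
        return "turbo"  # fast, accurate default
    if vram_gb >= 2:
        return "small.en" if english_only else "small"
    return "tiny.en" if english_only else "tiny"

print(pick_model(english_only=True, need_translation=False, vram_gb=4))   # small.en
print(pick_model(english_only=False, need_translation=True, vram_gb=12))  # large
```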

Loading Models

Basic Loading

import whisper

model = whisper.load_model("turbo")

With Device Selection

import whisper

# Use specific GPU
model = whisper.load_model("turbo", device="cuda:0")

# Use CPU
model = whisper.load_model("turbo", device="cpu")

Custom Download Location

import whisper

model = whisper.load_model(
    "turbo",
    download_root="/path/to/models"
)

Listing Available Models

import whisper

# Get list of all available models
models = whisper.available_models()
print(models)
# ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 
#  'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 
#  'large', 'large-v3-turbo', 'turbo']

Model Versions

The large model has multiple versions with improvements:
  • large-v1: Original large model
  • large-v2: Improved accuracy
  • large-v3: Latest version with best accuracy
  • turbo: Optimized large-v3 for speed
Use large (alias for large-v3) or turbo for new projects to get the latest improvements.
