Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It’s a multitasking model capable of performing multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

Transformer Sequence-to-Sequence Approach

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks. This architecture allows a single model to replace many stages of a traditional speech-processing pipeline.

Unified Task Representation

All tasks are jointly represented as a sequence of tokens to be predicted by the decoder, including:
  • Multilingual speech recognition
  • Speech translation
  • Spoken language identification
  • Voice activity detection
The multitask training format uses a set of special tokens that serve as task specifiers or classification targets, enabling seamless switching between different speech processing tasks.
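The prompt structure can be sketched in plain Python. This is an illustration only: `build_prompt` is a hypothetical helper, and the strings shown mirror the special tokens used in Whisper's released tokenizer (`<|startoftranscript|>`, language tags like `<|en|>`, task tags `<|transcribe|>`/`<|translate|>`, and `<|notimestamps|>`), while the real model operates on the corresponding integer token IDs.

```python
# Sketch of Whisper's multitask decoder prompt. The token strings below
# mirror the released tokenizer's special tokens; build_prompt itself is
# a hypothetical helper, not part of the whisper library.
def build_prompt(language: str, task: str, timestamps: bool = True) -> list[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    assert task in ("transcribe", "translate")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

# The same audio can be routed to different tasks just by changing the prompt:
print(build_prompt("en", "transcribe"))
print(build_prompt("fr", "translate", timestamps=False))
```

Because the task is encoded entirely in this prefix, switching from transcription to translation requires no change to the model itself, only to the tokens the decoder is conditioned on.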

Multitask Training

Whisper’s training approach combines multiple speech processing tasks into a single model:

Speech Recognition

Transcribe audio to text in the same language

Translation

Translate foreign language speech to English

Language Detection

Identify the spoken language in audio

Voice Activity

Detect when speech is present in audio

How It Works

The transcribe() method processes audio using:
  1. 30-second sliding window: Audio is processed in overlapping segments
  2. Autoregressive predictions: The model generates text token by token
  3. Special tokens: Task-specific tokens guide the model’s behavior
import whisper

# Load the turbo model (fast multilingual transcription)
model = whisper.load_model("turbo")

# Transcribe an audio file; the result dict includes the full text,
# per-segment details, and the detected language
result = model.transcribe("audio.mp3")
print(result["text"])
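The 30-second windowing in step 1 can be sketched in plain Python. This is a simplified illustration under stated assumptions: `chunk_audio` is a hypothetical helper, the real pipeline works on log-mel spectrograms rather than raw samples, pads the final window to a full 30 seconds, and advances the window based on the timestamps the model predicts, so segments can overlap.

```python
SAMPLE_RATE = 16_000   # Whisper resamples all input audio to 16 kHz
CHUNK_SECONDS = 30     # the model's fixed context window

def chunk_audio(samples: list[float], hop_seconds: int = 30) -> list[list[float]]:
    """Split raw samples into 30-second windows (hypothetical helper).

    Uses a fixed hop for simplicity; the real decoder shifts the window
    according to predicted timestamps, which is how overlap arises.
    """
    window = SAMPLE_RATE * CHUNK_SECONDS
    hop = SAMPLE_RATE * hop_seconds
    return [samples[i:i + window] for i in range(0, len(samples), hop)]

# 75 seconds of (silent) audio -> three windows: 30 s, 30 s, 15 s
chunks = chunk_audio([0.0] * (SAMPLE_RATE * 75))
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 15.0]
```

Within each window, the decoder then runs the autoregressive loop of step 2, conditioned on the special tokens of step 3.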

Key Features

  • Six model sizes are available, from tiny (39M parameters) to large (1550M parameters), offering different speed and accuracy tradeoffs.
  • Whisper supports transcription and translation for 99 languages, with performance varying by language.
  • Specialized English-only variants (the .en models) offer better performance on English audio.
  • The turbo model offers 8x faster transcription than large-v3 with minimal accuracy loss.
  • The turbo model is not trained for translation tasks; use the multilingual models (tiny, base, small, medium, large) for translation.
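The guidance above can be encoded as a small helper. This is a hypothetical sketch, not a library API: `pick_model` and its choice of `"medium"` for translation are illustrative assumptions (any multilingual size works for translation), and the parameter counts are from the published model table, with turbo at roughly 809M.

```python
# Approximate parameter counts in millions, per the published model table
# (turbo's ~809M is an assumption based on the release notes).
MODEL_PARAMS_M = {
    "tiny": 39, "base": 74, "small": 244,
    "medium": 769, "large": 1550, "turbo": 809,
}

def pick_model(task: str, english_only: bool = False) -> str:
    """Hypothetical helper: choose a name to pass to whisper.load_model()."""
    if task == "translate":
        return "medium"   # turbo is not trained for translation; any
                          # multilingual size works, medium is a middle ground
    if english_only:
        return "small.en" # .en variants often perform better on English audio
    return "turbo"        # fast transcription, near large-v3 accuracy

print(pick_model("transcribe"))  # turbo
print(pick_model("translate"))   # medium
```

The key constraint it encodes is the last bullet above: translation requests must avoid turbo.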
