Function Signature
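The sketch below reconstructs the signature of `whisper.decode()` as defined in `whisper/decoding.py`; type names are abbreviated, so verify against your installed version.

```python
# Sketch of the decode() signature (from whisper/decoding.py; verify
# against your installed version). Tensor is torch.Tensor.
@torch.no_grad()
def decode(
    model: "Whisper",
    mel: Tensor,                                  # (80, 3000) or (*, 80, 3000)
    options: DecodingOptions = DecodingOptions(),
    **kwargs,                                     # override fields in options
) -> Union[DecodingResult, List[DecodingResult]]:
    ...
```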
Parameters
- model: The Whisper model instance returned by load_model().
- mel: A tensor containing the Mel spectrogram(s), typically produced by whisper.log_mel_spectrogram(). Shape: (80, 3000) for a single segment or (*, 80, 3000) for batched segments.
  - 80: Number of Mel frequency bins
  - 3000: Number of frames (corresponds to 30 seconds of audio)
- options: A DecodingOptions dataclass instance containing all decoding options. Can be constructed with specific parameters or modified using dataclasses.replace(). See the DecodingOptions section below for all available fields.
- **kwargs: Additional keyword arguments that override fields in the options parameter. Example: decode(model, mel, options, temperature=0.5) overrides the temperature in options.
DecodingOptions
DecodingOptions is a frozen dataclass with the following fields:
- task: Either "transcribe" (X→X speech recognition) or "translate" (X→English translation).
- language: Language code for the audio (e.g., "en", "fr", "es"). Uses the detected language if None.
- temperature: Temperature for sampling.
  - 0.0: Greedy decoding (deterministic)
  - > 0.0: Sampling (stochastic)
- sample_len: Maximum number of tokens to sample. Defaults to model.dims.n_text_ctx // 2 if not specified.
- best_of: Number of independent sample trajectories when temperature > 0. The result with the highest average log probability is selected.
- beam_size: Number of beams in beam search (only applicable when temperature == 0). Typical values: 3-5. Higher values are more accurate but slower.
- patience: Patience value for beam search as described in arXiv:2204.05424. The default (1.0) is equivalent to conventional beam search.
- length_penalty: Token length penalty coefficient (alpha) as in arXiv:1609.08144.
  - None: Simple length normalization (divide by length)
  - 0.0 - 1.0: Google NMT-style length penalty
- prompt: Text or token IDs to provide as context from the previous audio window. See the Prompt Engineering section below for details.
- prefix: Text or token IDs to prefix the current context with. Forces the model to begin its output with specific text.
- suppress_tokens: List of token IDs (or a comma-separated string) to suppress during sampling.
  - "-1": Suppress most special characters except common punctuation (default)
  - "" or None: No suppression
  - Custom: e.g., "1,2,3" or [1, 2, 3]
- suppress_blank: Suppress blank outputs at the beginning of sampling. Prevents the model from generating space or silence tokens first.
- without_timestamps: Use the <|notimestamps|> token to sample text tokens only, without timestamp tokens.
- max_initial_timestamp: Maximum allowed timestamp (in seconds) for the first generated timestamp. Prevents the model from skipping too far ahead at the start.
- fp16: Use FP16 (half precision) for most calculations.
  - Significantly faster on CUDA GPUs
  - Automatically disabled on CPU (FP16 not supported)
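Since DecodingOptions is frozen, instances cannot be mutated in place; new configurations are derived with dataclasses.replace(). The sketch below uses a simplified stand-in dataclass with only a few of the fields above, so it runs without whisper installed:

```python
from dataclasses import dataclass, replace
from typing import Optional

# Simplified stand-in for whisper.DecodingOptions, with only a few of the
# fields listed above, so this sketch runs without whisper installed.
@dataclass(frozen=True)
class DecodingOptions:
    task: str = "transcribe"
    language: Optional[str] = None
    temperature: float = 0.0
    beam_size: Optional[int] = None
    fp16: bool = True

base = DecodingOptions(language="en", beam_size=5)

# Frozen dataclasses cannot be mutated in place; dataclasses.replace()
# derives a new instance with selected fields changed.
translated = replace(base, task="translate", temperature=0.2)

print(translated.task)       # translate
print(translated.beam_size)  # 5 (carried over from base)
print(base.task)             # transcribe (original unchanged)
```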
Returns
Returns a single DecodingResult if the input is 2D (80, 3000), or a list of DecodingResult objects if the input is batched.
Example
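A minimal end-to-end sketch. It assumes whisper and its dependencies are installed and that "audio.wav" (a placeholder filename) exists; load_model() downloads the checkpoint on first use.

```python
import whisper

# Assumes whisper is installed and "audio.wav" exists (placeholder name);
# load_model() downloads the checkpoint on first use.
model = whisper.load_model("base")

# Load audio, pad/trim to 30 seconds, and compute the (80, 3000) spectrogram
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Decode a single segment; fp16=False keeps this safe on CPU
options = whisper.DecodingOptions(language="en", fp16=False)
result = whisper.decode(model, mel, options)

print(result.text)
```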
Notes
Decode vs Transcribe
decode() is a lower-level function compared to transcribe():
- decode(): Works on 30-second Mel spectrogram segments
- transcribe(): Handles full audio files of any length, splits into segments, applies fallback strategies
Use decode() when:
- You need fine-grained control over single segments
- You’re implementing custom audio processing pipelines
- You want to handle batching manually
Use transcribe() when:
- Processing complete audio files
- You want automatic handling of long audio
- You need word-level timestamps and segment management
Sampling Strategies
Greedy Decoding (temperature=0):
- Deterministic (same input → same output)
- Fastest
- Best for most use cases
Beam Search (temperature=0, beam_size>1):
- More accurate than greedy
- 5x slower than greedy (with beam_size=5)
- Good for high-quality transcription
Sampling (temperature>0):
- Non-deterministic
- More creative/varied outputs
- Use best_of to select the best among multiple samples
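The three strategies correspond to DecodingOptions configurations along these lines (a sketch; assumes whisper is installed):

```python
import whisper

# Greedy: temperature=0, no beams -- deterministic and fastest
greedy = whisper.DecodingOptions(temperature=0.0)

# Beam search: requires temperature=0; patience=1.0 is conventional beam search
beam = whisper.DecodingOptions(temperature=0.0, beam_size=5, patience=1.0)

# Sampling: temperature>0; best_of keeps the trajectory with the highest
# average log probability
sampled = whisper.DecodingOptions(temperature=0.7, best_of=5)
```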
KV Cache
decode() automatically manages key-value cache for the decoder:
- First token: Full forward pass
- Subsequent tokens: Only process new token (much faster)
- Cache is cleaned up after decoding completes
- Cache is rearranged for beam search
Prompt Engineering
Using prompts for context: the prompt field helps maintain consistency across segments by providing the model with context from the previous audio.
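A sketch of chaining decoded text across windows. It assumes whisper is installed; mel_segments is a hypothetical iterable of prepared (80, 3000) spectrograms, one per 30-second window.

```python
import whisper

model = whisper.load_model("base")
previous_text = None  # no context for the first window

for mel in mel_segments:  # hypothetical: one (80, 3000) tensor per window
    options = whisper.DecodingOptions(prompt=previous_text, fp16=False)
    result = whisper.decode(model, mel, options)
    previous_text = result.text  # becomes context for the next window
```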
Token Suppression
By default, suppress_tokens="-1" suppresses:
- Special tokens (SOT, EOT, etc.)
- Most non-speech tokens
- Common punctuation is NOT suppressed
Performance Tips
- Use FP16 on GPU: ~2x faster with minimal quality loss
- Batch processing: Process multiple segments in parallel
- Adjust sample_len: Reduce if you expect short outputs
- Temperature=0: Fastest, most accurate for clear audio
- Disable beam search: Use greedy for speed-critical applications