DecodingOptions
A frozen dataclass that holds all configuration options for decoding 30-second audio segments. Used by the `decode()` function to control decoding behavior.
Fields
Task and Language
- `task`: Whether to perform transcription (`"transcribe"`) or translation to English (`"translate"`).
- `language`: Language code of the audio (e.g., `"en"`, `"fr"`, `"ja"`). If `None`, the language is detected automatically.
Sampling Strategy
- `temperature`: Sampling temperature. Use `0.0` for greedy decoding (deterministic), or values like 0.2-1.0 for stochastic sampling.
- `sample_len`: Maximum number of tokens to sample. Defaults to `n_text_ctx // 2` if not specified.
- `best_of`: Number of independent samples to generate when using stochastic sampling (temperature > 0). The best one is selected based on log probability.
- `beam_size`: Number of beams for beam search when using greedy decoding (temperature = 0). Cannot be used with `best_of`.
- `patience`: Beam search patience factor as described in arXiv:2204.05424. Requires `beam_size` to be set.
- `length_penalty`: "Alpha" parameter for the length penalty from Google NMT. Use `None` for simple length normalization. Should be between 0 and 1.
Prompting and Context
- `prompt`: Text or token IDs to provide as context from the previous audio segment. Helps with consistency across segments. See discussion for details.
- `prefix`: Text or token IDs to prefix the current segment with. Forces the transcription to start with specific text.
Token Suppression
- `suppress_tokens`: List of token IDs to suppress, or a comma-separated string. Use `"-1"` to suppress the non-speech tokens defined in `tokenizer.non_speech_tokens()`.
- `suppress_blank`: Suppress blank outputs at the beginning of sampling.
Timestamp Options
- `without_timestamps`: Use the `<|notimestamps|>` token to sample text tokens only, without any timestamp tokens.
- `max_initial_timestamp`: Maximum allowed timestamp (in seconds) for the first token. Prevents the model from starting too late in the audio.
Implementation Details
- `fp16`: Use 16-bit floating-point precision for most calculations. Set to `False` if running on CPU or if encountering numerical issues.
Usage Examples
Basic Transcription
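A minimal sketch using the `openai-whisper` package; the model name `"base"` and the path `"audio.mp3"` are placeholders:

```python
import whisper

model = whisper.load_model("base")

# load the audio, pad/trim it to 30 seconds, and compute the log-Mel spectrogram
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# temperature defaults to 0.0 (greedy decoding); task defaults to "transcribe"
options = whisper.DecodingOptions(language="en")
result = whisper.decode(model, mel, options)
print(result.text)
```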
Translation to English
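Setting `task="translate"` decodes speech in any supported language into English text; a sketch with placeholder model and audio path:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# translate the segment into English regardless of the source language
options = whisper.DecodingOptions(task="translate")
result = whisper.decode(model, mel, options)
```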
Beam Search Decoding
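Beam search requires greedy (zero-temperature) decoding; a sketch with placeholder model and audio path:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# temperature=0.0 enables beam search; patience > 1.0 explores longer
options = whisper.DecodingOptions(temperature=0.0, beam_size=5, patience=1.0)
result = whisper.decode(model, mel, options)
```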
Stochastic Sampling
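With a nonzero temperature, `best_of` independent samples are drawn and the one with the highest log probability is kept; a sketch with placeholder model and audio path:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# sample 5 candidates at temperature 0.7 and keep the most probable one
options = whisper.DecodingOptions(temperature=0.7, best_of=5)
result = whisper.decode(model, mel, options)
```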
Using Prompts for Context
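The `prompt` field conditions the decoder on earlier output; the prompt string below is a made-up example, and the model/audio path are placeholders:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# pass the previous segment's transcript so spelling and style stay consistent
options = whisper.DecodingOptions(
    language="en",
    prompt="Earlier in the meeting, Dr. Okafor discussed the Q3 roadmap.",
)
result = whisper.decode(model, mel, options)
```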
Text-Only Output (No Timestamps)
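A sketch suppressing timestamp tokens entirely (placeholder model and audio path):

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# sample only text tokens; no <|0.00|>-style timestamp tokens are produced
options = whisper.DecodingOptions(without_timestamps=True)
result = whisper.decode(model, mel, options)
```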
Custom Token Suppression
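`suppress_tokens` accepts either the `"-1"` shorthand or explicit token IDs; the IDs in the second line are hypothetical, and the model/audio path are placeholders:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# "-1" (the default) suppresses the tokenizer's non-speech tokens
options = whisper.DecodingOptions(suppress_tokens="-1")

# or suppress specific token IDs (hypothetical values for illustration)
options = whisper.DecodingOptions(suppress_tokens=[1, 2, 7])
result = whisper.decode(model, mel, options)
```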
Using Kwargs Shortcut
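`whisper.decode()` also accepts option fields directly as keyword arguments, so an explicit `DecodingOptions` instance can be skipped; a sketch with placeholder model and audio path:

```python
import whisper

model = whisper.load_model("base")
mel = whisper.log_mel_spectrogram(
    whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder path
).to(model.device)

# keyword arguments are merged into a DecodingOptions instance internally
result = whisper.decode(model, mel, language="en", temperature=0.2)
```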
Notes
When to Use Each Option
- Greedy decoding (`temperature=0.0`): Fastest and most deterministic; good for most use cases
- Beam search (`beam_size=5`): Better quality for challenging audio, but slower than greedy
- Stochastic sampling (`temperature>0`, `best_of>1`): Useful for creative applications or when you want variation
Constraints
- Cannot use `beam_size` and `best_of` together
- `patience` requires `beam_size` to be set
- `length_penalty` should be between 0 and 1 if specified
- `best_of` is incompatible with greedy sampling (`temperature=0`)
Performance Tips
- Set `fp16=False` when running on CPU for better compatibility
- Lower `beam_size` for faster decoding at the cost of quality
- Use `without_timestamps=True` if you don't need timestamp information
- The dataclass is frozen (immutable); create a new instance to change options
Language Codes
Common language codes include: `"en"` (English), `"zh"` (Chinese), `"es"` (Spanish), `"fr"` (French), `"de"` (German), `"ja"` (Japanese), `"ko"` (Korean), `"ru"` (Russian), `"ar"` (Arabic), `"hi"` (Hindi)