DecodingOptions

A frozen dataclass that contains all configuration options for decoding 30-second audio segments. Used by the decode() function to control decoding behavior.
@dataclass(frozen=True)
class DecodingOptions:
    task: str = "transcribe"
    language: Optional[str] = None
    temperature: float = 0.0
    sample_len: Optional[int] = None
    best_of: Optional[int] = None
    beam_size: Optional[int] = None
    patience: Optional[float] = None
    length_penalty: Optional[float] = None
    prompt: Optional[Union[str, List[int]]] = None
    prefix: Optional[Union[str, List[int]]] = None
    suppress_tokens: Optional[Union[str, Iterable[int]]] = "-1"
    suppress_blank: bool = True
    without_timestamps: bool = False
    max_initial_timestamp: Optional[float] = 1.0
    fp16: bool = True

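Because the class is declared with @dataclass(frozen=True), instances are immutable. The snippet below uses a trimmed stand-in (DemoOptions, not the real whisper class) to show that assignment raises FrozenInstanceError and that dataclasses.replace creates a modified copy:

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional

# Trimmed stand-in mirroring a few DecodingOptions fields, for illustration only
@dataclass(frozen=True)
class DemoOptions:
    task: str = "transcribe"
    language: Optional[str] = None
    temperature: float = 0.0

options = DemoOptions()

# Frozen: direct field assignment raises dataclasses.FrozenInstanceError
try:
    options.temperature = 0.5
except dataclasses.FrozenInstanceError:
    pass

# Create a modified copy instead of mutating in place
warmer = dataclasses.replace(options, temperature=0.5)
assert warmer.temperature == 0.5 and options.temperature == 0.0
```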
Fields

Task and Language

task
str
default:"transcribe"
Whether to perform transcription ("transcribe") or translation to English ("translate")
language
Optional[str]
default:"None"
Language code of the audio (e.g., "en", "fr", "ja"). If None, language is detected automatically.

Sampling Strategy

temperature
float
default:"0.0"
Sampling temperature. Use 0.0 for greedy decoding (deterministic), or values like 0.2-1.0 for stochastic sampling.
sample_len
Optional[int]
default:"None"
Maximum number of tokens to sample. Defaults to half the model's text context length (n_text_ctx // 2) if not specified.
best_of
Optional[int]
default:"None"
Number of independent samples to generate when using stochastic sampling (temperature > 0). The best one is selected based on log probability.
beam_size
Optional[int]
default:"None"
Number of beams for beam search. Applies only when temperature is 0, where it replaces greedy decoding. Cannot be used together with best_of.
patience
Optional[float]
default:"None"
Beam search patience factor as described in arxiv:2204.05424. Requires beam_size to be set.
length_penalty
Optional[float]
default:"None"
"Alpha" parameter for the Google NMT length penalty. Use None for simple length normalization (dividing by token count). Should be between 0 and 1 if specified.
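Candidate sequences are ranked by their summed log probability divided by a length penalty. With length_penalty set, the Google NMT formula ((5 + length) / 6) ** alpha is used; with None, the score is divided by the raw token count. A quick illustration (the helper name is ours, not whisper's):

```python
from typing import Optional

def length_penalty(num_tokens: int, alpha: Optional[float]) -> float:
    """Divisor applied to a candidate's summed log probability."""
    if alpha is None:
        return float(num_tokens)            # simple length normalization
    return ((5 + num_tokens) / 6) ** alpha  # Google NMT length penalty

# alpha = 1.0 and 7 tokens: (5 + 7) / 6 = 2.0
print(length_penalty(7, 1.0))   # 2.0
# alpha = None falls back to dividing by the raw length
print(length_penalty(7, None))  # 7.0
```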

Prompting and Context

prompt
Optional[Union[str, List[int]]]
default:"None"
Text or token IDs to provide as context from previous audio. Helps keep spelling and style consistent across segments.
prefix
Optional[Union[str, List[int]]]
default:"None"
Text or token IDs to prefix the current segment with. Forces the transcription to start with specific text.

Token Suppression

suppress_tokens
Optional[Union[str, Iterable[int]]]
default:"-1"
List of token IDs to suppress, or a comma-separated string of IDs. Use "-1" to suppress the non-speech tokens defined in tokenizer.non_speech_tokens.
suppress_blank
bool
default:"True"
Suppress blank outputs at the beginning of sampling.
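The string form of suppress_tokens can be sketched as follows. This is a simplified illustration of the parsing rules described above, not whisper's internal code (the real implementation also appends special tokens):

```python
from typing import Iterable, List, Optional, Union

def parse_suppress_tokens(
    value: Optional[Union[str, Iterable[int]]],
    non_speech_tokens: List[int],
) -> List[int]:
    """Simplified sketch of how suppress_tokens is interpreted."""
    if value is None:
        tokens: List[int] = []
    elif isinstance(value, str):
        # Comma-separated token IDs, e.g. "-1" or "1,2,7"
        tokens = [int(t) for t in value.split(",") if t]
    else:
        tokens = list(value)
    if -1 in tokens:  # sentinel: expand to the non-speech token set
        tokens = [t for t in tokens if t >= 0] + non_speech_tokens
    return sorted(set(tokens))

# The default "-1" expands to whatever the tokenizer reports as non-speech
print(parse_suppress_tokens("-1", non_speech_tokens=[220, 1350]))  # [220, 1350]
```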

Timestamp Options

without_timestamps
bool
default:"False"
Use the <|notimestamps|> token to sample text tokens only, without any timestamp tokens.
max_initial_timestamp
Optional[float]
default:"1.0"
Maximum allowed timestamp (in seconds) for the first token. Prevents the model from starting too late in the audio.

Implementation Details

fp16
bool
default:"True"
Use 16-bit floating point precision for most calculations. Set to False if running on CPU or if encountering numerical issues.

Usage Examples

Basic Transcription

from whisper import load_model
from whisper.decoding import DecodingOptions, decode
from whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram

# Load model and audio
model = load_model("base")
audio = load_audio("audio.mp3")
mel = log_mel_spectrogram(pad_or_trim(audio)).to(model.device)

# Default options (greedy decoding)
options = DecodingOptions()
result = decode(model, mel, options)
print(result.text)

Translation to English

# Translate foreign language audio to English
options = DecodingOptions(
    task="translate",
    language="ja"  # Japanese audio
)
result = decode(model, mel, options)
print(result.text)  # Output in English

Beam Search Decoding

# Use beam search for potentially better results
options = DecodingOptions(
    beam_size=5,
    patience=1.0,
    length_penalty=0.8
)
result = decode(model, mel, options)

Stochastic Sampling

# Generate multiple candidates and pick the best
options = DecodingOptions(
    temperature=0.2,
    best_of=5
)
result = decode(model, mel, options)

Using Prompts for Context

# Provide context from previous segment
options = DecodingOptions(
    prompt="The speaker was discussing machine learning concepts.",
    language="en"
)
result = decode(model, mel, options)

Text-Only Output (No Timestamps)

# Disable timestamps for plain text
options = DecodingOptions(
    without_timestamps=True
)
result = decode(model, mel, options)

Custom Token Suppression

# Suppress specific tokens
options = DecodingOptions(
    suppress_tokens=[1, 2, 7, 8, 9],  # Specific token IDs
    suppress_blank=True
)
result = decode(model, mel, options)

Using Kwargs Shortcut

# Pass options as kwargs directly to decode()
result = decode(
    model,
    mel,
    language="en",
    task="transcribe",
    temperature=0.0
)

Notes

When to Use Each Option

  • Greedy decoding (temperature=0.0): Fastest and most deterministic, good for most use cases
  • Beam search (beam_size=5): Better quality for challenging audio, slower than greedy
  • Stochastic sampling (temperature>0, best_of>1): Useful for creative applications or when you want variation
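whisper's higher-level transcribe() combines these strategies with a fallback: decode at temperature 0 first (greedy or beam search) and retry at increasing temperatures if the result looks degenerate. A simplified sketch of that loop, with a stubbed decode step standing in for the real decode() call:

```python
from dataclasses import dataclass

@dataclass
class StubResult:
    text: str
    avg_logprob: float
    compression_ratio: float

def decode_stub(temperature: float) -> StubResult:
    # Stand-in for decode(model, mel, DecodingOptions(temperature=...)).
    # Here the t=0 attempt is made artificially low-confidence.
    quality = -1.2 if temperature == 0.0 else -0.4
    return StubResult("hello world", quality, 1.3)

def decode_with_fallback(temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)) -> StubResult:
    result = None
    for t in temperatures:
        result = decode_stub(t)
        # Accept unless the output looks degenerate: repetitive
        # (high compression ratio) or low-confidence (low avg logprob)
        if result.compression_ratio <= 2.4 and result.avg_logprob >= -1.0:
            return result
    return result

# The t=0.0 attempt is rejected by the logprob check; the retry is accepted
print(decode_with_fallback().avg_logprob)  # -0.4
```

The 2.4 and -1.0 thresholds mirror transcribe()'s default compression_ratio_threshold and logprob_threshold.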

Constraints

  • Cannot use beam_size and best_of together
  • patience requires beam_size to be set
  • length_penalty should be between 0 and 1 if specified
  • best_of is incompatible with greedy sampling (temperature=0)
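These constraints can be checked up front. The helper below is an illustration of the rules listed above, not whisper's internal validation code:

```python
from typing import Optional

def check_options(temperature: float = 0.0,
                  best_of: Optional[int] = None,
                  beam_size: Optional[int] = None,
                  patience: Optional[float] = None,
                  length_penalty: Optional[float] = None) -> None:
    """Raise ValueError for the incompatible combinations listed above."""
    if beam_size is not None and best_of is not None:
        raise ValueError("beam_size and best_of can't be given together")
    if temperature == 0 and best_of is not None:
        raise ValueError("best_of is incompatible with greedy sampling (t=0)")
    if patience is not None and beam_size is None:
        raise ValueError("patience requires beam_size to be set")
    if length_penalty is not None and not (0 <= length_penalty <= 1):
        raise ValueError("length_penalty (alpha) should be between 0 and 1")

check_options(temperature=0.0, beam_size=5, patience=1.0)  # OK: beam search
check_options(temperature=0.2, best_of=5)                  # OK: stochastic sampling
```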

Performance Tips

  • Set fp16=False when running on CPU for better compatibility
  • Lower beam_size for faster decoding at the cost of quality
  • Use without_timestamps=True if you don’t need timestamp information
  • The dataclass is frozen (immutable): use dataclasses.replace() or construct a new instance to change options

Language Codes

Common language codes include: "en" (English), "zh" (Chinese), "es" (Spanish), "fr" (French), "de" (German), "ja" (Japanese), "ko" (Korean), "ru" (Russian), "ar" (Arabic), "hi" (Hindi)