Function Signature
Parameters
model
The Whisper model instance returned by load_model().

audio
The path to the audio file to open, or the audio waveform as a NumPy array or PyTorch tensor.
- File path: string path to an audio file (supports most formats via ffmpeg)
- NumPy array: float32 array with values in [-1.0, 1.0], sampled at 16 kHz
- PyTorch tensor: float32 tensor with values in [-1.0, 1.0], sampled at 16 kHz
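Since arrays must be float32 in [-1.0, 1.0] at 16 kHz, here is a minimal sketch of converting raw 16-bit PCM samples (already at 16 kHz) into the expected format. The helper name pcm_to_float is just for illustration and is not part of the library:

```python
import numpy as np

def pcm_to_float(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to float32 in [-1.0, 1.0]."""
    # int16 spans [-32768, 32767]; dividing by 32768 maps it into [-1.0, 1.0)
    return pcm.astype(np.float32) / 32768.0

# Example: three int16 samples mapped into [-1.0, 1.0]
audio = pcm_to_float(np.array([0, 16384, -32768], dtype=np.int16))
```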
verbose
Controls console output during transcription:
- True: display all details, including timestamps and text as they are decoded
- False: display minimal details (progress bar only)
- None: no display output
temperature
Temperature for sampling. Can be a single float or a tuple of temperatures. When a tuple is provided, the temperatures are tried sequentially upon failures determined by compression_ratio_threshold or logprob_threshold.
- 0.0: greedy decoding (deterministic, most accurate)
- > 0.0: sampling (more creative, less deterministic)
compression_ratio_threshold
If the gzip compression ratio is above this value, treat the decoding as failed and try the next temperature. High compression ratios indicate repetitive text, suggesting decoding failure.
logprob_threshold
If the average log probability over the sampled tokens is below this value, treat the decoding as failed. Low log probabilities indicate low confidence in the transcription.
no_speech_threshold
If the no-speech probability is higher than this value AND the average log probability is below logprob_threshold, consider the segment as silent. This helps skip segments with no speech activity.

condition_on_previous_text
If True, provide the previous output of the model as a prompt for the next window.
- Advantage: more consistent text across windows
- Disadvantage: the model may get stuck in failure loops (repetition, timestamps out of sync)
Set to False if experiencing repetition issues.

initial_prompt
Optional text to provide as a prompt for the first window. Use cases:
- Provide context or domain-specific vocabulary
- Guide spelling of proper nouns or technical terms
- Set the style or format of transcription
Example: "This is a medical lecture about cardiology."

carry_initial_prompt
If True, prepend initial_prompt to the prompt of each internal decode() call.
- When True: the initial prompt persists throughout the entire transcription
- When False: the initial prompt only affects the first window
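As a sketch of supplying domain vocabulary through initial_prompt (assumes the openai-whisper package is installed; the "base" model size and the helper name transcribe_lecture are placeholders, not part of the library):

```python
def transcribe_lecture(path: str) -> str:
    # Deferred import so this sketch only needs whisper when actually called.
    import whisper

    model = whisper.load_model("base")  # placeholder model size
    result = model.transcribe(
        path,
        initial_prompt="This is a medical lecture about cardiology.",
        carry_initial_prompt=True,  # keep the prompt for every window
    )
    return result["text"]
```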
word_timestamps
Extract word-level timestamps using the cross-attention pattern and dynamic time warping. When True, each segment includes a words field with per-word timing.
Note: adds computational overhead. Word-level timestamps on translations may not be reliable.

prepend_punctuations
If word_timestamps is True, merge these punctuation symbols with the next word. Example: opening quotes and brackets are merged forward.

append_punctuations
If word_timestamps is True, merge these punctuation symbols with the previous word. Example: closing quotes and periods are merged backward.

clip_timestamps
Comma-separated list, or list of floats, specifying start,end,start,end,… timestamps (in seconds) of clips to process. The last end timestamp defaults to the end of the file. Examples:
- "0,30,60,90": process 0-30 s and 60-90 s
- [10.5, 45.2]: process 10.5-45.2 s
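To make the string form concrete, here is a sketch of how such a value maps to (start, end) clip ranges. The helper parse_clips is hypothetical, written only to illustrate the format, and is not part of the library:

```python
from typing import List, Optional, Tuple

def parse_clips(spec: str) -> List[Tuple[float, Optional[float]]]:
    """Parse "start,end,start,end,..." into (start, end) pairs.

    A missing final end timestamp means "until the end of the file",
    represented here as None.
    """
    values = [float(v) for v in spec.split(",")] if spec else []
    pairs = []
    for i in range(0, len(values), 2):
        start = values[i]
        end = values[i + 1] if i + 1 < len(values) else None
        pairs.append((start, end))
    return pairs

clips = parse_clips("0,30,60,90")  # two clips: 0-30 s and 60-90 s
```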
hallucination_silence_threshold
When word_timestamps is True, skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected. Helps prevent the model from generating text during silence.

decode_options
Additional keyword arguments used to construct DecodingOptions instances. Common options:
- language (str): language code (e.g., "en", "fr"). Auto-detected if None.
- task (str): either "transcribe" (default) or "translate" (to English)
- fp16 (bool): use FP16 for inference. Default True on CUDA, False on CPU.
- beam_size (int): number of beams in beam search (only when temperature is 0)
- best_of (int): number of candidates when sampling with non-zero temperature
- patience (float): patience value for beam search
- length_penalty (float): length penalty coefficient (alpha)
- suppress_tokens (str): comma-separated token IDs to suppress
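These options are passed straight through as extra keyword arguments. A hedged sketch of translating speech to English with greedy decoding plus beam search (assumes the openai-whisper package; the "small" model size, the file path, and the helper name are placeholders):

```python
def translate_to_english(path: str) -> str:
    # Deferred import so this sketch has no import-time dependency.
    import whisper

    model = whisper.load_model("small")  # placeholder model size
    result = model.transcribe(
        path,
        task="translate",   # X -> English
        temperature=0.0,    # greedy decoding, so beam search applies
        beam_size=5,
        fp16=False,         # e.g. when running on CPU
    )
    return result["text"]
```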
Returns
A dictionary containing:
- text (str): the full transcription
- segments (list): segment-level details (start/end times, text, tokens, and quality metrics)
- language (str): the detected or specified language code
Example
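A minimal usage sketch (assumes the openai-whisper package is installed; the "base" model size and the audio file name are placeholders):

```python
def basic_transcription(path: str = "audio.mp3") -> dict:
    # Deferred import: whisper is only needed when this actually runs.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe(path, verbose=False)

    print(result["language"])       # detected language code
    print(result["text"])           # full transcription
    for seg in result["segments"]:  # per-segment timing and text
        print(f'[{seg["start"]:.2f} -> {seg["end"]:.2f}] {seg["text"]}')
    return result
```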
Notes
Language Detection
If language is not specified in decode_options:
- Multilingual models automatically detect the language from the first 30 seconds
- English-only models (.en) always use English
- Detection results are shown when verbose=True
Temperature Fallback Strategy
When multiple temperatures are provided (the default behavior):
- Starts with the lowest temperature (greedy decoding)
- If decoding fails the quality checks, tries the next temperature
- Continues until an acceptable result is produced or all temperatures are exhausted

A decoding attempt is accepted when:
- The compression ratio is below compression_ratio_threshold
- The average log probability is above logprob_threshold
- No-speech detection: a segment judged silent is skipped instead of retried
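The fallback strategy above can be sketched as a plain loop. Here attempt_decode stands in for a single decoding pass, and the thresholds mirror the parameters described earlier; this is an illustration of the retry logic, not the library's actual implementation:

```python
from typing import Callable, Tuple

def decode_with_fallback(
    attempt_decode: Callable[[float], dict],
    temperatures: Tuple[float, ...] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> dict:
    """Try temperatures in order until a decode passes the quality checks."""
    result = {}
    for t in temperatures:
        result = attempt_decode(t)
        ok = (
            result["compression_ratio"] <= compression_ratio_threshold
            and result["avg_logprob"] >= logprob_threshold
        )
        if ok:
            return result
    return result  # all temperatures exhausted: keep the last attempt

# Stub decoder: fails the log-probability check until temperature reaches 0.4.
stub = lambda t: {
    "compression_ratio": 1.5,
    "avg_logprob": -2.0 if t < 0.4 else -0.3,
    "temperature": t,
}
chosen = decode_with_fallback(stub)  # first accepted attempt is at t = 0.4
```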
Performance Considerations
- 30-second chunks: Audio is processed in 30-second windows
- Word timestamps: Adds ~20-30% processing time
- Beam search: slower than greedy but more accurate (use with temperature=0, beam_size=5)
- FP16: 2x faster on CUDA, not supported on CPU
Hallucination Detection
When word_timestamps=True and hallucination_silence_threshold is set:
- Detects anomalous words (very short/long duration, low probability)
- Skips segments surrounded by silence that appear to be hallucinations
- Helps prevent fabricated text in silent portions
Task Types
- task="transcribe": speech-to-text in the original language (X→X)
- task="translate": speech-to-text translated to English (X→EN)
Common Issues
- Repetition loops: set condition_on_previous_text=False
- Poor quality on specific domains: use initial_prompt with relevant vocabulary
- Timestamps out of sync: try word_timestamps=True for better alignment
- Processing too slow: use a smaller model, disable word timestamps, or use FP16 on GPU