Overview
Whisper provides a clean Python API for integrating speech recognition and translation into your applications. This guide covers the main functions and common usage patterns.
Basic Transcription
Import and load model
import whisper
model = whisper.load_model("turbo")
Available models: tiny, base, small, medium, large, turbo, or English-only variants (tiny.en, base.en, small.en, medium.en).
Transcribe audio
result = model.transcribe("audio.mp3")
print(result["text"])
The transcribe() Function
The transcribe() method is the primary interface for audio transcription.
Function Signature
def transcribe(
    model: Whisper,
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    clip_timestamps: Union[str, List[float]] = "0",
    hallucination_silence_threshold: Optional[float] = None,
    **decode_options,
)
Parameters
Basic
audio: Path to an audio file, or the audio waveform as a NumPy array or PyTorch tensor
verbose: Progress display: True prints each decoded segment, False shows a progress bar, None is silent
language: Spoken language code (e.g., "en", "ja", "es"); auto-detected from the first 30 seconds if omitted
task: Either "transcribe" or "translate" (default: "transcribe")
Quality Control (see the example after this list)
temperature: Sampling temperature, or a tuple of temperatures to fall back through when decoding fails (default: (0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
compression_ratio_threshold: Treat decoding as failed if the gzip compression ratio exceeds this (default: 2.4)
logprob_threshold: Treat decoding as failed if the average log probability is below this (default: -1.0)
no_speech_threshold: Probability threshold for treating a segment as silence (default: 0.6)
Timestamps
word_timestamps: Extract word-level timestamps using cross-attention weights (default: False)
prepend_punctuations: Punctuation marks to merge with the following word (used with word_timestamps=True)
append_punctuations: Punctuation marks to merge with the preceding word (used with word_timestamps=True)
clip_timestamps: Comma-separated string or list of start,end timestamps (in seconds) of clips to process (default: "0", the whole file)
hallucination_silence_threshold: With word_timestamps=True, skip silent periods longer than this (in seconds) when a possible hallucination is detected
Context
condition_on_previous_text: Use the previous output as context for the next window (default: True)
initial_prompt: Text prompt for the first window to guide transcription
carry_initial_prompt: Prepend initial_prompt to every internal decode call (default: False)
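For example, a stricter configuration might disable the temperature fallback and tighten the failure thresholds. The values below are illustrative, not recommendations:
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,                   # greedy decoding only, no fallback
    compression_ratio_threshold=2.0,   # give up earlier on repetitive output
    logprob_threshold=-0.8,            # give up earlier on low-confidence output
    no_speech_threshold=0.4,           # mark segments as silence more readily
)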
Return Value
Returns a dictionary containing:
{
    "text": str,             # Full transcription text
    "segments": List[dict],  # List of segments with timestamps and metadata
    "language": str          # Detected or specified language code
}
Each segment contains:
{
    "id": int,
    "seek": int,
    "start": float,
    "end": float,
    "text": str,
    "tokens": List[int],
    "temperature": float,
    "avg_logprob": float,
    "compression_ratio": float,
    "no_speech_prob": float,
    "words": List[dict]  # Only present if word_timestamps=True
}
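The per-segment metadata is useful for simple quality filtering, for example skipping segments that are probably silence or that decoded with low confidence (the thresholds below are illustrative):
result = model.transcribe("audio.mp3")
for segment in result["segments"]:
    if segment["no_speech_prob"] > 0.6 or segment["avg_logprob"] < -1.0:
        continue  # likely silence or a low-confidence decode
    print(segment["text"])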
Usage Examples
Specify Language
import whisper
model = whisper.load_model("turbo")
result = model.transcribe("japanese.wav", language="ja")
print(result["text"])
Get Segment Details
result = model.transcribe("audio.mp3")
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
Use Initial Prompt for Custom Vocabulary
result = model.transcribe(
    "technical_talk.mp3",
    initial_prompt="This is a discussion about PyTorch, CUDA, and neural networks."
)
The initial_prompt parameter is useful for improving accuracy with domain-specific vocabulary, proper nouns, or technical terms.
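By default the prompt only conditions the first 30-second window. To keep it in effect for the whole file, set carry_initial_prompt=True, which prepends the prompt to every internal decode call:
result = model.transcribe(
    "technical_talk.mp3",
    initial_prompt="This is a discussion about PyTorch, CUDA, and neural networks.",
    carry_initial_prompt=True
)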
Extract Word-Level Timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
Process Specific Time Ranges
# Process audio from 30-60s and 120-180s
result = model.transcribe(
    "long_audio.mp3",
    clip_timestamps=[30, 60, 120, 180]
)
Disable Context Conditioning
# More robust to repetition loops, but less consistent across segments
result = model.transcribe(
    "audio.mp3",
    condition_on_previous_text=False
)
Lower-Level Access
For more control, use the detect_language() and decode() functions:
Language Detection
import whisper
model = whisper.load_model("turbo")
# Load and preprocess audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Generate Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
print(f"Probability: {probs[detected_language]:.2%}")
Manual Decoding
import whisper
from whisper.decoding import DecodingOptions, decode
model = whisper.load_model("turbo")
# Load and preprocess audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Generate Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Configure decoding options
options = DecodingOptions(
    language="en",
    task="transcribe",
    temperature=0.0,
    fp16=True
)
# Decode the audio
result = decode(model, mel, options)
print(result.text)
The transcribe() function internally handles sliding windows and processes the entire audio file. The decode() function only processes a single 30-second segment.
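To illustrate the difference, the sketch below slides a 30-second window over a longer file and decodes each chunk independently. It is a simplification for illustration only: it omits the overlap handling, temperature fallback, and timestamp stitching that transcribe() performs, and it assumes English audio.
import whisper
from whisper.audio import N_SAMPLES  # 480,000 samples = 30 s at 16 kHz

model = whisper.load_model("turbo")
audio = whisper.load_audio("long_audio.mp3")

texts = []
for start in range(0, len(audio), N_SAMPLES):
    # Take the next 30-second chunk, padding the final one with silence
    chunk = whisper.pad_or_trim(audio[start : start + N_SAMPLES])
    mel = whisper.log_mel_spectrogram(chunk, n_mels=model.dims.n_mels).to(model.device)
    result = whisper.decode(model, mel, whisper.DecodingOptions(language="en"))
    texts.append(result.text)

print(" ".join(texts))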
Model Loading Options
Load from Specific Directory
model = whisper.load_model("turbo", download_root="/path/to/models")
By default, models are cached in ~/.cache/whisper/.
Device Selection
import torch
# Load on GPU
model = whisper.load_model("turbo", device="cuda")
# Load on CPU
model = whisper.load_model("turbo", device="cpu")
# Auto-select
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("turbo", device=device)
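Whisper decodes in FP16 by default; FP16 is not supported on CPU, so it falls back to FP32 and prints a warning. To make the choice explicit, fp16 can be tied to the selected device:
# Request FP16 only when running on a GPU; avoids the FP32 fallback warning on CPU
result = model.transcribe("audio.mp3", fp16=(device == "cuda"))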
Batch Processing
import whisper
from pathlib import Path
model = whisper.load_model("turbo")
audio_dir = Path("./audio_files")
for audio_file in audio_dir.glob("*.mp3"):
    print(f"Transcribing {audio_file.name}...")
    result = model.transcribe(str(audio_file))

    # Save transcription
    output_file = audio_dir / f"{audio_file.stem}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])
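To write subtitle files instead of plain text, the writers in whisper.utils can be used inside the loop above. A sketch, noting that the writer call signature has varied slightly across whisper versions:
from whisper.utils import get_writer

# Writes an .srt next to each input; also supports "txt", "vtt", "tsv", "json", and "all"
writer = get_writer("srt", str(audio_dir))
writer(result, str(audio_file))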
Audio Preprocessing
Whisper includes utility functions for audio processing:
import whisper
# Load audio file (resampled to 16kHz)
audio = whisper.load_audio("audio.mp3")
# Pad or trim to 30 seconds
audio_30s = whisper.pad_or_trim(audio)
# Generate Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=80)
print(f"Audio shape: {audio.shape}") # (samples,)
print(f"Mel shape: {mel.shape}") # (80, frames)
Audio must be resampled to 16kHz mono. The whisper.load_audio() function handles this automatically using FFmpeg.
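Because transcribe() also accepts a waveform directly, you can preprocess or slice the audio yourself and skip the file path entirely. For example, to transcribe only the first minute:
import whisper

model = whisper.load_model("turbo")
audio = whisper.load_audio("audio.mp3")   # float32 array at 16 kHz

first_minute = audio[: 60 * 16000]        # 60 s of samples at 16,000 Hz
result = model.transcribe(first_minute)
print(result["text"])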