
Overview

Whisper provides a clean Python API for integrating speech recognition and translation into your applications. This guide covers the main functions and common usage patterns.

Basic Transcription

1. Import and load model

import whisper

model = whisper.load_model("turbo")
Available models: tiny, base, small, medium, large, turbo, or English-only variants (tiny.en, base.en, small.en, medium.en).
2. Transcribe audio

result = model.transcribe("audio.mp3")
print(result["text"])

The transcribe() Function

The transcribe() method is the primary interface for audio transcription.

Function Signature

def transcribe(
    model: Whisper,
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,,!!??::”)]}、",
    clip_timestamps: Union[str, List[float]] = "0",
    hallucination_silence_threshold: Optional[float] = None,
    **decode_options,
)

Parameters

  • audio: Path to audio file, or audio waveform as NumPy array or PyTorch tensor
  • verbose: Console output mode: True prints each segment as it is decoded, False shows only a progress bar, None (the default) prints nothing
  • language: Specify the spoken language (e.g., "en", "ja", "es")
  • task: Either "transcribe" or "translate" (default: "transcribe")
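
The temperature default in the signature above is a fallback schedule rather than a single value: decoding starts at 0.0 and retries at progressively higher temperatures when a segment fails the quality checks. A minimal sketch of that check, using the default compression_ratio_threshold and logprob_threshold values (the helper name is ours, not part of the Whisper API):

```python
def needs_fallback(compression_ratio: float, avg_logprob: float,
                   compression_ratio_threshold: float = 2.4,
                   logprob_threshold: float = -1.0) -> bool:
    """Return True if a decoded segment looks unreliable and should be
    retried at the next temperature in the schedule."""
    if compression_ratio > compression_ratio_threshold:
        return True  # output is too repetitive: likely a repetition loop
    if avg_logprob < logprob_threshold:
        return True  # the model was not confident in its own output
    return False

# A clean segment passes; a repetitive one triggers a retry.
print(needs_fallback(compression_ratio=1.3, avg_logprob=-0.25))  # False
print(needs_fallback(compression_ratio=3.1, avg_logprob=-0.25))  # True
```

Raising either threshold (or setting it to None in transcribe()) disables that particular check.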

Return Value

Returns a dictionary containing:
{
    "text": str,              # Full transcription text
    "segments": List[dict],   # List of segments with timestamps and metadata
    "language": str           # Detected or specified language code
}
Each segment contains:
{
    "id": int,
    "seek": int,
    "start": float,
    "end": float,
    "text": str,
    "tokens": List[int],
    "temperature": float,
    "avg_logprob": float,
    "compression_ratio": float,
    "no_speech_prob": float,
    "words": List[dict]  # Only present if word_timestamps=True
}
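
The per-segment quality fields above can be used to flag segments that are probably silence or hallucinated text. A hedged sketch (the helper and its combination of thresholds are illustrative, mirroring the defaults from the signature):

```python
def suspect_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Yield segments that look like silence or low-confidence output:
    high no-speech probability combined with a low average log-probability."""
    for seg in segments:
        if (seg["no_speech_prob"] > no_speech_threshold
                and seg["avg_logprob"] < logprob_threshold):
            yield seg

# Synthetic segments standing in for real transcribe() output:
segments = [
    {"id": 0, "text": " Hello.", "no_speech_prob": 0.02, "avg_logprob": -0.3},
    {"id": 1, "text": " ...", "no_speech_prob": 0.91, "avg_logprob": -1.4},
]
print([s["id"] for s in suspect_segments(segments)])  # [1]
```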

Usage Examples

Specify Language

import whisper

model = whisper.load_model("turbo")
result = model.transcribe("japanese.wav", language="ja")
print(result["text"])

Get Segment Details

result = model.transcribe("audio.mp3")

for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")

Use Initial Prompt for Custom Vocabulary

result = model.transcribe(
    "technical_talk.mp3",
    initial_prompt="This is a discussion about PyTorch, CUDA, and neural networks."
)
The initial_prompt parameter is useful for improving accuracy with domain-specific vocabulary, proper nouns, or technical terms.

Extract Word-Level Timestamps

result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")

Process Specific Time Ranges

# Process audio from 30-60s and 120-180s
result = model.transcribe(
    "long_audio.mp3",
    clip_timestamps=[30, 60, 120, 180]
)
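
clip_timestamps also accepts a comma-separated string of start,end offsets in seconds; the list above is equivalent to "30,60,120,180". A small hypothetical helper for building that string from (start, end) pairs:

```python
def pairs_to_clip_str(pairs):
    """Flatten (start, end) pairs into Whisper's clip_timestamps string form."""
    return ",".join(str(t) for pair in pairs for t in pair)

print(pairs_to_clip_str([(30, 60), (120, 180)]))  # 30,60,120,180
```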

Disable Context Conditioning

# More robust to repetition loops, but less consistent across segments
result = model.transcribe(
    "audio.mp3",
    condition_on_previous_text=False
)

Lower-Level Access

For more control, use detect_language() and decode() functions:

Language Detection

import whisper

model = whisper.load_model("turbo")

# Load and preprocess audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Generate Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
print(f"Probability: {probs[detected_language]:.2%}")
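
Since probs is a plain dict mapping language codes to probabilities, you can rank candidates instead of taking only the argmax. A sketch with a synthetic dict standing in for real detect_language() output (the helper name is ours):

```python
def top_languages(probs, k=3):
    """Return the k most likely (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

probs = {"en": 0.72, "de": 0.15, "fr": 0.08, "ja": 0.05}
for lang, p in top_languages(probs, k=2):
    print(f"{lang}: {p:.2%}")
```

Inspecting the runner-up probabilities is useful when the top candidate is only marginally ahead, e.g. for closely related languages.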

Manual Decoding

import whisper
from whisper.decoding import DecodingOptions, decode

model = whisper.load_model("turbo")

# Load and preprocess audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Generate Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Configure decoding options
options = DecodingOptions(
    language="en",
    task="transcribe",
    temperature=0.0,
    fp16=True
)

# Decode the audio
result = decode(model, mel, options)
print(result.text)
The transcribe() function internally handles sliding windows and processes the entire audio file. The decode() function only processes a single 30-second segment.

Model Loading Options

Load from Specific Directory

model = whisper.load_model("turbo", download_root="/path/to/models")
By default, models are cached in ~/.cache/whisper/.

Device Selection

import torch

# Load on GPU
model = whisper.load_model("turbo", device="cuda")

# Load on CPU
model = whisper.load_model("turbo", device="cpu")

# Auto-select
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("turbo", device=device)

Batch Processing

import whisper
from pathlib import Path

model = whisper.load_model("turbo")
audio_dir = Path("./audio_files")

for audio_file in audio_dir.glob("*.mp3"):
    print(f"Transcribing {audio_file.name}...")
    result = model.transcribe(str(audio_file))
    
    # Save transcription
    output_file = audio_dir / f"{audio_file.stem}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])
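
Segment timestamps also map directly onto subtitle formats. A minimal SRT writer built on the segment fields shown earlier (the helper names are ours, not part of the Whisper API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render Whisper segments as an SRT subtitle document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Synthetic segment standing in for result["segments"]:
segments = [{"start": 0.0, "end": 2.5, "text": " Hello there."}]
print(to_srt(segments))
```

Write the returned string to a .srt file alongside the .txt output if you need player-compatible subtitles.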

Audio Preprocessing

Whisper includes utility functions for audio processing:
import whisper
import numpy as np

# Load audio file (resampled to 16kHz)
audio = whisper.load_audio("audio.mp3")

# Pad or trim to 30 seconds
audio_30s = whisper.pad_or_trim(audio)

# Generate Mel spectrogram (n_mels=80 suits most models; large-v3 and turbo expect 128,
# so prefer n_mels=model.dims.n_mels when a model is loaded)
mel = whisper.log_mel_spectrogram(audio, n_mels=80)

print(f"Audio shape: {audio.shape}")  # (samples,)
print(f"Mel shape: {mel.shape}")      # (80, frames)
Audio must be resampled to 16kHz mono. The whisper.load_audio() function handles this automatically using FFmpeg.
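
The shapes printed above follow from Whisper's fixed constants: 16,000 samples per second, a 30-second input window, and a 160-sample hop between mel frames, which works out to 3,000 frames per window (one frame every 10 ms). A quick check of that arithmetic:

```python
SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz mono
CHUNK_SECONDS = 30     # each model input covers a 30-second window
HOP_LENGTH = 160       # samples between successive mel frames

n_samples = SAMPLE_RATE * CHUNK_SECONDS
n_frames = n_samples // HOP_LENGTH

print(n_samples)  # 480000 samples per 30-second chunk
print(n_frames)   # 3000 mel frames per chunk
```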
