This guide will help you transcribe your first audio file using Whisper’s command-line interface and Python API.

Command-Line Usage

Basic Transcription

The simplest way to use Whisper is via the command line, which uses the turbo model by default for a good balance of speed and accuracy:
whisper audio.mp3
You can transcribe multiple files at once:
whisper audio.flac audio.mp3 audio.wav --model turbo
Whisper supports various audio formats including MP3, WAV, FLAC, M4A, and more through FFmpeg.

Transcribing Non-English Audio

To transcribe audio in a specific language, use the --language parameter:
whisper japanese.wav --language Japanese

Translating to English

Whisper can translate speech from any supported language directly into English:
whisper japanese.wav --model medium --language Japanese --task translate
The turbo model is not trained for translation tasks. Use medium or large models for the best translation results.

Common CLI Options

Select a checkpoint with --model; other frequently used flags include --language, --task, --output_format (txt, vtt, srt, tsv, json, or all), and --output_dir:
whisper audio.mp3 --model medium
View all available options:
whisper --help

Python API

Basic Transcription

Transcribe audio files programmatically using Whisper’s Python API:
import whisper

# Load the model (downloads on first use)
model = whisper.load_model("turbo")

# Transcribe audio
result = model.transcribe("audio.mp3")

# Print the transcribed text
print(result["text"])
The transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
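As a back-of-envelope illustration of that 30-second windowing (this is not a whisper API call, just arithmetic), the number of windows processed for a clip is roughly:

```python
import math

def num_windows(duration_seconds: float, window_seconds: float = 30.0) -> int:
    """Rough count of 30-second windows Whisper will process for a clip."""
    return max(1, math.ceil(duration_seconds / window_seconds))

# A 95-second clip is covered by 4 windows; anything up to 30s needs just 1.
```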

Choosing a Model

Select different models based on your speed and accuracy requirements:
import whisper

# Fast and efficient (recommended)
model = whisper.load_model("turbo")

# Maximum accuracy (requires more VRAM)
model = whisper.load_model("large")

# Minimal resource usage
model = whisper.load_model("tiny")

# English-only (better accuracy for English)
model = whisper.load_model("base.en")

Checking Available Models

import whisper

models = whisper.available_models()
print(models)
# ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 
#  'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 
#  'large', 'large-v3-turbo', 'turbo']

Advanced Usage: Language Detection and Decoding

For lower-level access to the model, use detect_language() and decode():
import whisper

model = whisper.load_model("turbo")

# Load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")

# Decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# Print the recognized text
print(result.text)

Device Selection

By default, Whisper uses CUDA if available, otherwise CPU. You can specify the device explicitly:
import whisper
import torch

# Explicitly use GPU
model = whisper.load_model("turbo", device="cuda")

# Explicitly use CPU
model = whisper.load_model("turbo", device="cpu")

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    model = whisper.load_model("turbo", device="cuda")
else:
    print("Using CPU")
    model = whisper.load_model("turbo", device="cpu")
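Related to device choice: Whisper runs in FP16 on GPU but falls back to FP32 on CPU, printing a warning unless you pass fp16=False to transcribe(). A small helper tying the two together — illustrative, not part of the whisper package:

```python
def select_device(cuda_available: bool):
    """Pick a device string and a matching fp16 flag for Whisper.

    Illustrative helper, not part of the whisper API.
    """
    device = "cuda" if cuda_available else "cpu"
    # FP16 inference is only supported on GPU; on CPU Whisper warns
    # and falls back to FP32 when fp16 is left at its default of True.
    return device, cuda_available

# Usage (assumes whisper and torch are installed):
# import torch, whisper
# device, fp16 = select_device(torch.cuda.is_available())
# model = whisper.load_model("turbo", device=device)
# result = model.transcribe("audio.mp3", fp16=fp16)
```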

Complete Examples

Example 1: Transcribe a Podcast Episode

import whisper

# Load the turbo model for fast transcription
model = whisper.load_model("turbo")

# Transcribe the audio
result = model.transcribe(
    "podcast.mp3",
    language="en",
    verbose=True
)

# Save transcription to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

print("Transcription saved to transcript.txt")

Example 2: Translate Foreign Language Audio

import whisper

# Use medium model for translation
model = whisper.load_model("medium")

# Translate Spanish speech to English text
result = model.transcribe(
    "spanish_audio.mp3",
    language="Spanish",
    task="translate"
)

print("English translation:")
print(result["text"])

Example 3: Batch Process Multiple Files

import whisper
import os

model = whisper.load_model("turbo")

audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in audio_files:
    print(f"Processing {audio_file}...")
    result = model.transcribe(audio_file)
    
    # Save with same name but .txt extension
    output_file = os.path.splitext(audio_file)[0] + ".txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])
    
    print(f"Saved to {output_file}")
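Instead of hardcoding filenames, you can discover audio files by extension. This helper is a sketch (the extension set mirrors the formats mentioned earlier); pair it with the loop above:

```python
from pathlib import Path

# Extensions Whisper commonly handles via FFmpeg (not an exhaustive list).
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}

def find_audio_files(directory):
    """Return audio files in a directory, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.suffix.lower() in AUDIO_EXTENSIONS
    )

# Usage with the batch loop above:
# for audio_file in find_audio_files("recordings/"):
#     result = model.transcribe(str(audio_file))
```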

Output Format

The transcribe() method returns a dictionary containing:
  • text: The full transcribed text
  • segments: List of segments with timestamps
  • language: Detected or specified language
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")

# Full text
print(result["text"])

# Access individual segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
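The segment timestamps make it easy to emit subtitle formats. Here is a minimal SRT writer, sketched from scratch for clarity (the whisper package ships its own writers in whisper.utils, which you should prefer in practice):

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render transcribe() segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Usage: open("audio.srt", "w").write(segments_to_srt(result["segments"]))
```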

Performance Tips

Choose the Right Model

Use turbo for the best balance of speed and accuracy. Use tiny or base for real-time applications.

Use GPU Acceleration

Ensure PyTorch is installed with CUDA support; on NVIDIA GPUs transcription is often an order of magnitude faster than on CPU.

Batch Processing

Load the model once and reuse it for multiple files to avoid repeated model loading overhead.

English-only Models

Use .en models (e.g., base.en) for better accuracy when transcribing English-only content.

Next Steps

Explore Advanced Features

Learn about word-level timestamps, language detection, and custom decoding options

View All Languages

Check the tokenizer.py file for the complete list of 99+ supported languages
