Overview

The whisper command-line tool provides a simple interface for transcribing and translating audio files. It supports multiple audio formats and offers extensive customization options.

Basic Usage

1. Install Whisper

pip install -U openai-whisper
You’ll also need ffmpeg installed on your system; on Debian/Ubuntu:
sudo apt update && sudo apt install ffmpeg
2. Transcribe an audio file

whisper audio.mp3
By default, this uses the turbo model and outputs all available formats (txt, vtt, srt, tsv, json).

Command Syntax

whisper [audio_files...] [options]

Multiple Files

Process multiple audio files in one command:
whisper audio.flac audio.mp3 audio.wav --model turbo

Model Selection

Choose from different model sizes to balance speed and accuracy:
whisper audio.mp3 --model medium
Available models: tiny, base, small, medium, large, turbo, or English-only variants (tiny.en, base.en, small.en, medium.en).
The default model is turbo, which offers fast transcription with good accuracy for English and multilingual content.

Language Options

Automatic Language Detection

By default, Whisper detects the language automatically:
whisper japanese.wav

Specify Language

For better performance, specify the language explicitly:
whisper japanese.wav --language Japanese
You can use either the language name (e.g., Japanese, Spanish) or language code (e.g., ja, es).

Translation to English

Translate non-English speech directly to English:
whisper japanese.wav --model medium --language Japanese --task translate
The turbo model does not support translation. Use multilingual models (tiny, base, small, medium, large) for translation tasks.

Output Options

Output Directory

Specify where to save the transcription files:
whisper audio.mp3 --output_dir ./transcripts

Output Format

Choose specific output formats:
whisper audio.mp3 --output_format srt
Available formats:
  • txt - Plain text
  • vtt - WebVTT subtitles
  • srt - SubRip subtitles
  • tsv - Tab-separated values with timestamps
  • json - JSON with detailed segment information
  • all - Generate all formats (default)
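The json output is the richest of these: it contains the full transcript plus per-segment timing. The sketch below reads that shape in Python; the field names (text, language, segments with start/end/text) match what whisper emits, but the sample values are illustrative only.

```python
import json

# Sample shaped like whisper's JSON output: a top-level "text" field,
# a detected "language", and a list of timestamped "segments".
sample = json.loads("""
{
  "text": " Hello world. This is a test.",
  "language": "en",
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.5, "text": " Hello world."},
    {"id": 1, "start": 2.5, "end": 4.8, "text": " This is a test."}
  ]
}
""")

# Print each segment with its time span.
for seg in sample["segments"]:
    print(f'[{seg["start"]:6.2f} -> {seg["end"]:6.2f}] {seg["text"].strip()}')
```

In practice you would load the file whisper wrote (e.g. audio.json in the output directory) instead of the inline sample.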

Advanced Options

Word-Level Timestamps

Extract word-level timestamps for precise timing:
whisper audio.mp3 --word_timestamps True
This enables additional subtitle formatting options:
whisper audio.mp3 --word_timestamps True --max_line_width 50 --highlight_words True
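With word timestamps enabled, each segment in the JSON output also carries a words list giving per-word start and end times. A minimal sketch of working with that shape; the sample values are illustrative, and words_in_window is a helper invented here, not part of Whisper:

```python
# Segment shaped like whisper's JSON when --word_timestamps True is set:
# the segment gains a "words" list with per-word timings.
segment = {
    "start": 0.0,
    "end": 2.4,
    "text": " Hello there world.",
    "words": [
        {"word": " Hello", "start": 0.0, "end": 0.6},
        {"word": " there", "start": 0.6, "end": 1.1},
        {"word": " world.", "start": 1.1, "end": 2.4},
    ],
}

def words_in_window(seg, t0, t1):
    """Return words whose midpoint falls inside [t0, t1)."""
    return [w["word"].strip() for w in seg["words"]
            if t0 <= (w["start"] + w["end"]) / 2 < t1]

print(words_in_window(segment, 0.0, 1.0))  # words centred in the first second
```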

Device Selection

Choose between CPU and GPU processing:
whisper audio.mp3 --device cuda  # Use GPU
whisper audio.mp3 --device cpu   # Use CPU

Initial Prompt

Provide context or custom vocabulary to improve accuracy:
whisper audio.mp3 --initial_prompt "This is a technical discussion about machine learning and neural networks."

Temperature and Sampling

Set --temperature 0 for deterministic decoding; combined with --beam_size, Whisper uses beam search instead of sampling:
whisper audio.mp3 --temperature 0 --beam_size 5

Compression and Quality Thresholds

whisper audio.mp3 \
  --compression_ratio_threshold 2.4 \
  --logprob_threshold -1.0 \
  --no_speech_threshold 0.6
  • --compression_ratio_threshold: Detect and retry overly repetitive outputs (default: 2.4)
  • --logprob_threshold: Retry if average log probability is too low (default: -1.0)
  • --no_speech_threshold: Detect silent segments (default: 0.6)
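The compression ratio is easy to reproduce yourself: it is roughly the size of the text divided by its zlib-compressed size, so a degenerate, looping transcription scores high. A minimal sketch (the exact computation inside Whisper may differ in detail):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size. Repetitive text compresses
    very well, so a high ratio suggests a looping transcription."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "The quick brown fox jumps over the lazy dog near the river bank."
looping = "the the the the " * 16

print(compression_ratio(normal))   # modest ratio for ordinary prose
print(compression_ratio(looping))  # well above the 2.4 default threshold
```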

Common Examples

Transcribe with High Accuracy

whisper interview.mp3 --model large --language English

Generate SRT Subtitles

whisper video.mp4 --model medium --output_format srt --word_timestamps True
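SRT files use HH:MM:SS,mmm timestamps. If you post-process segment times yourself, a small helper like the one below produces that format; it is written here for illustration and is not part of Whisper:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm timestamp SRT uses."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(125.345))  # 00:02:05,345
```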

Process Specific Audio Clips

whisper podcast.mp3 --clip_timestamps "0,300,600,900" --output_dir ./segments
This processes clips from 0-300s and 600-900s.
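The value of --clip_timestamps is a flat comma-separated start,end,start,end,... list of seconds, where a missing final end means "to the end of the file". A small sketch of how those values pair up; parse_clips is a helper written here for illustration:

```python
def parse_clips(spec: str):
    """Pair a comma-separated start,end,start,end,... list of seconds into
    (start, end) clips. A missing final end means 'until end of file',
    represented as None here."""
    parts = [float(p) for p in spec.split(",") if p]
    clips = []
    for i in range(0, len(parts), 2):
        start = parts[i]
        end = parts[i + 1] if i + 1 < len(parts) else None
        clips.append((start, end))
    return clips

print(parse_clips("0,300,600,900"))  # [(0.0, 300.0), (600.0, 900.0)]
```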

Batch Processing with Consistent Settings

whisper *.wav --model turbo --language English --output_dir ./output --output_format json

Full Options Reference

View all available options:
whisper --help
For large files or batch processing, consider using a GPU with --device cuda to significantly speed up transcription.
