Overview
The whisper command-line tool transcribes and translates audio files. It accepts multiple audio formats and exposes options for model choice, language, output format, and decoding behavior.
Basic Usage
Install Whisper
pip install -U openai-whisper
You’ll also need ffmpeg installed on your system:
Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
macOS
brew install ffmpeg
Windows (Chocolatey)
choco install ffmpeg
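Before calling Whisper from a script, it can help to confirm that ffmpeg is actually reachable on your PATH. The following is a minimal sketch using only the Python standard library; `tool_available` is a hypothetical helper name, not part of Whisper.

```python
import shutil


def tool_available(name: str = "ffmpeg") -> bool:
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


if tool_available():
    print("ffmpeg found")
else:
    print("ffmpeg not found; install it before running whisper")
```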
Transcribe an audio file
whisper audio.mp3
By default, this uses the turbo model and outputs all available formats (txt, vtt, srt, tsv, json).
Command Syntax
whisper [audio_files...] [options]
Multiple Files
Process multiple audio files in one command:
whisper audio.flac audio.mp3 audio.wav --model turbo
Model Selection
Choose from different model sizes to balance speed and accuracy:
whisper audio.mp3 --model medium
Available models: tiny, base, small, medium, large, turbo, or English-only variants (tiny.en, base.en, small.en, medium.en).
The default model is turbo, which offers fast transcription with good accuracy for English and multilingual content.
Language Options
Automatic Language Detection
By default, Whisper detects the language automatically:
whisper audio.mp3
Specify Language
To skip the detection step and avoid a misidentified language, specify the language explicitly:
whisper japanese.wav --language Japanese
You can use either the language name (e.g., Japanese, Spanish) or language code (e.g., ja, es).
Translation to English
Translate non-English speech directly to English:
whisper japanese.wav --model medium --language Japanese --task translate
The turbo model does not support translation. Use multilingual models (tiny, base, small, medium, large) for translation tasks.
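A wrapper script can guard against this mistake before launching a long job. The sketch below simply encodes the model list from the note above; `TRANSLATION_MODELS` and `supports_translation` are hypothetical names, and any other checkpoint names you might use are not covered.

```python
# Models the note above lists as suitable for translation.
TRANSLATION_MODELS = {"tiny", "base", "small", "medium", "large"}


def supports_translation(model: str) -> bool:
    """True if `model` is one of the models recommended for --task translate."""
    return model in TRANSLATION_MODELS


for name in ("medium", "turbo", "small.en"):
    verdict = "ok" if supports_translation(name) else "not suitable"
    print(f"{name}: {verdict} for --task translate")
```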
Output Options
Output Directory
Specify where to save the transcription files:
whisper audio.mp3 --output_dir ./transcripts
Output Format
Choose a specific output format:
whisper audio.mp3 --output_format srt
Available formats:
txt - Plain text
vtt - WebVTT subtitles
srt - SubRip subtitles
tsv - Tab-separated values with timestamps
json - JSON with detailed segment information
all - Generate all formats (default)
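The JSON output is usually the easiest format to post-process. As a rough sketch, assuming the file has a top-level `segments` list whose entries carry `start`, `end`, and `text` (the sample below is hand-written and trimmed, not real Whisper output, and `segment_lines` is a hypothetical helper):

```python
import json
from typing import List

# Hand-written, trimmed stand-in for an output file such as audio.json.
SAMPLE = """
{"text": " Hello world. How are you?", "language": "en",
 "segments": [
   {"id": 0, "start": 0.0, "end": 1.5, "text": " Hello world."},
   {"id": 1, "start": 1.5, "end": 3.0, "text": " How are you?"}
 ]}
"""


def segment_lines(doc: dict) -> List[str]:
    """Format each segment as '[start -> end] text'."""
    return [
        f"[{s['start']:.2f} -> {s['end']:.2f}] {s['text'].strip()}"
        for s in doc["segments"]
    ]


for line in segment_lines(json.loads(SAMPLE)):
    print(line)
```

To use it on a real file, replace `json.loads(SAMPLE)` with `json.load(open(path))` for a path inside your output directory.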
Advanced Options
Word-Level Timestamps
Extract word-level timestamps for precise timing:
whisper audio.mp3 --word_timestamps True
This enables additional subtitle formatting options:
whisper audio.mp3 --word_timestamps True --max_line_width 50 --highlight_words True
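With word timestamps enabled, each segment in the JSON output should also carry a `words` list. The sketch below assumes entries with `word`, `start`, `end`, and `probability` keys; the sample data is hand-written for illustration, and `confident_words` is a hypothetical helper.

```python
from typing import Dict, List

# Hand-written sample shaped like JSON output produced with --word_timestamps True.
SAMPLE_WORDS = {
    "segments": [
        {
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
                {"word": " world.", "start": 0.4, "end": 0.9, "probability": 0.41},
            ],
        }
    ]
}


def confident_words(doc: Dict, min_probability: float = 0.5) -> List[Dict]:
    """Flatten all segments and keep words at or above `min_probability`."""
    return [
        w
        for seg in doc["segments"]
        for w in seg.get("words", [])
        if w["probability"] >= min_probability
    ]


for w in confident_words(SAMPLE_WORDS):
    print(f"{w['start']:.2f}-{w['end']:.2f} {w['word'].strip()}")
```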
Device Selection
Choose between CPU and GPU processing:
whisper audio.mp3 --device cuda # Use GPU
whisper audio.mp3 --device cpu # Use CPU
Initial Prompt
Provide context or custom vocabulary to improve accuracy:
whisper audio.mp3 --initial_prompt "This is a technical discussion about machine learning and neural networks."
Temperature and Sampling
Use temperature 0 for deterministic output:
whisper audio.mp3 --temperature 0 --beam_size 5
Use non-zero temperature for sampling:
whisper audio.mp3 --temperature 0.8 --best_of 5
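Each temperature is paired with a different search option for a reason: in the reference implementation, as far as we can tell, `--beam_size` only takes effect at temperature 0 and `--best_of` only above 0. A minimal sketch of that selection, assuming this behavior (`decoding_options` is a hypothetical helper, not a Whisper function):

```python
from typing import Dict


def decoding_options(temperature: float, beam_size: int = 5, best_of: int = 5) -> Dict:
    """Pick the search option that applies at the given temperature."""
    options: Dict = {"temperature": temperature}
    if temperature > 0:
        # Sampling: draw several candidates and keep the best one.
        options["best_of"] = best_of
    else:
        # No sampling: explore alternatives with beam search instead.
        options["beam_size"] = beam_size
    return options


print(decoding_options(0.0))
print(decoding_options(0.8))
```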
Compression and Quality Thresholds
whisper audio.mp3 \
--compression_ratio_threshold 2.4 \
--logprob_threshold -1.0 \
--no_speech_threshold 0.6
--compression_ratio_threshold: Detect and retry overly repetitive outputs (default: 2.4)
--logprob_threshold: Retry if average log probability is too low (default: -1.0)
--no_speech_threshold: Treat a segment as silent when its no-speech probability exceeds this value and the log-probability check also fails (default: 0.6)
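To make the three checks concrete, here is a simplified sketch of how they might be applied to one decoded segment. It assumes the compression ratio is the UTF-8 length of the text divided by its zlib-compressed length; `needs_retry` and `is_silent` are hypothetical helpers that approximate the behavior rather than reproduce Whisper's code.

```python
import zlib
from typing import Optional


def compression_ratio(text: str) -> float:
    """Raw size over compressed size; repetitive text yields a high ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def needs_retry(
    text: str,
    avg_logprob: float,
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """True if the segment looks too repetitive or too unlikely."""
    if (
        compression_ratio_threshold is not None
        and compression_ratio(text) > compression_ratio_threshold
    ):
        return True
    if logprob_threshold is not None and avg_logprob < logprob_threshold:
        return True
    return False


def is_silent(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """True if no-speech probability is high and the decode was not confident."""
    if no_speech_threshold is None or no_speech_prob <= no_speech_threshold:
        return False
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        return False  # a confident decode overrides the no-speech signal
    return True


print(needs_retry("la " * 200, avg_logprob=-0.2))  # repetitive text
print(is_silent(no_speech_prob=0.9, avg_logprob=-1.5))
```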
Common Examples
Transcribe with High Accuracy
whisper interview.mp3 --model large --language English
Generate SRT Subtitles
whisper video.mp4 --model medium --output_format srt --word_timestamps True
Process Specific Audio Clips
whisper podcast.mp3 --clip_timestamps "0,300,600,900" --output_dir ./segments
This processes clips from 0-300s and 600-900s.
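The timestamps are read as alternating start and end values in seconds. A minimal sketch of that pairing, assuming an odd number of values means the last clip runs to the end of the file (`parse_clips` and its `duration` parameter are hypothetical, for illustration only):

```python
from typing import List, Tuple


def parse_clips(spec: str, duration: float) -> List[Tuple[float, float]]:
    """Turn 'start,end,start,end,...' into (start, end) pairs in seconds."""
    points = [float(value) for value in spec.split(",")] if spec else [0.0]
    if len(points) % 2 == 1:
        # A trailing start without an end runs to the end of the audio.
        points.append(duration)
    return list(zip(points[::2], points[1::2]))


print(parse_clips("0,300,600,900", duration=1200.0))
print(parse_clips("0,300,600", duration=1200.0))
```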
Batch Processing with Consistent Settings
whisper *.wav --model turbo --language English --output_dir ./output --output_format json
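Some shells, such as cmd.exe on Windows, do not expand wildcard patterns themselves. As a minimal sketch, a Python script can expand the pattern and build the same command; `batch_command` and `run_batch` are hypothetical helpers, and the flags are the ones from the example above.

```python
import glob
import subprocess
from typing import List


def batch_command(pattern: str, output_dir: str = "./output") -> List[str]:
    """Expand `pattern` and build one whisper invocation for all matches."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return [
        "whisper", *files,
        "--model", "turbo",
        "--language", "English",
        "--output_dir", output_dir,
        "--output_format", "json",
    ]


def run_batch(pattern: str) -> None:
    """Run the batch; check=True raises if whisper exits with an error."""
    subprocess.run(batch_command(pattern), check=True)
```

Calling `run_batch("*.wav")` from the directory that holds the recordings would then mirror the shell command.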
Full Options Reference
View all available options:
whisper --help
For large files or batch processing, consider using a GPU with --device cuda to significantly speed up transcription.