This page covers common issues you may encounter when installing or using Whisper, along with their solutions.

Installation Issues

Whisper requires the ffmpeg command-line tool to be installed on your system.

Solution: Install ffmpeg using your system’s package manager:
# Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# Arch Linux
sudo pacman -S ffmpeg

# macOS using Homebrew
brew install ffmpeg

# Windows using Chocolatey
choco install ffmpeg

# Windows using Scoop
scoop install ffmpeg
After installation, verify ffmpeg is accessible:
ffmpeg -version
Whisper depends on tiktoken for fast tokenization. If tiktoken doesn’t provide a pre-built wheel for your platform, you may need Rust installed.

Symptoms:
  • Installation errors during pip install
  • Messages about missing Rust compiler
Solution:
  1. Install Rust by following the Rust Getting Started guide
  2. Configure your PATH environment variable:
    export PATH="$HOME/.cargo/bin:$PATH"
    
  3. If you see No module named 'setuptools_rust', install it:
    pip install setuptools-rust
    
  4. Retry the Whisper installation:
    pip install -U openai-whisper
    
This error occurs when tiktoken needs to be built from source but setuptools_rust is not installed.

Solution:
pip install setuptools-rust
pip install -U openai-whisper

Runtime Issues

This occurs when the selected model requires more VRAM than your GPU has available.

VRAM Requirements:
  • tiny, base: ~1 GB
  • small: ~2 GB
  • medium: ~5 GB
  • turbo: ~6 GB
  • large: ~10 GB
Solutions:
  1. Use a smaller model:
    # Instead of:
    model = whisper.load_model("large")
    
    # Try:
    model = whisper.load_model("small")
    
  2. Use CPU instead of GPU:
    model = whisper.load_model("medium", device="cpu")
    
    Note: CPU inference will be significantly slower.
  3. Close other GPU-intensive applications to free up VRAM
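As a rough guide, the VRAM table above can be turned into a small helper that picks the largest model fitting your card. `pick_model` and its figures are illustrative, not part of the Whisper API:

```python
# Approximate VRAM needs (GB) from the table above, smallest to largest.
VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("turbo", 6), ("large", 10)]

def pick_model(free_gb: float) -> str:
    """Return the largest model whose approximate requirement fits in free_gb."""
    choice = "tiny"  # fall back to the smallest model
    for name, need in VRAM_GB:
        if need <= free_gb:
            choice = name
    return choice

# On a CUDA machine, total capacity can be queried with:
#   torch.cuda.get_device_properties(0).total_memory / 1e9
print(pick_model(4))  # a 4 GB card comfortably fits "small"
```

Note this checks total rather than free memory; other processes using the GPU reduce what is actually available.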
Model weights are downloaded from the internet on first use.

Solutions:
  1. Check your internet connection
  2. Use a different download location if your home directory has limited space:
    model = whisper.load_model("medium", download_root="/path/to/custom/location")
    
  3. Set XDG_CACHE_HOME to relocate the default cache directory (Whisper stores models under $XDG_CACHE_HOME/whisper when the variable is set):
    export XDG_CACHE_HOME="/path/to/cache"
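The resolution order can be sketched in a few lines; `default_download_root` is a hypothetical helper mirroring the behavior described above (XDG_CACHE_HOME if set, otherwise ~/.cache, plus a whisper subdirectory):

```python
import os

def default_download_root() -> str:
    """Where Whisper caches model weights when download_root is not given."""
    cache = os.getenv("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(cache, "whisper")

print(default_download_root())
```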
    
Whisper uses ffmpeg to handle audio files. Most formats are supported, but some may cause issues.

Solution: Convert your audio file to a widely supported format like WAV or MP3:
ffmpeg -i input.audio output.wav
Then transcribe the converted file:
whisper output.wav
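The same conversion can be scripted from Python by shelling out to ffmpeg. `to_wav` is a hypothetical helper; the `-ar 16000 -ac 1` flags produce the 16 kHz mono audio Whisper resamples to internally anyway:

```python
import shutil
import subprocess

def to_wav(src: str, dst: str = "output.wav") -> str:
    """Convert any ffmpeg-readable audio file to 16 kHz mono WAV."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True,
        capture_output=True,
    )
    return dst
```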
This can happen with audio files that contain no speech or very low-quality audio.

Solutions:
  1. Verify audio file contains audible speech:
    ffplay your-audio.mp3
    
  2. Check audio levels - audio may be too quiet
  3. Try a larger model, which may be more robust to poor-quality audio
  4. Specify the language explicitly:
    whisper audio.mp3 --language English
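For step 2, a quick loudness check on a 16-bit PCM WAV needs only the standard library. `rms_dbfs` is an illustrative helper, and the -40 dBFS cutoff mentioned in the comment is only a rule of thumb:

```python
import math
import struct
import wave

def rms_dbfs(path: str) -> float:
    """RMS level of a 16-bit PCM WAV in dBFS (0 = full scale, -inf = silence)."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    if not samples:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms else -math.inf

# Readings far below about -40 dBFS usually indicate near-silent audio.
```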
    

Accuracy Issues

If transcriptions are inaccurate, several factors may be at play.

Solutions:
  1. Use a larger model:
    whisper audio.mp3 --model large
    
  2. Specify the language to avoid language detection errors:
    whisper audio.mp3 --language Japanese
    
  3. Check for known limitations:
    • Low-resource languages may have higher error rates
    • Background noise affects accuracy
    • Multiple speakers or crosstalk reduce quality
  4. Improve audio quality:
    • Remove background noise
    • Use higher bitrate audio
    • Ensure clear speech without overlapping speakers
The model may generate plausible-sounding text that wasn’t actually spoken.

Why it happens: Models are trained on large-scale noisy data and may combine language modeling with transcription.

Mitigation strategies:
  1. Use beam search and temperature scheduling (already enabled by default in transcribe())
  2. Use larger models which tend to hallucinate less
  3. Enable word-level timestamps to identify suspicious sections:
    result = model.transcribe("audio.mp3", word_timestamps=True)
    
  4. Be especially cautious with low-resource languages where hallucinations are more common
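Building on step 3, the word entries in the returned segments each carry a probability field, so suspicious stretches can be collected programmatically. `suspicious_words` and the 0.4 threshold are illustrative:

```python
def suspicious_words(result: dict, threshold: float = 0.4) -> list:
    """Collect (word, start, end, probability) tuples below the given probability.

    Expects the dict returned by model.transcribe(..., word_timestamps=True):
    each entry in result["segments"] carries a "words" list whose items have
    "word", "start", "end", and "probability" fields.
    """
    flagged = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            if w.get("probability", 1.0) < threshold:
                flagged.append((w["word"], w["start"], w["end"], w["probability"]))
    return flagged
```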
The sequence-to-sequence architecture can sometimes generate repetitive text.

Solutions:
  1. Adjust temperature settings:
    result = model.transcribe("audio.mp3", temperature=0.2)
    
  2. Disable conditioning on previous text via the condition_on_previous_text parameter:
    result = model.transcribe("audio.mp3", condition_on_previous_text=False)
    
  3. Try a different model size - sometimes smaller or larger models perform better on specific audio
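The first two solutions combine naturally: temperature also accepts a tuple of fallback values that transcribe() walks through when a decode looks degenerate (for example, when the text’s gzip compression ratio exceeds compression_ratio_threshold). The specific values below are illustrative:

```python
# Decoding options that curb repetition; pass them to model.transcribe().
ANTI_REPETITION_OPTS = dict(
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # retry at higher temperatures on failure
    compression_ratio_threshold=2.4,  # gzip ratio above this marks output as too repetitive
    condition_on_previous_text=False,  # don't feed earlier output back in as a prompt
)

# Usage: result = model.transcribe("audio.mp3", **ANTI_REPETITION_OPTS)
```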
The turbo model is not trained for translation tasks.

Symptoms:
  • Using --task translate with --model turbo returns original language instead of English
Solution: Use a multilingual model (medium or large) for translation:
# Don't use turbo for translation
whisper japanese.wav --model medium --language Japanese --task translate
The turbo model will return the original language even if --task translate is specified.

Platform-Specific Issues

Whisper requires Python 3.8 or newer.

Check your Python version:
python --version
Solution: If your Python version is too old, upgrade to Python 3.8, 3.9, 3.10, 3.11, or 3.12.
Whisper is tested with PyTorch 1.10.1 and later versions.

Solution: Update PyTorch to a recent version:
pip install --upgrade torch
For GPU support, follow PyTorch installation instructions for your platform.
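Both requirements can be verified in one place from Python. `check_environment` is an illustrative helper; the PyTorch check only confirms that torch imports:

```python
import sys

def check_environment(min_python=(3, 8)) -> list:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append("Python %d.%d is older than required %d.%d"
                        % (sys.version_info[0], sys.version_info[1], min_python[0], min_python[1]))
    try:
        import torch  # noqa: F401 -- only checking that PyTorch is importable
    except ImportError:
        problems.append("PyTorch is not installed (pip install torch)")
    return problems

print(check_environment())  # [] when both checks pass
```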

Getting Help

If you encounter an issue not covered here:
  1. Check existing issues on GitHub
  2. Search discussions in the repository
  3. Create a new issue with:
    • Your Python and PyTorch versions
    • Full error message and stack trace
    • Minimal code to reproduce the issue
    • Information about your system (OS, GPU if applicable)
