
Overview

Whisper can translate speech from any supported language directly into English. This differs from transcription, which converts speech to text in the same language as the audio. Translation is a single speech-to-English step: Whisper decodes English text directly from the audio rather than transcribing first and translating the transcript afterwards.

Model Requirements

The turbo model does NOT support translation. Only multilingual models can perform translation tasks.

Supported Models for Translation

  • tiny - Fastest, lower accuracy
  • base - Good balance for simple translation
  • small - Better accuracy
  • medium - Recommended for translation tasks
  • large - Highest accuracy

Models That Do NOT Support Translation

  • turbo - Returns original language even with --task translate
  • tiny.en, base.en, small.en, medium.en - English-only models

For best translation results, use the medium or large model.
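The support rules above can be captured in a small helper. This is a hypothetical convenience function of our own, based only on the naming conventions listed here (the `.en` suffix marks English-only models, and `turbo` ignores `--task translate`):

```python
def supports_translation(model_name: str) -> bool:
    """Return True if a Whisper model name can perform --task translate.

    Based on the rules above: English-only models (".en" suffix) and
    turbo cannot translate; the other multilingual models can.
    """
    if model_name.endswith(".en"):
        return False  # English-only: tiny.en, base.en, small.en, medium.en
    if model_name in ("turbo", "large-v3-turbo"):
        return False  # turbo returns the original language instead
    return True
```

You might call this before `whisper.load_model()` to fail fast with a clear error instead of silently getting untranslated output from turbo.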

CLI Usage

Basic Translation

Translate Japanese speech to English:
whisper japanese.wav --model medium --language Japanese --task translate

Translation vs Transcription

For comparison, transcription returns text in the original language:
whisper japanese.wav --language Japanese --task transcribe
Output: “こんにちは、世界!”

With --task translate, the same audio yields the English text instead: “Hello, world!”

Automatic Language Detection

You can omit the language parameter for automatic detection:
whisper foreign_audio.mp3 --model medium --task translate
Whisper will detect the language and translate to English.
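Under the hood, detection produces a probability per language, and the most likely one is chosen. A minimal sketch of that selection step, using a made-up probability dict shaped like the one `model.detect_language()` returns (the same `max(probs, key=probs.get)` pattern appears in the lower-level API section below):

```python
def pick_language(probs: dict) -> str:
    """Pick the most probable language code from a {code: probability} dict,
    as returned by Whisper's language detection."""
    return max(probs, key=probs.get)

# Illustrative probabilities (made up for this example)
probs = {"ja": 0.92, "zh": 0.05, "ko": 0.03}
print(pick_language(probs))  # ja
```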

Processing Multiple Files

whisper french.wav spanish.mp3 german.flac \
  --model medium \
  --task translate \
  --output_dir ./translations

Python API

Basic Translation

import whisper

# Load a multilingual model (NOT turbo)
model = whisper.load_model("medium")

# Translate to English
result = model.transcribe("japanese.wav", task="translate")
print(result["text"])

Specify Source Language

result = model.transcribe(
    "spanish.mp3",
    language="es",
    task="translate"
)
print(result["text"])  # English translation

Get Segment-Level Translations

result = model.transcribe("french.wav", task="translate")

for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")

Advanced Translation Techniques

With Initial Prompt for Context

Provide context to improve translation quality:
result = model.transcribe(
    "technical_presentation.mp3",
    task="translate",
    initial_prompt="This presentation is about artificial intelligence and machine learning."
)

Compare Original and Translation

import whisper

model = whisper.load_model("medium")
audio_file = "japanese.wav"

# Get original transcription
original = model.transcribe(audio_file, language="ja", task="transcribe")
print("Original:", original["text"])

# Get English translation
translation = model.transcribe(audio_file, language="ja", task="translate")
print("Translation:", translation["text"])

Batch Translation

import whisper
from pathlib import Path

model = whisper.load_model("medium")
audio_dir = Path("./foreign_audio")

for audio_file in audio_dir.glob("*.mp3"):
    print(f"\nTranslating {audio_file.name}...")
    
    result = model.transcribe(
        str(audio_file),
        task="translate"
    )
    
    # Detected language
    print(f"Detected language: {result['language']}")
    
    # English translation
    print(f"Translation: {result['text']}")
    
    # Save translation
    output_file = audio_dir / f"{audio_file.stem}_en.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])

Lower-Level Translation API

Use DecodingOptions for fine-grained control:
import whisper
from whisper.decoding import DecodingOptions, decode

model = whisper.load_model("medium")

# Load and preprocess audio
audio = whisper.load_audio("japanese.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect language
_, probs = model.detect_language(mel)
language = max(probs, key=probs.get)
print(f"Detected: {language} ({probs[language]:.2%})")

# Configure translation
options = DecodingOptions(
    language=language,
    task="translate",  # Key parameter for translation
    temperature=0.0,
    fp16=True
)

# Decode 30-second segment
result = decode(model, mel, options)
print(f"Translation: {result.text}")

Translation with Word Timestamps

Word-level timestamps on translations may not be reliable, as the timing is based on the source language but the text is in English.
result = model.transcribe(
    "french.mp3",
    task="translate",
    word_timestamps=True
)

for segment in result["segments"]:
    print(f"\nSegment: {segment['text']}")
    for word in segment.get("words", []):
        print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
Use these timestamps with caution for subtitle generation: they mark when the words were spoken in the source audio, while the text is the English translation, so word-level alignment is only approximate.

Common Use Cases

Subtitle Generation for Foreign Films

whisper foreign_film.mp4 \
  --model medium \
  --task translate \
  --output_format srt \
  --output_dir ./subtitles
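The CLI writes the .srt file for you via --output_format srt. If you instead have Python-API results and want to build SRT text yourself, a minimal sketch over the segment dicts shown earlier (the formatting helpers are our own, not part of the whisper package):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Build SRT subtitle text from Whisper-style segment dicts
    (each with "start", "end", and "text" keys)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Usage: `open("film_en.srt", "w", encoding="utf-8").write(segments_to_srt(result["segments"]))`.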

Real-Time Translation API

import whisper
import numpy as np

model = whisper.load_model("medium")

def translate_audio_chunk(audio_data: np.ndarray) -> str:
    """
    Translate a chunk of audio data to English.
    
    Args:
        audio_data: Audio data at 16kHz sample rate
    
    Returns:
        English translation
    """
    # Ensure audio is 30 seconds or less
    audio = whisper.pad_or_trim(audio_data)
    
    # Generate Mel spectrogram
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
    mel = mel.to(model.device)
    
    # Detect language
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)
    
    # Translate
    from whisper.decoding import DecodingOptions, decode
    options = DecodingOptions(language=language, task="translate")
    result = decode(model, mel, options)
    
    return result.text
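The helper above expects at most 30 seconds of audio. Splitting a longer recording into consecutive 30-second windows at Whisper's fixed 16 kHz sample rate can be sketched like this (the constants come from Whisper's decoding window; the splitting function is our own):

```python
import numpy as np

SAMPLE_RATE = 16_000               # Whisper expects 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # 30-second decoding window

def split_into_chunks(audio: np.ndarray) -> list:
    """Split a 1-D audio array into consecutive 30-second chunks.
    The final chunk may be shorter; pad_or_trim pads it later."""
    return [audio[i:i + CHUNK_SAMPLES]
            for i in range(0, len(audio), CHUNK_SAMPLES)]
```

Usage: `texts = [translate_audio_chunk(c) for c in split_into_chunks(audio)]`. Note that naive fixed-size chunking can cut a sentence in half at a boundary; `model.transcribe()` handles long files with smarter windowing, so prefer it when you do not need streaming.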

Multilingual Meeting Transcription

import whisper

model = whisper.load_model("medium")

def transcribe_meeting(audio_file: str) -> dict:
    """
    Transcribe a multilingual meeting with both original and translated text.
    """
    # Get original transcription with language detection
    original = model.transcribe(audio_file, task="transcribe")
    
    # Get English translation
    translation = model.transcribe(audio_file, task="translate")
    
    return {
        "language": original["language"],
        "original": original["segments"],
        "translation": translation["segments"]
    }

# Usage
result = transcribe_meeting("multilingual_meeting.mp3")
print(f"Detected language: {result['language']}")

for orig, trans in zip(result["original"], result["translation"]):
    print(f"\n[{orig['start']:.1f}s - {orig['end']:.1f}s]")
    print(f"Original: {orig['text']}")
    print(f"English: {trans['text']}")
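The zip above assumes both passes produce the same number of segments, but the transcribe and translate passes can segment the audio differently, in which case the pairs drift. A hedged alternative is to pair segments by time overlap instead; this helper is our own and operates on the segment dicts shown above:

```python
def overlap(a: dict, b: dict) -> float:
    """Seconds of temporal overlap between two segment dicts."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))

def pair_by_overlap(original: list, translation: list) -> list:
    """Pair each original segment with the translated segment
    it overlaps most in time."""
    pairs = []
    for orig in original:
        best = max(translation, key=lambda t: overlap(orig, t), default=None)
        if best is not None and overlap(orig, best) > 0:
            pairs.append((orig, best))
    return pairs
```

Usage: `for orig, trans in pair_by_overlap(result["original"], result["translation"]): ...` — the pairing degrades gracefully when segment counts differ, at the cost of possibly mapping two original segments to the same translated one.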

Supported Languages

Whisper can translate from any of its 99 supported languages to English. Some examples:
  • Spanish (es) → English
  • French (fr) → English
  • Japanese (ja) → English
  • German (de) → English
  • Chinese (zh) → English
  • Russian (ru) → English
  • Korean (ko) → English
  • Arabic (ar) → English
For a complete list of supported languages, see the language documentation.
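The CLI accepts either the full language name (`--language Japanese`) or the ISO code (`--language ja`). A small normalizer over the examples listed above; this is an illustrative subset of our own, not the full 99-language table from the whisper package:

```python
# Illustrative subset of Whisper's language table, matching the examples above.
LANGUAGES = {
    "es": "spanish", "fr": "french", "ja": "japanese", "de": "german",
    "zh": "chinese", "ru": "russian", "ko": "korean", "ar": "arabic",
}
NAME_TO_CODE = {name: code for code, name in LANGUAGES.items()}

def to_language_code(value: str) -> str:
    """Normalize 'Japanese' or 'ja' (any case) to the ISO code 'ja'."""
    v = value.strip().lower()
    if v in LANGUAGES:
        return v
    if v in NAME_TO_CODE:
        return NAME_TO_CODE[v]
    raise ValueError(f"Unknown language: {value!r}")
```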
