Overview
Whisper can translate speech from any supported language directly into English. This is different from transcription, which converts speech to text in the same language.
Translation is a speech-to-English task: Whisper produces English text directly from the audio in a single pass, rather than transcribing first and then translating the transcript.
Model Requirements
The turbo model does NOT support translation. Only multilingual models can perform translation tasks.
Supported Models for Translation
tiny - Fastest, lower accuracy
base - Good balance for simple translation
small - Better accuracy
medium - Recommended for translation tasks
large - Highest accuracy
Models That Do NOT Support Translation
turbo - Returns original language even with --task translate
tiny.en, base.en, small.en, medium.en - English-only models
For best translation results, use the medium or large model.
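As a quick guard in scripts, the rules above can be encoded directly. This is only a sketch based on the short CLI model aliases listed above (a loaded model also exposes an `is_multilingual` property you can check at runtime):

```python
# Sketch: decide from the model name whether --task translate will work,
# mirroring the tables above.
NON_TRANSLATING = {"turbo"}  # turbo is multilingual but does not translate

def supports_translation(model_name: str) -> bool:
    """English-only (*.en) models and turbo cannot translate."""
    return not model_name.endswith(".en") and model_name not in NON_TRANSLATING

for name in ["medium", "large", "small.en", "turbo"]:
    print(f"{name}: {'ok' if supports_translation(name) else 'no translation'}")
```

Note that the name set here is a simplification; checking `model.is_multilingual` after `whisper.load_model(...)` is the authoritative runtime test for multilingual capability.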
CLI Usage
Basic Translation
Translate Japanese speech to English:
whisper japanese.wav --model medium --language Japanese --task translate
Translation vs Transcription
Transcription
Returns text in the original language:
whisper japanese.wav --language Japanese --task transcribe
Output: “こんにちは、世界!”
Translation
Returns the English translation:
whisper japanese.wav --model medium --language Japanese --task translate
Output: “Hello, world!”
Automatic Language Detection
You can omit the language parameter for automatic detection:
whisper foreign_audio.mp3 --model medium --task translate
Whisper will detect the language and translate to English.
Processing Multiple Files
whisper french.wav spanish.mp3 german.flac \
--model medium \
--task translate \
--output_dir ./translations
Python API
Basic Translation
import whisper
# Load a multilingual model (NOT turbo)
model = whisper.load_model("medium")
# Translate to English
result = model.transcribe("japanese.wav", task="translate")
print(result["text"])
Specify Source Language
result = model.transcribe(
    "spanish.mp3",
    language="es",
    task="translate"
)
print(result["text"]) # English translation
Get Segment-Level Translations
result = model.transcribe("french.wav", task="translate")
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
Advanced Translation Techniques
With Initial Prompt for Context
Provide context to improve translation quality:
result = model.transcribe(
    "technical_presentation.mp3",
    task="translate",
    initial_prompt="This presentation is about artificial intelligence and machine learning."
)
Compare Original and Translation
import whisper
model = whisper.load_model("medium")
audio_file = "japanese.wav"
# Get original transcription
original = model.transcribe(audio_file, language="ja", task="transcribe")
print("Original:", original["text"])
# Get English translation
translation = model.transcribe(audio_file, language="ja", task="translate")
print("Translation:", translation["text"])
Batch Translation
import whisper
from pathlib import Path
model = whisper.load_model("medium")
audio_dir = Path("./foreign_audio")
for audio_file in audio_dir.glob("*.mp3"):
    print(f"\nTranslating {audio_file.name}...")
    result = model.transcribe(
        str(audio_file),
        task="translate"
    )
    # Detected language
    print(f"Detected language: {result['language']}")
    # English translation
    print(f"Translation: {result['text']}")
    # Save the translation
    output_file = audio_dir / f"{audio_file.stem}_en.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])
Lower-Level Translation API
Use DecodingOptions for fine-grained control:
import torch
import whisper
from whisper.decoding import DecodingOptions, decode
model = whisper.load_model("medium")
# Load and preprocess audio
audio = whisper.load_audio("japanese.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
language = max(probs, key=probs.get)
print(f"Detected: {language} ({probs[language]:.2%})")
# Configure translation
options = DecodingOptions(
    language=language,
    task="translate",  # Key parameter for translation
    temperature=0.0,
    fp16=torch.cuda.is_available()  # fp16 requires a GPU; fall back to fp32 on CPU
)
# Decode a 30-second segment
result = decode(model, mel, options)
print(f"Translation: {result.text}")
Translation with Word Timestamps
Word-level timestamps on translations may not be reliable, as the timing is based on the source language but the text is in English.
result = model.transcribe(
    "french.mp3",
    task="translate",
    word_timestamps=True
)
for segment in result["segments"]:
    print(f"\nSegment: {segment['text']}")
    for word in segment.get("words", []):
        print(f"  {word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
While the timestamps correspond to when words were spoken in the original language, the text is the English translation. Use with caution for subtitle generation.
Common Use Cases
Subtitle Generation for Foreign Films
whisper foreign_film.mp4 \
--model medium \
--task translate \
--output_format srt \
--output_dir ./subtitles
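The same subtitles can be produced from Python. Whisper ships subtitle writers in whisper.utils, but the SRT format is simple enough to emit directly from translated segments. A minimal sketch (the sample segment at the end is a placeholder):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper result segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# With a real model you would write:
#   result = model.transcribe("foreign_film.mp4", task="translate")
#   Path("subtitles.srt").write_text(segments_to_srt(result["segments"]), encoding="utf-8")
sample = [{"start": 0.0, "end": 2.5, "text": " Hello, world!"}]
print(segments_to_srt(sample))
```

Keep in mind the caution from the word-timestamps section: segment timings come from the source-language audio, so translated subtitles may not line up perfectly with speech.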
Real-Time Translation API
import whisper
import numpy as np
from whisper.decoding import DecodingOptions, decode
model = whisper.load_model("medium")
def translate_audio_chunk(audio_data: np.ndarray) -> str:
    """
    Translate a chunk of audio data to English.
    Args:
        audio_data: Audio data at a 16 kHz sample rate
    Returns:
        The English translation
    """
    # Ensure the audio is exactly 30 seconds (pad or trim)
    audio = whisper.pad_or_trim(audio_data)
    # Generate a log-Mel spectrogram
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
    mel = mel.to(model.device)
    # Detect the language
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)
    # Translate
    options = DecodingOptions(language=language, task="translate")
    result = decode(model, mel, options)
    return result.text
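Because pad_or_trim discards anything past 30 seconds, longer streams must be split before being passed to translate_audio_chunk. A minimal chunker, sketched under the assumption of 16 kHz mono input:

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30     # decode() operates on 30-second windows

def split_into_chunks(audio: np.ndarray, chunk_seconds: int = CHUNK_SECONDS):
    """Split a 1-D signal into consecutive chunks of at most chunk_seconds."""
    step = chunk_seconds * SAMPLE_RATE
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# Usage with translate_audio_chunk above (file name is a placeholder):
# audio = whisper.load_audio("long_recording.wav")
# english = " ".join(translate_audio_chunk(c) for c in split_into_chunks(audio))
```

Fixed windows can cut a word in half at chunk boundaries; overlapping windows or silence-based splitting give better results, and model.transcribe handles long files for you when streaming is not required.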
Multilingual Meeting Transcription
import whisper
model = whisper.load_model("medium")
def transcribe_meeting(audio_file: str) -> dict:
    """
    Transcribe a multilingual meeting with both original and translated text.
    """
    # Get the original transcription with language detection
    original = model.transcribe(audio_file, task="transcribe")
    # Get the English translation
    translation = model.transcribe(audio_file, task="translate")
    return {
        "language": original["language"],
        "original": original["segments"],
        "translation": translation["segments"]
    }
# Usage
result = transcribe_meeting("multilingual_meeting.mp3")
print(f"Detected language: {result['language']}")
# Note: the two passes may segment the audio differently, so zipped
# pairs are only approximately aligned.
for orig, trans in zip(result["original"], result["translation"]):
    print(f"\n[{orig['start']:.1f}s - {orig['end']:.1f}s]")
    print(f"Original: {orig['text']}")
    print(f"English: {trans['text']}")
Supported Languages
Whisper can translate from any of its 99 supported languages to English. Some examples:
- Spanish (es) → English
- French (fr) → English
- Japanese (ja) → English
- German (de) → English
- Chinese (zh) → English
- Russian (ru) → English
- Korean (ko) → English
- Arabic (ar) → English
For a complete list of supported languages, see the language documentation.