Transcribe audio files programmatically using Whisper’s Python API:
```python
import whisper

# Load the model (downloads on first use)
model = whisper.load_model("turbo")

# Transcribe audio
result = model.transcribe("audio.mp3")

# Print the transcribed text
print(result["text"])
```
The transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
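Each 30-second window also contributes timestamped segments: the dictionary returned by `transcribe()` includes a `"segments"` list whose entries carry `start`, `end`, and `text` keys. A minimal sketch of printing those timestamps, where the `format_timestamp` helper and the sample segment data are illustrative (only the `"segments"` structure itself comes from Whisper):

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm (display helper, not part of Whisper)."""
    minutes, secs = divmod(seconds, 60.0)
    return f"{int(minutes):02d}:{secs:06.3f}"

# Hypothetical data shaped like result["segments"] from model.transcribe()
segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello and welcome."},
    {"start": 4.2, "end": 9.75, "text": " Today we talk about speech recognition."},
]

for segment in segments:
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    print(f"[{start} --> {end}]{segment['text']}")
```

With a real transcription result, replace the hypothetical `segments` list with `result["segments"]`.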
For lower-level access to the model, use detect_language() and decode():
```python
import whisper

model = whisper.load_model("turbo")

# Load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")

# Decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# Print the recognized text
print(result.text)
```
By default, Whisper uses CUDA if available, otherwise CPU. You can specify the device explicitly:
```python
import whisper
import torch

# Pick the device explicitly: GPU if CUDA is available, otherwise CPU
if torch.cuda.is_available():
    device = "cuda"
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = "cpu"
    print("Using CPU")

model = whisper.load_model("turbo", device=device)
```
1
Transcribe Audio and Save the Transcript
```python
import whisper

# Load the turbo model for fast transcription
model = whisper.load_model("turbo")

# Transcribe the audio
result = model.transcribe(
    "podcast.mp3",
    language="en",
    verbose=True
)

# Save the transcription to a file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])

print("Transcription saved to transcript.txt")
```
2
Translate Foreign Language Audio
```python
import whisper

# Use the medium model for translation
model = whisper.load_model("medium")

# Translate Spanish speech to English text
result = model.transcribe(
    "spanish_audio.mp3",
    language="Spanish",
    task="translate"
)

print("English translation:")
print(result["text"])
3
Batch Process Multiple Files
```python
import whisper
import os

model = whisper.load_model("turbo")

audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

for audio_file in audio_files:
    print(f"Processing {audio_file}...")
    result = model.transcribe(audio_file)

    # Save with the same name but a .txt extension
    output_file = os.path.splitext(audio_file)[0] + ".txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])

    print(f"Saved to {output_file}")
```
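The batch loop derives each output path with `os.path.splitext`; the same mapping can be written with `pathlib`, which keeps directory components intact. A small self-contained sketch (the `transcript_path` helper is illustrative, not part of Whisper):

```python
from pathlib import Path

def transcript_path(audio_file: str) -> Path:
    """Map an audio filename to a .txt transcript path with the same base name."""
    return Path(audio_file).with_suffix(".txt")

# Demonstrate the mapping on a few hypothetical filenames
for name in ["audio1.mp3", "interview.wav", "talks/keynote.m4a"]:
    print(f"{name} -> {transcript_path(name)}")
```

Inside the batch loop, `output_file = transcript_path(audio_file)` would replace the `os.path.splitext` line.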