Whisper is OpenAI’s powerful automatic speech recognition (ASR) model that can transcribe audio in multiple languages and translate speech to English. ONNX Runtime GenAI provides optimized support for Whisper models with hardware acceleration.

Overview

Whisper models provide:
  • Multi-language support: Transcribe audio in 99 languages
  • Robust performance: Trained on 680,000 hours of multilingual data
  • Translation capability: Translate foreign language speech to English
  • Punctuation and casing: Automatic formatting of transcriptions
  • Timestamp support: Optional word-level or segment-level timestamps
Whisper models in ONNX Runtime GenAI use beam search decoding for high-quality transcriptions.
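The effect of beam search can be seen in a toy sketch that is independent of Whisper and ONNX Runtime GenAI: at each step the decoder keeps the `num_beams` highest-scoring partial sequences (ranked by summed log-probability) rather than only the single best next token. The transition table below is made up purely for illustration.

```python
log_probs = {                    # made-up next-token log-probabilities
    "A": {"A": -2.0, "B": -0.5, "C": -1.5},
    "B": {"A": -0.7, "B": -2.0, "C": -1.0},
    "C": {"A": -1.0, "B": -1.0, "C": -1.2},
}

def beam_search(start, steps, num_beams=2):
    beams = [([start], 0.0)]                 # (sequence, summed log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in log_probs[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # keep only the num_beams best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

for seq, score in beam_search("A", 3, num_beams=2):
    print("".join(seq), round(score, 2))
# ABAB -1.7
# ABCA -2.5
```

Greedy decoding (`num_beams=1`) can commit early to a token that looks locally best but leads to a worse overall sequence; keeping several hypotheses alive avoids that, at the cost of more decoder passes.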

Model Architecture

Whisper uses an encoder-decoder transformer architecture:
  • Audio Encoder: Processes audio spectrograms into embeddings
  • Text Decoder: Generates transcription tokens autoregressively
  • Multi-task Framework: Supports transcription, translation, and language detection
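As a shape-level illustration of that flow, with random NumPy matrices standing in for the real weights (`d_model=384` is borrowed from whisper-tiny; 51865 is the multilingual vocabulary size): the encoder runs once per 30-second segment, while the decoder consumes its output once per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 384                               # whisper-tiny's hidden width

# Encoder: runs ONCE over the 30-second log-mel spectrogram
mel = rng.standard_normal((80, 3000))       # 80 mel bins x 3000 frames (30 s)
W_enc = rng.standard_normal((80, d_model))  # stand-in for the encoder stack
audio_states = mel.T @ W_enc                # (3000, d_model) audio embeddings

# Decoder: runs once PER TOKEN, cross-attending to audio_states
W_out = rng.standard_normal((d_model, 51865))  # stand-in output projection
tokens = [50258]                               # <|startoftranscript|> token id
for _ in range(3):
    context = audio_states.mean(axis=0)        # stand-in for cross-attention
    logits = context @ W_out                   # next-token scores over the vocab
    tokens.append(int(np.argmax(logits)))

print(audio_states.shape, len(tokens))         # (3000, 384) 4
```

This is why long transcriptions are decoder-bound: the encoder cost is fixed per segment, but the decoder cost grows with the number of output tokens.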

Audio Preprocessing

Whisper expects audio to be:
  • Sampling Rate: 16 kHz
  • Format: Mono channel
  • Duration: Up to 30 seconds per segment (split longer audio into segments before processing)
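If your source audio does not match these requirements, a minimal conversion using only the standard library and NumPy might look like the sketch below. Real pipelines typically use ffmpeg or librosa; `np.interp` is a crude linear resampler used here only for illustration.

```python
import wave
import numpy as np

TARGET_RATE = 16000

def load_audio_16k_mono(path):
    """Read a 16-bit PCM WAV file and return 16 kHz mono float32 samples."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.readframes(wf.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
    if channels > 1:                          # downmix stereo to mono
        audio = audio.reshape(-1, channels).mean(axis=1)
    if rate != TARGET_RATE:                   # crude linear resample
        duration = len(audio) / rate
        n_target = int(duration * TARGET_RATE)
        old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)

# Demo: write a 1-second stereo 44.1 kHz tone, then convert it
t = np.arange(44100) / 44100.0
tone = (np.sin(2 * np.pi * 440.0 * t) * 32767).astype(np.int16)
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(np.column_stack([tone, tone]).tobytes())

audio = load_audio_16k_mono("demo.wav")
print(len(audio))  # 16000 samples = 1 second at 16 kHz
```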

Using Whisper Models

Basic Transcription

import onnxruntime_genai as og

# Load model
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load audio file
audios = og.Audios.open("audio.wav")

# Create decoder prompt for transcription
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|en|>",              # Language: English
    "<|transcribe|>",      # Task: transcribe (not translate)
    "<|notimestamps|>"     # No timestamps
]
prompt = "".join(decoder_prompt_tokens)

# Process audio
inputs = processor(prompt, audios=audios)

# Generate transcription with beam search
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,       # Use greedy/beam search, not sampling
    num_beams=4,           # Beam search with 4 beams
    num_return_sequences=4, # Return all beam results
    max_length=448         # Maximum transcription length
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

# Generate tokens
while not generator.is_done():
    generator.generate_next_token()

# Get transcription from best beam
tokens = generator.get_sequence(0)
transcription = processor.decode(tokens)

print(f"Transcription: {transcription}")

Multi-File Batch Processing

import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load multiple audio files
audio_paths = ["audio1.wav", "audio2.mp3", "audio3.wav"]
audios = og.Audios.open(*audio_paths)

batch_size = len(audio_paths)

# Create prompts for batch
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    "<|notimestamps|>"
]
prompts = ["".join(decoder_prompt_tokens)] * batch_size

# Process batch
inputs = processor(prompts, audios=audios)

# Generate with beam search
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=4,
    num_return_sequences=4,
    max_length=448,
    batch_size=batch_size
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

# Get transcriptions
for i in range(batch_size):
    # Get best beam for each audio file
    tokens = generator.get_sequence(i * 4)  # 4 beams per audio
    transcription = processor.decode(tokens)
    print(f"Audio {i+1}: {transcription}")

Beam Search Results

Access multiple beam search hypotheses:
import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load and process audio
audios = og.Audios.open("audio.wav")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
inputs = processor(prompt, audios=audios)

# Generate with beam search
num_beams = 4
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=num_beams,
    num_return_sequences=num_beams,
    max_length=448,
    batch_size=1
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

# Get all beam results
print("Beam search results:")
for i in range(num_beams):
    tokens = generator.get_sequence(i)
    transcription = processor.decode(tokens)
    print(f"  Beam {i}: {transcription}")

Language Support

Transcribe in Different Languages

# Transcribe Spanish audio
decoder_prompt_spanish = [
    "<|startoftranscript|>",
    "<|es|>",              # Language: Spanish
    "<|transcribe|>",
    "<|notimestamps|>"
]

# Transcribe French audio
decoder_prompt_french = [
    "<|startoftranscript|>",
    "<|fr|>",              # Language: French
    "<|transcribe|>",
    "<|notimestamps|>"
]

# Transcribe Japanese audio
decoder_prompt_japanese = [
    "<|startoftranscript|>",
    "<|ja|>",              # Language: Japanese
    "<|transcribe|>",
    "<|notimestamps|>"
]

Supported Language Codes

  • <|en|> - English
  • <|es|> - Spanish
  • <|fr|> - French
  • <|de|> - German
  • <|it|> - Italian
  • <|pt|> - Portuguese
  • <|ru|> - Russian
  • <|ja|> - Japanese
  • <|ko|> - Korean
  • <|zh|> - Chinese
  • <|ar|> - Arabic
  • <|hi|> - Hindi
See Whisper documentation for the full list of 99 supported languages.
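For convenience, the decoder prompt can be assembled by a small helper that validates the language code first. The token names match those used throughout this page; the `LANGUAGES` set below covers only the codes listed above and would need extending for the full 99-language set.

```python
# Codes from the list above; extend for the full 99-language set
LANGUAGES = {"en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh", "ar", "hi"}

def build_prompt(language="en", translate=False, timestamps=False):
    """Assemble a Whisper decoder prompt string."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language code: {language}")
    task = "<|translate|>" if translate else "<|transcribe|>"
    parts = ["<|startoftranscript|>", f"<|{language}|>", task]
    if not timestamps:
        parts.append("<|notimestamps|>")
    return "".join(parts)

print(build_prompt("fr", translate=True))
# <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```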

Translation to English

Translate non-English audio to English:
import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load non-English audio
audios = og.Audios.open("french_audio.wav")

# Create translation prompt
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|fr|>",              # Source language: French
    "<|translate|>",       # Task: translate to English
    "<|notimestamps|>"
]
prompt = "".join(decoder_prompt_tokens)

# Process and generate
inputs = processor(prompt, audios=audios)

params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=4,
    num_return_sequences=1,
    max_length=448
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

tokens = generator.get_sequence(0)
english_translation = processor.decode(tokens)

print(f"English Translation: {english_translation}")

Audio Input Handling

Supported Audio Formats

Whisper supports common audio formats:
  • WAV
  • MP3
  • FLAC
  • OGG
  • M4A

Loading Audio Files

import os
import onnxruntime_genai as og

# Single file
if os.path.exists("audio.wav"):
    audios = og.Audios.open("audio.wav")
else:
    raise FileNotFoundError("Audio file not found")

# Multiple files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.flac"]
for path in audio_files:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Audio file not found: {path}")

audios = og.Audios.open(*audio_files)

Audio Preprocessing

Audio is automatically preprocessed:
  1. Resampling: Converted to 16 kHz sampling rate
  2. Channel Mixing: Stereo audio converted to mono
  3. Normalization: Audio levels normalized
  4. Feature Extraction: Converted to mel-spectrogram features
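Step 4 is the most involved: Whisper's front end computes a log-mel spectrogram with `n_fft=400`, `hop_length=160`, and 80 mel bins at 16 kHz. The NumPy sketch below uses a simplified triangular filterbank to illustrate the idea; it is not the exact filterbank the model was trained with.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 400, 160, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale (simplified)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), N_MELS + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mels) / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def log_mel_spectrogram(audio):
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, 201)
    mel = mel_filterbank() @ power.T                       # (80, n_frames)
    return np.log10(np.maximum(mel, 1e-10))

features = log_mel_spectrogram(np.random.randn(SR).astype(np.float32))
print(features.shape)  # (80, 98) for one second of audio
```

At 16 kHz with a 160-sample hop this yields 100 frames per second, so a full 30-second segment produces the (80, 3000) feature matrix the encoder expects.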

Advanced Usage

Custom Search Parameters

params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,           # Deterministic decoding
    num_beams=5,               # More beams = higher quality, slower
    num_return_sequences=1,    # Return only best result
    max_length=448,            # Maximum token length
    length_penalty=1.0,        # No length penalty
    repetition_penalty=1.0     # No repetition penalty
)

Interactive Transcription

import onnxruntime_genai as og
import glob
import readline

def _complete(text, state):
    return (glob.glob(text + "*") + [None])[state]

class WhisperTranscriber:
    def __init__(self, model_path: str, execution_provider: str = "cuda"):
        config = og.Config(model_path)
        if execution_provider != "follow_config":
            config.clear_providers()
            if execution_provider != "cpu":
                config.append_provider(execution_provider)
        
        self.model = og.Model(config)
        self.processor = self.model.create_multimodal_processor()
    
    def transcribe(
        self,
        audio_paths: list,
        language: str = "en",
        num_beams: int = 4
    ) -> list:
        # Load audio
        audios = og.Audios.open(*audio_paths)
        batch_size = len(audio_paths)
        
        # Create prompts
        decoder_prompt = [
            "<|startoftranscript|>",
            f"<|{language}|>",
            "<|transcribe|>",
            "<|notimestamps|>"
        ]
        prompts = ["".join(decoder_prompt)] * batch_size
        
        # Process
        inputs = self.processor(prompts, audios=audios)
        
        # Generate
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            do_sample=False,
            num_beams=num_beams,
            num_return_sequences=num_beams,
            max_length=448,
            batch_size=batch_size
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        while not generator.is_done():
            generator.generate_next_token()
        
        # Get results
        transcriptions = []
        for i in range(batch_size * num_beams):
            tokens = generator.get_sequence(i)
            transcription = self.processor.decode(tokens)
            transcriptions.append(transcription)
        
        return transcriptions

# Interactive mode
if __name__ == "__main__":
    transcriber = WhisperTranscriber("./whisper-model", "cuda")
    
    readline.set_completer_delims(" \t\n;")
    readline.parse_and_bind("tab: complete")
    readline.set_completer(_complete)
    
    while True:
        audio_input = input("Audio Paths (comma separated, or 'quit'): ")
        if audio_input.lower() == "quit":
            break
        
        audio_paths = [path.strip() for path in audio_input.split(",")]
        
        language = input("Language code (default: en): ").strip() or "en"
        
        print("\nTranscribing...")
        results = transcriber.transcribe(audio_paths, language=language)
        
        print("\nResults:")
        for i, transcription in enumerate(results):
            batch_idx = i // 4
            beam_idx = i % 4
            print(f"  File {batch_idx + 1}, Beam {beam_idx}: {transcription}")
        print()

Performance Optimization

Execution Providers

Choose the best execution provider for your hardware:
config = og.Config("./whisper-model")
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)
CUDA is best for NVIDIA GPUs and provides the fastest inference; for CPU-only machines, call clear_providers() without appending a provider.

Beam Search Tuning

Adjust beam search parameters based on your needs:
# Fast (lower quality)
params.set_search_options(
    num_beams=1,           # Greedy decoding
    num_return_sequences=1
)

# Balanced
params.set_search_options(
    num_beams=4,
    num_return_sequences=1
)

# High quality (slower)
params.set_search_options(
    num_beams=8,
    num_return_sequences=1
)

Batch Processing

Process multiple files together for better throughput:
# Process files individually (slower)
for audio_path in audio_paths:
    audios = og.Audios.open(audio_path)
    # ... process ...

# Process as batch (faster)
audios = og.Audios.open(*audio_paths)
# ... process all at once ...

Example Application: Audio Transcription CLI

import onnxruntime_genai as og
import argparse
import os
from typing import List

def transcribe_audio(
    model_path: str,
    audio_paths: List[str],
    language: str = "en",
    translate: bool = False,
    num_beams: int = 4,
    execution_provider: str = "cuda"
) -> List[str]:
    """Transcribe audio files using Whisper.
    
    Args:
        model_path: Path to ONNX Whisper model
        audio_paths: List of audio file paths
        language: Source language code
        translate: Translate to English if True
        num_beams: Number of beams for beam search
        execution_provider: Hardware acceleration provider
    
    Returns:
        List of transcriptions
    """
    # Validate audio files
    for path in audio_paths:
        if not os.path.exists(path):
            raise FileNotFoundError(f"Audio file not found: {path}")
    
    # Load model
    config = og.Config(model_path)
    if execution_provider != "follow_config":
        config.clear_providers()
        if execution_provider != "cpu":
            config.append_provider(execution_provider)
    
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    
    # Load audio files
    print(f"Loading {len(audio_paths)} audio file(s)...")
    audios = og.Audios.open(*audio_paths)
    
    # Create decoder prompts
    batch_size = len(audio_paths)
    task_token = "<|translate|>" if translate else "<|transcribe|>"
    decoder_prompt_tokens = [
        "<|startoftranscript|>",
        f"<|{language}|>",
        task_token,
        "<|notimestamps|>"
    ]
    prompts = ["".join(decoder_prompt_tokens)] * batch_size
    
    # Process audio
    print("Processing audio...")
    inputs = processor(prompts, audios=audios)
    
    # Generate transcriptions
    params = og.GeneratorParams(model)
    params.set_search_options(
        do_sample=False,
        num_beams=num_beams,
        num_return_sequences=num_beams,
        max_length=448,
        batch_size=batch_size
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print("Generating transcriptions...")
    while not generator.is_done():
        generator.generate_next_token()
    
    # Extract transcriptions (best beam for each file)
    transcriptions = []
    for i in range(batch_size):
        tokens = generator.get_sequence(i * num_beams)
        transcription = processor.decode(tokens)
        transcriptions.append(transcription.strip())
    
    return transcriptions

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using Whisper"
    )
    parser.add_argument(
        "-m", "--model_path",
        type=str, required=True,
        help="Path to ONNX Whisper model"
    )
    parser.add_argument(
        "-a", "--audio",
        type=str, nargs="+", required=True,
        help="Audio file path(s)"
    )
    parser.add_argument(
        "-l", "--language",
        type=str, default="en",
        help="Source language code (default: en)"
    )
    parser.add_argument(
        "-t", "--translate",
        action="store_true",
        help="Translate to English"
    )
    parser.add_argument(
        "-b", "--num_beams",
        type=int, default=4,
        help="Number of beams for beam search (default: 4)"
    )
    parser.add_argument(
        "-e", "--execution_provider",
        type=str, default="cuda",
        choices=["cpu", "cuda"],
        help="Execution provider (default: cuda)"
    )
    parser.add_argument(
        "-o", "--output",
        type=str,
        help="Output file for transcriptions (optional)"
    )
    
    args = parser.parse_args()
    
    # Transcribe
    try:
        transcriptions = transcribe_audio(
            model_path=args.model_path,
            audio_paths=args.audio,
            language=args.language,
            translate=args.translate,
            num_beams=args.num_beams,
            execution_provider=args.execution_provider
        )
        
        # Display results
        print("\n" + "=" * 60)
        print("TRANSCRIPTIONS")
        print("=" * 60)
        
        for i, (path, transcription) in enumerate(zip(args.audio, transcriptions), 1):
            print(f"\n[{i}] {path}")
            print(f"    {transcription}")
        
        # Save to file if requested
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                for path, transcription in zip(args.audio, transcriptions):
                    f.write(f"{path}\n{transcription}\n\n")
            print(f"\nTranscriptions saved to: {args.output}")
    
    except Exception as e:
        print(f"Error: {e}")
        return 1
    
    return 0

if __name__ == "__main__":
    raise SystemExit(main())

Troubleshooting

Audio File Not Loading

Verify that the file exists and that its format is supported:
import os

# Verify file exists
audio_path = "audio.wav"
if not os.path.exists(audio_path):
    print(f"File not found: {audio_path}")

# Check file format
import mimetypes
mime_type = mimetypes.guess_type(audio_path)[0]
print(f"Detected file type: {mime_type}")

# Supported formats: audio/wav, audio/mpeg, audio/flac, etc.

Poor Transcription Quality

Improve transcription quality:
  1. Increase beam search beams:
    params.set_search_options(num_beams=8)  # More beams
    
  2. Ensure correct language code:
    # Use correct language for best results
    decoder_prompt = ["<|startoftranscript|>", "<|es|>", ...]  # Spanish
    
  3. Check audio quality:
    • Ensure 16 kHz sampling rate
    • Minimize background noise
    • Use clear speech

Long Audio Files

For long audio files:
# Process in smaller chunks
# Whisper automatically handles up to 30 second segments
# For longer files, consider splitting before processing

# Or reduce batch size
batch_size = 1  # Process one file at a time
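If you do split long files yourself, a standard-library sketch of 30-second chunking might look like this (the chunk naming scheme is illustrative):

```python
import wave

CHUNK_SECONDS = 30

def split_wav(path):
    """Split a WAV file into CHUNK_SECONDS-long pieces; return their paths."""
    chunk_paths = []
    with wave.open(path, "rb") as wf:
        params = wf.getparams()
        frames_per_chunk = params.framerate * CHUNK_SECONDS
        idx = 0
        while True:
            frames = wf.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{path}.chunk{idx}.wav"   # illustrative naming scheme
            with wave.open(out_path, "wb") as out:
                out.setparams(params)             # nframes is fixed up on close
                out.writeframes(frames)
            chunk_paths.append(out_path)
            idx += 1
    return chunk_paths

# Demo: a 65-second silent mono file yields three chunks (30 + 30 + 5 s)
with wave.open("long.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(bytes(65 * 16000 * 2))

print(len(split_wav("long.wav")))  # 3
```

The resulting chunk paths can then be passed to og.Audios.open(*chunk_paths) and transcribed as a batch.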

Next Steps

Phi-4 Multi-Modal

Combine audio with vision using Phi-4

Model Optimization

Optimize Whisper for faster inference

Deployment Guide

Deploy Whisper to production

API Reference

Explore the full API documentation
