Whisper is OpenAI’s powerful automatic speech recognition (ASR) model that can transcribe audio in multiple languages and translate speech to English. ONNX Runtime GenAI provides optimized support for Whisper models with hardware acceleration.

Overview

Whisper models provide:
  • Multi-language support: Transcribe audio in 99 languages
  • Robust performance: Trained on 680,000 hours of multilingual data
  • Translation capability: Translate foreign language speech to English
  • Punctuation and casing: Automatic formatting of transcriptions
  • Timestamp support: Optional word-level or segment-level timestamps
Whisper models in ONNX Runtime GenAI use beam search decoding for high-quality transcriptions.
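The effect of beam search can be seen in a toy sketch that is independent of Whisper and ONNX Runtime GenAI: at each step the decoder keeps the `num_beams` highest-scoring partial sequences (ranked by summed log-probability) rather than only the single best next token. The transition table below is made up purely for illustration.

```python
log_probs = {                    # made-up next-token log-probabilities
    "A": {"A": -2.0, "B": -0.5, "C": -1.5},
    "B": {"A": -0.7, "B": -2.0, "C": -1.0},
    "C": {"A": -1.0, "B": -1.0, "C": -1.2},
}

def beam_search(start, steps, num_beams=2):
    beams = [([start], 0.0)]                 # (sequence, summed log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in log_probs[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # keep only the num_beams best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

for seq, score in beam_search("A", 3, num_beams=2):
    print("".join(seq), round(score, 2))
# ABAB -1.7
# ABCA -2.5
```

Greedy decoding (`num_beams=1`) can commit early to a token that looks locally best but leads to a worse overall sequence; keeping several hypotheses alive avoids that, at the cost of more decoder passes.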

Model Architecture

Whisper uses an encoder-decoder transformer architecture:
  • Audio Encoder: Processes audio spectrograms into embeddings
  • Text Decoder: Generates transcription tokens autoregressively
  • Multi-task Framework: Supports transcription, translation, and language detection
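As a shape-level illustration of that flow, with random NumPy matrices standing in for the real weights (`d_model=384` is borrowed from whisper-tiny; 51865 is the multilingual vocabulary size): the encoder runs once per 30-second segment, while the decoder consumes its output once per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 384                               # whisper-tiny's hidden width

# Encoder: runs ONCE over the 30-second log-mel spectrogram
mel = rng.standard_normal((80, 3000))       # 80 mel bins x 3000 frames (30 s)
W_enc = rng.standard_normal((80, d_model))  # stand-in for the encoder stack
audio_states = mel.T @ W_enc                # (3000, d_model) audio embeddings

# Decoder: runs once PER TOKEN, cross-attending to audio_states
W_out = rng.standard_normal((d_model, 51865))  # stand-in output projection
tokens = [50258]                               # <|startoftranscript|> token id
for _ in range(3):
    context = audio_states.mean(axis=0)        # stand-in for cross-attention
    logits = context @ W_out                   # next-token scores over the vocab
    tokens.append(int(np.argmax(logits)))

print(audio_states.shape, len(tokens))         # (3000, 384) 4
```

This is why long transcriptions are decoder-bound: the encoder cost is fixed per segment, but the decoder cost grows with the number of output tokens.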

Audio Preprocessing

Whisper expects audio to be:
  • Sampling Rate: 16 kHz
  • Format: Mono channel
  • Duration: Up to 30 seconds per segment (split longer audio into segments before processing)
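If your source audio does not match these requirements, a minimal conversion using only the standard library and NumPy might look like the sketch below. Real pipelines typically use ffmpeg or librosa; `np.interp` is a crude linear resampler used here only for illustration.

```python
import wave
import numpy as np

TARGET_RATE = 16000

def load_audio_16k_mono(path):
    """Read a 16-bit PCM WAV file and return 16 kHz mono float32 samples."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.readframes(wf.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
    if channels > 1:                          # downmix stereo to mono
        audio = audio.reshape(-1, channels).mean(axis=1)
    if rate != TARGET_RATE:                   # crude linear resample
        duration = len(audio) / rate
        n_target = int(duration * TARGET_RATE)
        old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)

# Demo: write a 1-second stereo 44.1 kHz tone, then convert it
t = np.arange(44100) / 44100.0
tone = (np.sin(2 * np.pi * 440.0 * t) * 32767).astype(np.int16)
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(np.column_stack([tone, tone]).tobytes())

audio = load_audio_16k_mono("demo.wav")
print(len(audio))  # 16000 samples = 1 second at 16 kHz
```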

Using Whisper Models

Basic Transcription

import onnxruntime_genai as og

# Load model
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load audio file
audios = og.Audios.open("audio.wav")

# Create decoder prompt for transcription
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|en|>",              # Language: English
    "<|transcribe|>",      # Task: transcribe (not translate)
    "<|notimestamps|>"     # No timestamps
]
prompt = "".join(decoder_prompt_tokens)

# Process audio
inputs = processor(prompt, audios=audios)

# Generate transcription with beam search
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,       # Use greedy/beam search, not sampling
    num_beams=4,           # Beam search with 4 beams
    num_return_sequences=4, # Return all beam results
    max_length=448         # Maximum transcription length
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

# Generate tokens
while not generator.is_done():
    generator.generate_next_token()

# Get transcription from best beam
tokens = generator.get_sequence(0)
transcription = processor.decode(tokens)

print(f"Transcription: {transcription}")

Multi-File Batch Processing

import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load multiple audio files
audio_paths = ["audio1.wav", "audio2.mp3", "audio3.wav"]
audios = og.Audios.open(*audio_paths)

batch_size = len(audio_paths)

# Create prompts for batch
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    "<|notimestamps|>"
]
prompts = ["".join(decoder_prompt_tokens)] * batch_size

# Process batch
inputs = processor(prompts, audios=audios)

# Generate with beam search
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=4,
    num_return_sequences=4,
    max_length=448,
    batch_size=batch_size
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

# Get transcriptions
for i in range(batch_size):
    # Get best beam for each audio file
    tokens = generator.get_sequence(i * 4)  # 4 beams per audio
    transcription = processor.decode(tokens)
    print(f"Audio {i+1}: {transcription}")

Beam Search Results

Access multiple beam search hypotheses:
import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load and process audio
audios = og.Audios.open("audio.wav")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
inputs = processor(prompt, audios=audios)

# Generate with beam search
num_beams = 4
params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=num_beams,
    num_return_sequences=num_beams,
    max_length=448,
    batch_size=1
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

# Get all beam results
print("Beam search results:")
for i in range(num_beams):
    tokens = generator.get_sequence(i)
    transcription = processor.decode(tokens)
    print(f"  Beam {i}: {transcription}")

Language Support

Transcribe in Different Languages

# Transcribe Spanish audio
decoder_prompt_spanish = [
    "<|startoftranscript|>",
    "<|es|>",              # Language: Spanish
    "<|transcribe|>",
    "<|notimestamps|>"
]

# Transcribe French audio
decoder_prompt_french = [
    "<|startoftranscript|>",
    "<|fr|>",              # Language: French
    "<|transcribe|>",
    "<|notimestamps|>"
]

# Transcribe Japanese audio
decoder_prompt_japanese = [
    "<|startoftranscript|>",
    "<|ja|>",              # Language: Japanese
    "<|transcribe|>",
    "<|notimestamps|>"
]

Supported Language Codes

  • <|en|> - English
  • <|es|> - Spanish
  • <|fr|> - French
  • <|de|> - German
  • <|it|> - Italian
  • <|pt|> - Portuguese
  • <|ru|> - Russian
  • <|ja|> - Japanese
  • <|ko|> - Korean
  • <|zh|> - Chinese
  • <|ar|> - Arabic
  • <|hi|> - Hindi
See Whisper documentation for the full list of 99 supported languages.
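For convenience, the decoder prompt can be assembled by a small helper that validates the language code first. The token names match those used throughout this page; the `LANGUAGES` set below covers only the codes listed above and would need extending for the full 99-language set.

```python
# Codes from the list above; extend for the full 99-language set
LANGUAGES = {"en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh", "ar", "hi"}

def build_prompt(language="en", translate=False, timestamps=False):
    """Assemble a Whisper decoder prompt string."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language code: {language}")
    task = "<|translate|>" if translate else "<|transcribe|>"
    parts = ["<|startoftranscript|>", f"<|{language}|>", task]
    if not timestamps:
        parts.append("<|notimestamps|>")
    return "".join(parts)

print(build_prompt("fr", translate=True))
# <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```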

Translation to English

Translate non-English audio to English:
import onnxruntime_genai as og

# Model and processor created as in Basic Transcription
config = og.Config("path/to/whisper-model")
model = og.Model(config)
processor = model.create_multimodal_processor()

# Load non-English audio
audios = og.Audios.open("french_audio.wav")

# Create translation prompt
decoder_prompt_tokens = [
    "<|startoftranscript|>",
    "<|fr|>",              # Source language: French
    "<|translate|>",       # Task: translate to English
    "<|notimestamps|>"
]
prompt = "".join(decoder_prompt_tokens)

# Process and generate
inputs = processor(prompt, audios=audios)

params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,
    num_beams=4,
    num_return_sequences=1,
    max_length=448
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()

tokens = generator.get_sequence(0)
english_translation = processor.decode(tokens)

print(f"English Translation: {english_translation}")

Audio Input Handling

Supported Audio Formats

Whisper supports common audio formats:
  • WAV
  • MP3
  • FLAC
  • OGG
  • M4A

Loading Audio Files

import os
import onnxruntime_genai as og

# Single file
if os.path.exists("audio.wav"):
    audios = og.Audios.open("audio.wav")
else:
    raise FileNotFoundError("Audio file not found")

# Multiple files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.flac"]
for path in audio_files:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Audio file not found: {path}")

audios = og.Audios.open(*audio_files)

Audio Preprocessing

Audio is automatically preprocessed:
  1. Resampling: Converted to 16 kHz sampling rate
  2. Channel Mixing: Stereo audio converted to mono
  3. Normalization: Audio levels normalized
  4. Feature Extraction: Converted to mel-spectrogram features
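Step 4 is the most involved: Whisper's front end computes a log-mel spectrogram with `n_fft=400`, `hop_length=160`, and 80 mel bins at 16 kHz. The NumPy sketch below uses a simplified triangular filterbank to illustrate the idea; it is not the exact filterbank the model was trained with.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 400, 160, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale (simplified)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), N_MELS + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mels) / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def log_mel_spectrogram(audio):
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, 201)
    mel = mel_filterbank() @ power.T                       # (80, n_frames)
    return np.log10(np.maximum(mel, 1e-10))

features = log_mel_spectrogram(np.random.randn(SR).astype(np.float32))
print(features.shape)  # (80, 98) for one second of audio
```

At 16 kHz with a 160-sample hop this yields 100 frames per second, so a full 30-second segment produces the (80, 3000) feature matrix the encoder expects.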

Advanced Usage

Custom Search Parameters

params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=False,           # Deterministic decoding
    num_beams=5,               # More beams = higher quality, slower
    num_return_sequences=1,    # Return only best result
    max_length=448,            # Maximum token length
    length_penalty=1.0,        # No length penalty
    repetition_penalty=1.0     # No repetition penalty
)

Interactive Transcription

import onnxruntime_genai as og
import glob
import readline

def _complete(text, state):
    return (glob.glob(text + "*") + [None])[state]

class WhisperTranscriber:
    def __init__(self, model_path: str, execution_provider: str = "cuda"):
        config = og.Config(model_path)
        if execution_provider != "follow_config":
            config.clear_providers()
            if execution_provider != "cpu":
                config.append_provider(execution_provider)
        
        self.model = og.Model(config)
        self.processor = self.model.create_multimodal_processor()
    
    def transcribe(
        self,
        audio_paths: list,
        language: str = "en",
        num_beams: int = 4
    ) -> list:
        # Load audio
        audios = og.Audios.open(*audio_paths)
        batch_size = len(audio_paths)
        
        # Create prompts
        decoder_prompt = [
            "<|startoftranscript|>",
            f"<|{language}|>",
            "<|transcribe|>",
            "<|notimestamps|>"
        ]
        prompts = ["".join(decoder_prompt)] * batch_size
        
        # Process
        inputs = self.processor(prompts, audios=audios)
        
        # Generate
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            do_sample=False,
            num_beams=num_beams,
            num_return_sequences=num_beams,
            max_length=448,
            batch_size=batch_size
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        while not generator.is_done():
            generator.generate_next_token()
        
        # Get results
        transcriptions = []
        for i in range(batch_size * num_beams):
            tokens = generator.get_sequence(i)
            transcription = self.processor.decode(tokens)
            transcriptions.append(transcription)
        
        return transcriptions

# Interactive mode
if __name__ == "__main__":
    transcriber = WhisperTranscriber("./whisper-model", "cuda")
    
    readline.set_completer_delims(" \t\n;")
    readline.parse_and_bind("tab: complete")
    readline.set_completer(_complete)
    
    while True:
        audio_input = input("Audio Paths (comma separated, or 'quit'): ")
        if audio_input.lower() == "quit":
            break
        
        audio_paths = [path.strip() for path in audio_input.split(",")]
        
        language = input("Language code (default: en): ").strip() or "en"
        
        print("\nTranscribing...")
        results = transcriber.transcribe(audio_paths, language=language)
        
        print("\nResults:")
        for i, transcription in enumerate(results):
            batch_idx = i // 4
            beam_idx = i % 4
            print(f"  File {batch_idx + 1}, Beam {beam_idx}: {transcription}")
        print()

Performance Optimization

Execution Providers

Choose the best execution provider for your hardware:
config = og.Config("./whisper-model")
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)
CUDA is best for NVIDIA GPUs and provides the fastest inference; for CPU-only machines, call clear_providers() without appending a provider.

Beam Search Tuning

Adjust beam search parameters based on your needs:
# Fast (lower quality)
params.set_search_options(
    num_beams=1,           # Greedy decoding
    num_return_sequences=1
)

# Balanced
params.set_search_options(
    num_beams=4,
    num_return_sequences=1
)

# High quality (slower)
params.set_search_options(
    num_beams=8,
    num_return_sequences=1
)

Batch Processing

Process multiple files together for better throughput:
# Process files individually (slower)
for audio_path in audio_paths:
    audios = og.Audios.open(audio_path)
    # ... process ...

# Process as batch (faster)
audios = og.Audios.open(*audio_paths)
# ... process all at once ...

Example Application: Audio Transcription CLI

import onnxruntime_genai as og
import argparse
import os
from typing import List

def transcribe_audio(
    model_path: str,
    audio_paths: List[str],
    language: str = "en",
    translate: bool = False,
    num_beams: int = 4,
    execution_provider: str = "cuda"
) -> List[str]:
    """Transcribe audio files using Whisper.
    
    Args:
        model_path: Path to ONNX Whisper model
        audio_paths: List of audio file paths
        language: Source language code
        translate: Translate to English if True
        num_beams: Number of beams for beam search
        execution_provider: Hardware acceleration provider
    
    Returns:
        List of transcriptions
    """
    # Validate audio files
    for path in audio_paths:
        if not os.path.exists(path):
            raise FileNotFoundError(f"Audio file not found: {path}")
    
    # Load model
    config = og.Config(model_path)
    if execution_provider != "follow_config":
        config.clear_providers()
        if execution_provider != "cpu":
            config.append_provider(execution_provider)
    
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    
    # Load audio files
    print(f"Loading {len(audio_paths)} audio file(s)...")
    audios = og.Audios.open(*audio_paths)
    
    # Create decoder prompts
    batch_size = len(audio_paths)
    task_token = "<|translate|>" if translate else "<|transcribe|>"
    decoder_prompt_tokens = [
        "<|startoftranscript|>",
        f"<|{language}|>",
        task_token,
        "<|notimestamps|>"
    ]
    prompts = ["".join(decoder_prompt_tokens)] * batch_size
    
    # Process audio
    print("Processing audio...")
    inputs = processor(prompts, audios=audios)
    
    # Generate transcriptions
    params = og.GeneratorParams(model)
    params.set_search_options(
        do_sample=False,
        num_beams=num_beams,
        num_return_sequences=num_beams,
        max_length=448,
        batch_size=batch_size
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print("Generating transcriptions...")
    while not generator.is_done():
        generator.generate_next_token()
    
    # Extract transcriptions (best beam for each file)
    transcriptions = []
    for i in range(batch_size):
        tokens = generator.get_sequence(i * num_beams)
        transcription = processor.decode(tokens)
        transcriptions.append(transcription.strip())
    
    return transcriptions

def main():
    parser = argparse.ArgumentParser(
        description="Transcribe audio files using Whisper"
    )
    parser.add_argument(
        "-m", "--model_path",
        type=str, required=True,
        help="Path to ONNX Whisper model"
    )
    parser.add_argument(
        "-a", "--audio",
        type=str, nargs="+", required=True,
        help="Audio file path(s)"
    )
    parser.add_argument(
        "-l", "--language",
        type=str, default="en",
        help="Source language code (default: en)"
    )
    parser.add_argument(
        "-t", "--translate",
        action="store_true",
        help="Translate to English"
    )
    parser.add_argument(
        "-b", "--num_beams",
        type=int, default=4,
        help="Number of beams for beam search (default: 4)"
    )
    parser.add_argument(
        "-e", "--execution_provider",
        type=str, default="cuda",
        choices=["cpu", "cuda"],
        help="Execution provider (default: cuda)"
    )
    parser.add_argument(
        "-o", "--output",
        type=str,
        help="Output file for transcriptions (optional)"
    )
    
    args = parser.parse_args()
    
    # Transcribe
    try:
        transcriptions = transcribe_audio(
            model_path=args.model_path,
            audio_paths=args.audio,
            language=args.language,
            translate=args.translate,
            num_beams=args.num_beams,
            execution_provider=args.execution_provider
        )
        
        # Display results
        print("\n" + "=" * 60)
        print("TRANSCRIPTIONS")
        print("=" * 60)
        
        for i, (path, transcription) in enumerate(zip(args.audio, transcriptions), 1):
            print(f"\n[{i}] {path}")
            print(f"    {transcription}")
        
        # Save to file if requested
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                for path, transcription in zip(args.audio, transcriptions):
                    f.write(f"{path}\n{transcription}\n\n")
            print(f"\nTranscriptions saved to: {args.output}")
    
    except Exception as e:
        print(f"Error: {e}")
        return 1
    
    return 0

if __name__ == "__main__":
    raise SystemExit(main())

Troubleshooting

Audio File Not Loading

Verify that the file exists and that its format is supported:
import os

# Verify file exists
audio_path = "audio.wav"
if not os.path.exists(audio_path):
    print(f"File not found: {audio_path}")

# Check file format
import mimetypes
mime_type = mimetypes.guess_type(audio_path)[0]
print(f"Detected file type: {mime_type}")

# Supported formats: audio/wav, audio/mpeg, audio/flac, etc.

Poor Transcription Quality

Improve transcription quality:
  1. Increase beam search beams:
    params.set_search_options(num_beams=8)  # More beams
    
  2. Ensure correct language code:
    # Use correct language for best results
    decoder_prompt = ["<|startoftranscript|>", "<|es|>", ...]  # Spanish
    
  3. Check audio quality:
    • Ensure 16 kHz sampling rate
    • Minimize background noise
    • Use clear speech

Long Audio Files

For long audio files:
# Process in smaller chunks
# Whisper automatically handles up to 30 second segments
# For longer files, consider splitting before processing

# Or reduce batch size
batch_size = 1  # Process one file at a time
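If you do split long files yourself, a standard-library sketch of 30-second chunking might look like this (the chunk naming scheme is illustrative):

```python
import wave

CHUNK_SECONDS = 30

def split_wav(path):
    """Split a WAV file into CHUNK_SECONDS-long pieces; return their paths."""
    chunk_paths = []
    with wave.open(path, "rb") as wf:
        params = wf.getparams()
        frames_per_chunk = params.framerate * CHUNK_SECONDS
        idx = 0
        while True:
            frames = wf.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{path}.chunk{idx}.wav"   # illustrative naming scheme
            with wave.open(out_path, "wb") as out:
                out.setparams(params)             # nframes is fixed up on close
                out.writeframes(frames)
            chunk_paths.append(out_path)
            idx += 1
    return chunk_paths

# Demo: a 65-second silent mono file yields three chunks (30 + 30 + 5 s)
with wave.open("long.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(bytes(65 * 16000 * 2))

print(len(split_wav("long.wav")))  # 3
```

The resulting chunk paths can then be passed to og.Audios.open(*chunk_paths) and transcribed as a batch.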

Next Steps

Phi-4 Multi-Modal

Combine audio with vision using Phi-4

Model Optimization

Optimize Whisper for faster inference

Deployment Guide

Deploy Whisper to production

API Reference

Explore the full API documentation
