Overview
Whisper models provide:
- Multi-language support: Transcribe audio in 99 languages
- Robust performance: Trained on 680,000 hours of multilingual data
- Translation capability: Translate foreign language speech to English
- Punctuation and casing: Automatic formatting of transcriptions
- Timestamp support: Optional word-level or segment-level timestamps
Whisper models in ONNX Runtime GenAI use beam search decoding for high-quality transcriptions.
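Beam search keeps the k highest-scoring partial hypotheses at each decoding step instead of greedily committing to the single best token. The idea can be sketched in a few lines of self-contained Python; the vocabulary and scorer below are toys standing in for the Whisper decoder's next-token distribution, not the ONNX Runtime GenAI API:

```python
def beam_search(score_fn, vocab, num_beams=3, max_len=4):
    """Keep the `num_beams` best partial sequences at each step.

    score_fn(seq, token) returns a log-probability for appending
    `token` to the partial sequence `seq` (a toy stand-in for the
    decoder's next-token distribution).
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok in vocab:
                candidates.append((seq + [tok], score + score_fn(seq, tok)))
        # Prune to the top-k hypotheses by cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# Toy scorer: reward tokens that sort after the previous token.
def toy_score(seq, tok):
    prev = seq[-1] if seq else ""
    return 0.0 if tok > prev else -1.0

hyps = beam_search(toy_score, ["a", "b", "c"], num_beams=2)
for seq, score in hyps:
    print("".join(seq), score)
```

Because the search returns all surviving beams, the same structure is what lets the runtime expose multiple alternative hypotheses for one input.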
Model Architecture
Whisper uses an encoder-decoder transformer architecture:
- Audio Encoder: Processes audio spectrograms into embeddings
- Text Decoder: Generates transcription tokens autoregressively
- Multi-task Framework: Supports transcription, translation, and language detection
Audio Preprocessing
Whisper expects audio to be:
- Sampling Rate: 16 kHz
- Format: Mono channel
- Duration: Up to 30 seconds per segment (longer audio is automatically chunked)
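Automatic chunking amounts to splitting the sample buffer into fixed 30-second windows at 16 kHz. A minimal sketch (the helper name is illustrative, not part of the library API):

```python
SAMPLE_RATE = 16_000   # Whisper's expected sampling rate
CHUNK_SECONDS = 30     # maximum segment length

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D sample sequence into segments of at most 30 seconds."""
    chunk_len = sample_rate * chunk_seconds
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# 70 seconds of silence splits into 30 s + 30 s + 10 s segments.
chunks = chunk_audio([0.0] * (70 * SAMPLE_RATE))
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```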
Using Whisper Models
Basic Transcription
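A minimal transcription sketch, assuming the onnxruntime-genai Python bindings and the processor/audio API used in its Whisper examples. Names such as `create_multimodal_processor` and `og.Audios.open`, the search option names, and `model_dir` are assumptions to verify against your installed release:

```python
def transcribe(model_dir, audio_path, language="en"):
    """Transcribe one audio file (sketch; verify API names against
    your installed onnxruntime-genai version)."""
    import onnxruntime_genai as og  # deferred: optional dependency

    model = og.Model(model_dir)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)

    # Load the audio and build Whisper's decoder prompt tokens.
    audios = og.Audios.open(audio_path)
    prompt = f"<|startoftranscript|><|{language}|><|transcribe|><|notimestamps|>"
    inputs = processor(prompt, audios=audios)

    params = og.GeneratorParams(model)
    params.set_inputs(inputs)
    params.set_search_options(num_beams=4, max_length=448)

    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))
```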
Multi-File Batch Processing
Beam Search Results
Access multiple beam search hypotheses:
Language Support
Transcribe in Different Languages
Supported Language Codes
Common Language Codes
- <|en|>: English
- <|es|>: Spanish
- <|fr|>: French
- <|de|>: German
- <|it|>: Italian
- <|pt|>: Portuguese
- <|ru|>: Russian
- <|ja|>: Japanese
- <|ko|>: Korean
- <|zh|>: Chinese
- <|ar|>: Arabic
- <|hi|>: Hindi
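Whisper selects language and task through special decoder prompt tokens: a transcription prompt begins `<|startoftranscript|><|xx|><|transcribe|>`, and translation to English swaps in `<|translate|>`. A small helper to build such a prompt string (the helper itself is illustrative, not a library function):

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Build the decoder prompt token string for a language/task pair."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)

print(whisper_prompt("es"))
# <|startoftranscript|><|es|><|transcribe|><|notimestamps|>
print(whisper_prompt("fr", task="translate"))
# <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```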
Translation to English
Translate non-English audio to English:
Audio Input Handling
Supported Audio Formats
Whisper supports common audio formats:
- WAV
- MP3
- FLAC
- OGG
- M4A
Loading Audio Files
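For WAV input, the Python standard library is enough; a minimal loader that reads 16-bit PCM into floats in [-1, 1] (real applications typically use a dedicated audio library for the other formats):

```python
import os
import struct
import tempfile
import wave

def load_wav(path):
    """Read a PCM WAV file as (float samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # Assumes 16-bit PCM, the most common WAV encoding.
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    samples = [s / 32768.0 for s in ints]
    if channels == 2:
        # Interleaved stereo: average each left/right pair to mono.
        samples = [(l + r) / 2 for l, r in zip(samples[::2], samples[1::2])]
    return samples, rate

# Round-trip check: write a tiny mono 16 kHz file and read it back.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16_000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 0))

samples, rate = load_wav(path)
print(rate, samples)  # 16000 [0.0, 0.5, -0.5, 0.0]
```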
Audio Preprocessing
Audio is automatically preprocessed:
- Resampling: Converted to 16 kHz sampling rate
- Channel Mixing: Stereo audio converted to mono
- Normalization: Audio levels normalized
- Feature Extraction: Converted to mel-spectrogram features
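The first three steps can be sketched in plain Python. Linear-interpolation resampling here is a toy stand-in for the higher-quality filters real preprocessors use, and mel-spectrogram extraction is omitted:

```python
def resample(samples, src_rate, dst_rate=16_000):
    """Linear-interpolation resampling (toy stand-in for a proper filter)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def to_mono(left, right):
    """Mix stereo channels down to a single channel."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def normalize(samples):
    """Scale so the loudest sample has magnitude 1."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]

mono = to_mono([0.2, 0.4], [0.0, 0.0])  # -> [0.1, 0.2]
result = normalize(resample(mono, src_rate=8_000))
print(result)
```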
Advanced Usage
Custom Search Parameters
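A sketch of configuring custom search parameters through the onnxruntime-genai Python bindings. The option names (`num_beams`, `length_penalty`, `early_stopping`, `max_length`) and `model_dir` are assumptions to verify against the API reference for your installed release:

```python
def build_generator(model_dir, num_beams=5):
    """Configure beam-search options (sketch; option names assumed)."""
    import onnxruntime_genai as og  # deferred: optional dependency

    model = og.Model(model_dir)
    params = og.GeneratorParams(model)
    params.set_search_options(
        num_beams=num_beams,   # hypotheses kept at each decoding step
        length_penalty=1.0,    # >1.0 favors longer transcriptions
        early_stopping=True,   # stop once all beams emit end-of-text
        max_length=448,        # Whisper decoder context limit
    )
    return og.Generator(model, params)
```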
Interactive Transcription
Performance Optimization
Execution Providers
Choose the best execution provider for your hardware:
- CUDA (NVIDIA): Best for NVIDIA GPUs. Provides the fastest inference.
- CPU: Runs anywhere; no GPU required.
Beam Search Trade-offs
Adjust beam search parameters based on your needs: more beams generally improve transcription quality but increase latency and memory use, while fewer beams are faster.
Batch Processing
Process multiple files together for better throughput:
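Grouping files into fixed-size batches is a simple pre-step before handing each batch to the model; the file names below are placeholders:

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]
batches = list(batched(files, batch_size=2))
print(batches)  # [['a.wav', 'b.wav'], ['c.wav', 'd.wav'], ['e.wav']]
```

Larger batches improve throughput up to the point where memory becomes the constraint.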
Example Application: Audio Transcription CLI
Troubleshooting
Audio File Not Loading
Poor Transcription Quality
Improve transcription quality:
- Increase the number of beam search beams
- Ensure the correct language code is set
- Check audio quality:
  - Ensure 16 kHz sampling rate
  - Minimize background noise
  - Use clear speech
Out of Memory
For long audio files, split the audio into segments of 30 seconds or less and transcribe each segment separately to bound memory use.
Next Steps
Phi-4 Multi-Modal
Combine audio with vision using Phi-4
Model Optimization
Optimize Whisper for faster inference
Deployment Guide
Deploy Whisper to production
API Reference
Explore the full API documentation