Multi-Modal Capabilities
ONNX Runtime GenAI supports two primary categories of multi-modal models:
Vision Models
Process images alongside text for visual understanding and reasoning tasks
Audio Models
Process audio inputs for speech recognition and transcription
Supported Model Types
Vision Models
ONNX Runtime GenAI supports several state-of-the-art vision-language models:
Phi Vision Models
- Phi-3 Vision: 128k context length vision model
- Phi-3.5 Vision: Enhanced vision capabilities
- Phi-4 Multi-Modal: Latest model with both vision and audio support
Qwen Vision Models
- Qwen2.5-VL: Advanced vision-language model with 3D positional encoding
- Supports dynamic image resolution and multi-image inputs
- M-RoPE (Multimodal Rotary Position Embedding) for better spatial understanding
Gemma Vision Models
- Gemma-3 Vision: Google’s multi-modal model family
- Available in multiple sizes: 4B, 12B, 27B parameters
- Supports BF16 and FP16 precision
Audio Models
Whisper
OpenAI’s Whisper models for speech recognition and transcription, with support for multiple languages and beam search decoding.
Learn more about Whisper
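As a quick illustration, a transcription flow with the Python API might look like the sketch below. The model folder and audio path are placeholders, and the decoder prompt tokens, beam settings, and final decode call are assumptions drawn from typical Whisper usage; details may differ by export and package version.

```python
import onnxruntime_genai as og

# Hypothetical local path to an ONNX Runtime GenAI Whisper export.
model = og.Model("whisper-small-onnx")
processor = model.create_multimodal_processor()

# Load the audio file to transcribe (hypothetical path).
audios = og.Audios.open("speech.wav")

# Whisper decoding is steered by special decoder prompt tokens.
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
inputs = processor(prompt, audios=audios)

params = og.GeneratorParams(model)
params.set_search_options(do_sample=False, num_beams=4, max_length=448)
params.set_inputs(inputs)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.generate_next_token()

# Decode the best sequence back into text.
print(processor.decode(generator.get_sequence(0)))
```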
Multi-Modal Processing Architecture
Multi-modal models in ONNX Runtime GenAI consist of multiple components that work together:
Input Processing
Images are preprocessed through vision encoders, and audio through speech encoders to create embeddings.
Embedding Generation
Visual or audio data is converted into embeddings that can be combined with text embeddings.
Working with Multi-Modal Models
Using the Multi-Modal Processor
All multi-modal models use a unified processor interface (a minimal Python sketch follows the list of input types below):
Processing Different Input Types
- Vision
- Audio
- Mixed (Phi-4)
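The sketch below shows the vision path with the Python API. The model folder, image path, and question are placeholders, and the `<|image_1|>` prompt tag is an example of Phi-3 Vision's model-specific prompt format; other models use different templates.

```python
import onnxruntime_genai as og

# Load a multi-modal model from a local ONNX Runtime GenAI export (hypothetical path).
model = og.Model("phi-3-vision-onnx")

# Every multi-modal model exposes the same processor interface.
processor = model.create_multimodal_processor()

# Open one or more images; og.Images handles decoding and format conversion.
images = og.Images.open("chart.png")

# Prompt format is model-specific; Phi-3 Vision uses numbered image tags.
prompt = "<|user|>\n<|image_1|>\nWhat does this chart show?<|end|>\n<|assistant|>\n"

# The processor combines the text prompt and images into model-ready inputs.
inputs = processor(prompt, images=images)
```

For audio-capable models such as Phi-4 Multi-Modal, the same call can also take `audios=og.Audios.open(...)`, alongside or instead of `images`.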
Input Preprocessing
Vision Input Preprocessing
Vision models require images to be preprocessed according to model-specific requirements (a conceptual sketch follows this list):
- Image Resizing: Images are resized to match the model’s expected input dimensions
- Normalization: Pixel values are normalized (typically to [0, 1] or [-1, 1])
- Patch Embedding: Images are divided into patches and embedded
- Position Encoding: Spatial position information is encoded
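The multi-modal processor performs these steps internally based on the model's processor configuration, so you normally never write them yourself. Purely as a conceptual illustration (not the library's actual code), the sketch below shows resizing, normalization, and patch extraction with Pillow and NumPy; the 336x336 input size and 14-pixel patches are assumed values for illustration only.

```python
import numpy as np
from PIL import Image

# Illustrative values only; real sizes come from the model's processor config.
INPUT_SIZE = 336
PATCH_SIZE = 14

# Image resizing: match the model's expected input dimensions.
image = Image.open("chart.png").convert("RGB").resize((INPUT_SIZE, INPUT_SIZE))

# Normalization: scale pixel values to [0, 1] (some models use [-1, 1] or mean/std).
pixels = np.asarray(image, dtype=np.float32) / 255.0          # (336, 336, 3)

# Patch embedding input: split the image into non-overlapping patches.
h = w = INPUT_SIZE // PATCH_SIZE                               # 24 x 24 patches
patches = pixels.reshape(h, PATCH_SIZE, w, PATCH_SIZE, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(h * w, -1)  # (576, 588)

# Position encoding would then attach each patch's (row, col) location.
print(patches.shape)
```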
Audio Input Preprocessing
Audio models process raw audio waveforms (a conceptual sketch follows this list):
- Sampling Rate: Audio is resampled to the model’s expected rate (e.g., 16 kHz for Whisper)
- Feature Extraction: Spectrograms or mel-frequency features are computed
- Windowing: Audio is divided into overlapping windows for processing
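As with images, the processor handles this from its audio configuration. The sketch below is only a conceptual illustration of the same steps using librosa (an assumption, not what the library uses internally); the 16 kHz rate, 25 ms windows, 10 ms hop, and 80 mel bins mirror common Whisper settings.

```python
import librosa
import numpy as np

# Sampling rate: resample the waveform to the model's expected rate (16 kHz here).
waveform, sr = librosa.load("speech.wav", sr=16000)

# Feature extraction + windowing: a log-mel spectrogram over 25 ms windows
# with a 10 ms hop (400 / 160 samples at 16 kHz) and 80 mel bins, as in Whisper.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log10(np.maximum(mel, 1e-10))

print(log_mel.shape)  # (80, num_frames)
```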
Generation with Multi-Modal Inputs
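Continuing the processor sketch above, generation follows the standard GeneratorParams/Generator pattern. The streaming decode loop below assumes a recent package version (older releases also required an explicit compute_logits() call), and the max_length value is only illustrative.

```python
# 'model', 'processor', and 'inputs' come from the processor sketch above.
params = og.GeneratorParams(model)
params.set_inputs(inputs)                     # attach the multi-modal inputs
params.set_search_options(max_length=3072)    # illustrative limit

generator = og.Generator(model, params)
stream = processor.create_stream()            # streaming token decoder

# Generate and print tokens as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(stream.decode(token), end="", flush=True)
print()

del generator  # release KV-cache memory promptly
```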
Execution Providers
Multi-modal models support multiple execution providers for optimal performance.
For best performance with vision models, use the CUDA or DirectML execution providers. CPU inference is supported but may be slower for large vision models.
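Recent package versions expose provider selection through og.Config, as sketched below; the API shape, the CUDA-capable build, and the model path are assumptions, and you can also set providers directly in genai_config.json.

```python
import onnxruntime_genai as og

# Load the model configuration from the export folder (hypothetical path).
config = og.Config("phi-3-vision-onnx")

# Clear any providers baked into genai_config.json, then pick one explicitly.
config.clear_providers()
config.append_provider("cuda")   # or "dml" on Windows; omit for CPU

model = og.Model(config)
```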
Configuration Files
Multi-modal models require specific configuration files:
Required Files
- genai_config.json: Core model configuration including model paths, precision, and execution settings
- processor_config.json: Vision processor configuration (for vision models)
- speech_processor.json: Audio processor configuration (for audio-enabled models like Phi-4)
- vision_processor.json: Vision processor configuration (for Phi-4)
Example Configuration Structure
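The abbreviated genai_config.json excerpt below is an assumed illustration of the general shape for a vision model; most fields are omitted and names may vary by model, so consult a real export for the authoritative schema.

```json
{
  "model": {
    "type": "phi3v",
    "context_length": 131072,
    "vision": {
      "filename": "vision.onnx"
    },
    "embedding": {
      "filename": "embedding.onnx"
    },
    "decoder": {
      "filename": "model.onnx",
      "session_options": {
        "provider_options": []
      }
    }
  },
  "search": {
    "max_length": 131072
  }
}
```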
Best Practices
Batch Processing
Process multiple images or audio files in batches for better throughput
Precision Selection
Use FP16 or BF16 for faster inference on compatible hardware
Memory Management
Monitor memory usage with large images or long audio files
Input Validation
Validate image formats and audio sampling rates before processing
Next Steps
Phi Vision Models
Learn how to use Microsoft’s Phi vision models
Qwen Vision Models
Explore Qwen’s advanced vision capabilities
Gemma Vision Models
Work with Google’s Gemma vision models
Whisper Audio
Process audio with Whisper models