ONNX Runtime GenAI provides comprehensive support for multi-modal models that can process multiple types of inputs such as images, audio, and text. This enables powerful AI applications that can understand and generate responses based on visual and auditory information alongside text.

Multi-Modal Capabilities

ONNX Runtime GenAI supports two primary categories of multi-modal models:

Vision Models

Process images alongside text for visual understanding and reasoning tasks

Audio Models

Process audio inputs for speech recognition and transcription

Supported Model Types

Vision Models

ONNX Runtime GenAI supports several state-of-the-art vision-language models:
  • Phi-3 Vision: 128k context length vision model
  • Phi-3.5 Vision: Enhanced vision capabilities
  • Phi-4 Multi-Modal: Latest model with both vision and audio support. Learn more about Phi Vision models.
  • Qwen2.5-VL: Advanced vision-language model with 3D positional encoding
      • Supports dynamic image resolution and multi-image inputs
      • M-RoPE (Multimodal Rotary Position Embedding) for better spatial understanding
    Learn more about Qwen Vision models.
  • Gemma-3 Vision: Google’s multi-modal model family
      • Available in multiple sizes: 4B, 12B, and 27B parameters
      • Supports BF16 and FP16 precision
    Learn more about Gemma Vision models.

Audio Models

Whisper

OpenAI’s Whisper models for speech recognition and transcription, with support for multiple languages and beam search decoding. Learn more about Whisper.

Multi-Modal Processing Architecture

Multi-modal models in ONNX Runtime GenAI consist of multiple components that work together:

1. Input Processing: Images are preprocessed through vision encoders, and audio through speech encoders, to create embeddings.

2. Embedding Generation: Visual or audio data is converted into embeddings that can be combined with text embeddings.

3. Fusion: Multi-modal embeddings are fused with text embeddings using specialized fusion layers.

4. Language Model Processing: The combined embeddings are processed by the language model to generate responses.
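The four stages above can be sketched in miniature. This is an illustrative numpy mock, not the ONNX Runtime GenAI internals: the encoders here are stand-in functions, and the dimensions are made up for demonstration.

```python
import numpy as np

# Illustrative sketch of the four processing stages; the functions below
# stand in for the real encoders and are NOT the ONNX Runtime GenAI API.

def encode_image(image: np.ndarray, dim: int = 8) -> np.ndarray:
    """Stages 1-2: a vision encoder maps an image to a sequence of embeddings."""
    n_patches = 4  # e.g. the image is split into 4 patches
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_patches, dim))

def encode_text(token_ids: list, dim: int = 8) -> np.ndarray:
    """Text tokens are embedded via the language model's embedding table."""
    rng = np.random.default_rng(1)
    table = rng.standard_normal((1000, dim))
    return table[token_ids]

def fuse(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Stage 3: image embeddings are spliced into the text embedding sequence."""
    return np.concatenate([image_emb, text_emb], axis=0)

# Stage 4 would feed `fused` through the decoder-only language model.
fused = fuse(encode_image(np.zeros((32, 32, 3))), encode_text([5, 17, 42]))
print(fused.shape)  # (7, 8): 4 image patches + 3 text tokens
```

The key point is that after fusion the language model sees one uniform sequence of embeddings; it does not distinguish image positions from text positions except through the position encodings.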

Working with Multi-Modal Models

Using the Multi-Modal Processor

All multi-modal models use a unified processor interface:
import onnxruntime_genai as og

# Load model
model = og.Model("path/to/model")
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

Processing Different Input Types

# Load images
images = og.Images.open("image1.jpg", "image2.png")

# Process with prompt
prompt = "What do you see in these images?"
inputs = processor(prompt, images=images)
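Audio inputs follow the same pattern. This sketch assumes an audio-capable model such as Phi-4 Multi-Modal; the model path and file names are placeholders, and it uses the `og.Audios` class and the processor's `audios` keyword, which require a build of onnxruntime_genai with audio support.

```python
import onnxruntime_genai as og

# Assumes an audio-capable model (e.g. Phi-4 Multi-Modal); paths are placeholders.
model = og.Model("path/to/model")
processor = model.create_multimodal_processor()

# Load one or more audio files
audios = og.Audios.open("speech1.wav", "speech2.wav")

# Process audio together with a prompt; for models that accept both modalities,
# images= and audios= can be passed in the same call.
prompt = "Transcribe the audio."
inputs = processor(prompt, audios=audios)
```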

Input Preprocessing

Vision Input Preprocessing

Vision models require images to be preprocessed according to model-specific requirements:
  • Image Resizing: Images are resized to match the model’s expected input dimensions
  • Normalization: Pixel values are normalized (typically to [0, 1] or [-1, 1])
  • Patch Embedding: Images are divided into patches and embedded
  • Position Encoding: Spatial position information is encoded
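The vision preprocessing steps above can be hand-rolled as a small numpy sketch. In practice these transforms are driven by `processor_config.json`; the target size, normalization range, and patch size here are illustrative, not any model's actual specification.

```python
import numpy as np

# Illustrative vision preprocessing: resize, normalize, patchify.
# Real models define these parameters in processor_config.json.

def preprocess(image: np.ndarray, size: int = 8, patch: int = 4) -> np.ndarray:
    h, w, c = image.shape
    # Image Resizing: nearest-neighbor index sampling down to size x size
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = image[ys][:, xs]
    # Normalization: map uint8 pixels to [-1, 1]
    norm = resized.astype(np.float32) / 127.5 - 1.0
    # Patch Embedding input: split into non-overlapping patches, flatten each
    g = size // patch
    p = norm.reshape(g, patch, g, patch, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((32, 48, 3), dtype=np.uint8)
print(preprocess(img).shape)  # (4, 48): a 2x2 grid of 4x4x3 patches
```

A real vision encoder would then project each flattened patch to the embedding dimension and add the spatial position encoding.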

Audio Input Preprocessing

Audio models process raw audio waveforms:
  • Sampling Rate: Audio is resampled to the model’s expected rate (e.g., 16kHz for Whisper)
  • Feature Extraction: Spectrograms or mel-frequency features are computed
  • Windowing: Audio is divided into overlapping windows for processing
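The windowing step can be sketched with numpy alone. The frame and hop sizes below mirror Whisper's 25 ms window and 10 ms hop at 16 kHz, but the code is illustrative; a real front-end would follow this with an FFT and mel filterbank to produce the spectrogram features.

```python
import numpy as np

# Illustrative audio windowing; sizes mirror Whisper's front-end at 16 kHz.
SAMPLE_RATE = 16000
FRAME = 400   # 25 ms at 16 kHz
HOP = 160     # 10 ms at 16 kHz

def frame_audio(waveform: np.ndarray) -> np.ndarray:
    """Split a mono waveform into overlapping, tapered windows."""
    n_frames = 1 + (len(waveform) - FRAME) // HOP
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n_frames)[:, None]
    return waveform[idx] * np.hanning(FRAME)  # Hann taper per window

# One second of audio yields 1 + (16000 - 400) // 160 = 98 frames
frames = frame_audio(np.zeros(SAMPLE_RATE, dtype=np.float32))
print(frames.shape)  # (98, 400)
```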

Generation with Multi-Modal Inputs

import onnxruntime_genai as og

# Setup model
config = og.Config("path/to/model")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Load inputs
images = og.Images.open("image.jpg")
prompt = "Describe this image in detail."

# Process inputs
inputs = processor(prompt, images=images)

# Create generator
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048, do_sample=True, top_p=0.9, temperature=0.7)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

# Generate the response token by token, streaming the decoded text
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

Execution Providers

Multi-modal models support multiple execution providers. For example, an example script can select the CPU provider with the -e flag:
python model-mm.py -m ./model/cpu -e cpu
For best performance with vision models, use CUDA or DirectML execution providers. CPU inference is supported but may be slower for large vision models.
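A runner script like the one invoked above might map the `-e` flag to a provider as sketched here. The argument names mirror the command line shown; the commented `og.Config` calls reflect the onnxruntime_genai provider API but are assumptions about this particular script and are not executed.

```python
import argparse

# Illustrative sketch of how a script such as model-mm.py might map -e to a
# provider. The og.Config lines are shown as comments and are not executed.

def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model_path", required=True)
    parser.add_argument("-e", "--execution_provider",
                        choices=["cpu", "cuda", "dml"], default="cpu")
    return parser.parse_args(argv)

args = parse_args(["-m", "./model/cpu", "-e", "cpu"])
# config = og.Config(args.model_path)
# if args.execution_provider != "cpu":
#     config.clear_providers()
#     config.append_provider(args.execution_provider)
print(args.execution_provider)  # cpu
```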

Configuration Files

Multi-modal models require specific configuration files:

Required Files

  • genai_config.json: Core model configuration including model paths, precision, and execution settings
  • processor_config.json: Vision processor configuration (for vision models)
  • speech_processor.json: Audio processor configuration (for audio-enabled models like Phi-4)
  • vision_processor.json: Vision processor configuration (for Phi-4)

Example Configuration Structure

{
  "model": {
    "decoder": {
      "filename": "model.onnx",
      "session_options": {
        "log_severity_level": 2,
        "provider_options": []
      }
    },
    "vision": {
      "inputs": {
        "pixel_values": "pixel_values"
      },
      "filename": "vision_model.onnx"
    }
  }
}
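A quick sanity check over a configuration like the example above can catch missing files before model load. The keys checked here are taken from that example, not from a complete schema.

```python
import json

# Minimal sanity check of the example genai_config.json structure above.
EXAMPLE = """
{
  "model": {
    "decoder": {
      "filename": "model.onnx",
      "session_options": {
        "log_severity_level": 2,
        "provider_options": []
      }
    },
    "vision": {
      "inputs": {
        "pixel_values": "pixel_values"
      },
      "filename": "vision_model.onnx"
    }
  }
}
"""

config = json.loads(EXAMPLE)
decoder = config["model"]["decoder"]
assert decoder["filename"].endswith(".onnx")

# Vision-capable models additionally carry a "vision" section
has_vision = "vision" in config["model"]
print(decoder["filename"], has_vision)  # model.onnx True
```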

Best Practices

Batch Processing

Process multiple images or audio files in batches for better throughput

Precision Selection

Use FP16 or BF16 for faster inference on compatible hardware

Memory Management

Monitor memory usage with large images or long audio files

Input Validation

Validate image formats and audio sampling rates before processing
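The validation practice above can be as simple as a pre-flight check. The accepted image extensions and the 16 kHz expectation below are examples (Whisper's rate), not universal requirements; consult your model's processor configuration for the actual constraints.

```python
from pathlib import Path

# Illustrative pre-flight checks; formats and rates are example values only.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

def validate_image(path: str) -> None:
    ext = Path(path).suffix.lower()
    if ext not in IMAGE_EXTS:
        raise ValueError(f"unsupported image format: {ext}")

def validate_audio(sample_rate: int, expected: int = 16000) -> None:
    if sample_rate != expected:
        raise ValueError(
            f"resample from {sample_rate} Hz to {expected} Hz before processing")

validate_image("photo.png")  # passes silently
try:
    validate_audio(44100)
except ValueError as e:
    print(e)  # resample from 44100 Hz to 16000 Hz before processing
```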

Next Steps

Phi Vision Models

Learn how to use Microsoft’s Phi vision models

Qwen Vision Models

Explore Qwen’s advanced vision capabilities

Gemma Vision Models

Work with Google’s Gemma vision models

Whisper Audio

Process audio with Whisper models
