Overview
This example demonstrates how to build a multimodal AI application using ONNX Runtime GenAI in C#. The ModelMM example shows how to process image and audio inputs alongside text prompts, enabling vision-language and audio-language model interactions.
Key Features
- Multi-modal input processing: Handle images, audio, and text
- Streaming responses: Real-time token generation and display
- Multiple input formats: Support for various image and audio formats
- Model-specific formatting: Automatic adaptation to different model types (Phi-3, Phi-4, Qwen, Gemma)
Complete Implementation
The following code shows the complete ModelMM function for multimodal interactions:
Program.cs
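The full Program.cs is longer than what fits here; the following is a condensed sketch of the core flow, assuming the Microsoft.ML.OnnxRuntimeGenAI package and a Phi-3 Vision style model (method names, prompt templates, and the generation loop vary across releases and model families):

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class Program
{
    static void Main(string[] args)
    {
        // Hard-coded values for illustration; the real example parses
        // these from the command-line options documented below.
        string modelPath = "path/to/model";
        string[] imagePaths = { "image1.jpg" };
        string userPrompt = "Describe this image.";

        using var model = new Model(modelPath);
        using var processor = new MultiModalProcessor(model);
        using var tokenizerStream = processor.CreateStream();

        // Phi-3 Vision style template; other model families format differently.
        string prompt = $"<|user|>\n<|image_1|>\n{userPrompt}<|end|>\n<|assistant|>\n";

        // Encode the prompt and images into model-ready tensors.
        using var images = Images.Load(imagePaths);
        using var inputTensors = processor.ProcessImages(prompt, images);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 4096);
        generatorParams.SetInputs(inputTensors);

        // Stream tokens to the console as they are generated.
        using var generator = new Generator(model, generatorParams);
        while (!generator.IsDone())
        {
            generator.GenerateNextToken();
            var sequence = generator.GetSequence(0);
            Console.Write(tokenizerStream.Decode(sequence[sequence.Length - 1]));
        }
        Console.WriteLine();
    }
}
```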
Usage Examples
Process Images
Process Audio
Combined Inputs
Interactive Mode
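Assuming the example is launched with dotnet run, invocations for the four scenarios above might look like this (the model directory names are hypothetical):

```bash
# Process images
dotnet run -- --model_path ./models/phi-3.5-vision --image_paths image1.jpg,image2.png --user_prompt "Describe these images"

# Process audio
dotnet run -- --model_path ./models/phi-4-multimodal --audio_paths clip.wav --user_prompt "Transcribe this audio"

# Combined inputs
dotnet run -- --model_path ./models/phi-4-multimodal --image_paths photo.jpg --audio_paths clip.wav --user_prompt "Describe what you see and hear"

# Interactive mode: omit prompts and paths; the example asks for them at runtime
dotnet run -- --model_path ./models/phi-3.5-vision
```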
How It Works
1. Load Media Inputs
The example loads images and audio files using the GenAI API.

2. Format Content for Model Type
Different models require different input formatting. The example automatically adapts.

Supported Model Types
Phi-3 Vision / Phi-3.5 Vision

3. Process Multimodal Inputs
The MultiModalProcessor handles encoding of all input types:
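A sketch of that step, assuming the C# binding's ProcessImages method; the combined ProcessImagesAndAudios overload is an assumption based on newer releases, so check the API reference for your version:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model("path/to/model");
using var processor = new MultiModalProcessor(model);

string prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n";

using var images = Images.Load(new[] { "photo.jpg" });
using var audios = Audios.Load(new[] { "clip.wav" });

// Text + images encoded into the tensors the model expects.
using var imageInputs = processor.ProcessImages(prompt, images);

// Models such as Phi-4 Multimodal can take images and audio together.
using var combinedInputs = processor.ProcessImagesAndAudios(prompt, images, audios);
```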
4. Generate Response
Once inputs are set, token generation works the same as text-only models.

Key Components
MultiModalProcessor
Processes images and audio into tensors that the model can consume.

Images Class
Loads and manages image inputs.

Audios Class
Loads and manages audio inputs.

Media Input Methods
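As a sketch of both loaders (the array-taking Load overloads are assumed; single-path overloads also exist in some releases):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// Load one or more images from disk (e.g. JPEG, PNG).
using var images = Images.Load(new[] { "chart.png", "photo.jpg" });

// Load one or more audio files (e.g. WAV).
using var audios = Audios.Load(new[] { "speech.wav" });

// Both types are disposable, hence the using statements.
```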
Interactive Mode
When running in interactive mode, the example prompts for file paths.

Command-Line Arguments
Specify paths directly via command-line arguments.

File Path Formatting
- Paths can be absolute or relative
- Multiple paths separated by commas or spaces
- Paths with spaces should be quoted
- Files are validated before processing
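A simplified sketch of the parsing and validation rules above (the helper name ParseMediaPaths is illustrative, not part of the example, and this version does not handle quoted paths that contain spaces):

```csharp
using System;
using System.IO;
using System.Linq;

static string[] ParseMediaPaths(string input)
{
    // Split on commas or spaces and strip surrounding quotes.
    var paths = input
        .Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(p => p.Trim().Trim('"'))
        .Where(p => p.Length > 0)
        .ToArray();

    // Validate files before processing.
    foreach (var path in paths)
    {
        if (!File.Exists(path))
            Console.Error.WriteLine($"Warning: file not found: {path}");
    }

    return paths;
}
```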
Command-Line Options
| Option | Alias | Description |
|---|---|---|
| --model_path | -m | Path to the model directory |
| --image_paths | | Comma-separated image file paths |
| --audio_paths | | Comma-separated audio file paths |
| --execution_provider | -e | Execution provider (cpu, cuda, etc.) |
| --system_prompt | -sp | System prompt for the conversation |
| --user_prompt | -up | User prompt text |
| --verbose | -v | Enable verbose logging |
| --non_interactive | | Run once without interactive loop |
| --temperature | -t | Sampling temperature |
| --top_p | -p | Nucleus sampling probability |
| --max_length | -l | Maximum generation length |
Supported Models
The example works with various multimodal models:

- Phi-3 Vision - Image understanding
- Phi-3.5 Vision - Enhanced image processing
- Phi-4 Multimodal - Images and audio
- Qwen-2.5 VL - Vision-language tasks
- Fara - Multimodal understanding
- Gemma-3 - Structured multimodal inputs
Error Handling
The example includes robust error handling:

- File validation: Checks that media files exist before processing
- Format detection: Automatically detects and handles different model types
- Graceful fallbacks: Falls back to text-only if media loading fails
- User feedback: Clear error messages for invalid inputs
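The fallback behavior might be sketched like this (a fragment that assumes processor, prompt, and imagePaths from the surrounding program; passing a null Images for a text-only encoding is an assumption drawn from common sample usage, so verify it against your release):

```csharp
Images images = null;
try
{
    // File validation: fail early if any path is missing.
    foreach (var path in imagePaths)
    {
        if (!File.Exists(path))
            throw new FileNotFoundException("Media file not found", path);
    }
    images = Images.Load(imagePaths);
}
catch (Exception ex)
{
    // Graceful fallback with clear user feedback.
    Console.Error.WriteLine($"Could not load images ({ex.Message}); continuing text-only.");
}

using var inputs = processor.ProcessImages(prompt, images);
```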
Performance Considerations
- Media loading: Images and audio are loaded on-demand per query
- Memory management: Uses using statements for proper disposal
- Streaming output: Displays tokens as generated for responsive UX
- Timing metrics: Reports generation speed in tokens per second
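The streaming output and timing metrics can be combined in a few lines (a sketch that assumes the generator and tokenizerStream from the setup code):

```csharp
using System;
using System.Diagnostics;

var stopwatch = Stopwatch.StartNew();
int tokenCount = 0;

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    tokenCount++;

    // Stream each new token to the console as soon as it is decoded.
    var sequence = generator.GetSequence(0);
    Console.Write(tokenizerStream.Decode(sequence[sequence.Length - 1]));
}

stopwatch.Stop();
double tokensPerSecond = tokenCount / stopwatch.Elapsed.TotalSeconds;
Console.WriteLine($"\n\n{tokenCount} tokens in {stopwatch.Elapsed.TotalSeconds:F2} s ({tokensPerSecond:F1} tokens/s)");
```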
See Also
- C# Chat Example - Text-only streaming chat
- MultiModalProcessor API - API reference
- Images API - Image loading reference
- Audios API - Audio loading reference