Overview

This example demonstrates how to build a multimodal AI application using ONNX Runtime GenAI in C#. The ModelMM example shows how to process images and audio inputs alongside text prompts, enabling vision-language and audio-language model interactions.

Key Features

  • Multimodal input processing: Handle images, audio, and text
  • Streaming responses: Real-time token generation and display
  • Multiple input formats: Support for various image and audio formats
  • Model-specific formatting: Automatic adaptation to different model types (Phi-3, Phi-4, Qwen, Gemma)

Complete Implementation

The following code shows the complete ModelMM function for multimodal interactions:
Program.cs
void ModelMM(
    Model model,
    Tokenizer tokenizer,
    TokenizerStream tokenizerStream,
    MultiModalProcessor processor,
    GeneratorParamsArgs generatorParamsArgs,
    GuidanceArgs guidanceArgs,
    List<string> imagePaths,
    List<string> audioPaths,
    string modelPath,
    string systemPrompt,
    string userPrompt,
    bool interactive,
    bool verbose
)
{
    // Creating running list of messages
    var system_message = new Dictionary<string, string>
    {
        { "role", "system" },
        { "content", systemPrompt }
    };
    var input_list = new List<Dictionary<string, string>>() { system_message };

    // Get and set guidance info if requested
    string guidance_type = "";
    string guidance_data = "";
    string tools = "";
    if (!string.IsNullOrEmpty(guidanceArgs.response_format))
    {
        Console.WriteLine("Make sure your tool call start id and tool call end id are marked as special in tokenizer.json");
        (guidance_type, guidance_data, tools) = Common.GetGuidance(
            response_format: guidanceArgs.response_format,
            filepath: guidanceArgs.tools_file,
            text_output: guidanceArgs.text_output,
            tool_output: guidanceArgs.tool_output,
            tool_call_start: guidanceArgs.tool_call_start,
            tool_call_end: guidanceArgs.tool_call_end
        );
        input_list[0]["tools"] = tools;
    }

    // Streaming Q&A
    do
    {
        // Get images
        Images? images;
        int num_images;
        (images, num_images) = Common.GetUserImages(imagePaths, interactive);

        // Get audios
        Audios? audios;
        int num_audios;
        (audios, num_audios) = Common.GetUserAudios(audioPaths, interactive);

        // Get user prompt
        string text = Common.GetUserPrompt(userPrompt, interactive);
        if (string.Compare(text, "quit()", StringComparison.OrdinalIgnoreCase) == 0)
        {
            break;
        }

        // Construct user content based on inputs
        var user_content = Common.GetUserContent(model.GetModelType(), num_images, num_audios, text);

        // Add user message to list of messages
        var user_message = new Dictionary<string, string>
        {
            { "role", "user" },
            { "content", user_content }
        };
        input_list.Add(user_message);

        // Set search options for generator params
        using GeneratorParams generatorParams = new GeneratorParams(model);
        Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);

        // Initialize guidance if requested
        if (!string.IsNullOrEmpty(guidance_type) && !string.IsNullOrEmpty(guidance_data))
        {
            generatorParams.SetGuidance(guidance_type, guidance_data);
            if (verbose)
            {
                Console.WriteLine();
                Console.WriteLine($"Guidance type is: {guidance_type}");
                Console.WriteLine($"Guidance data is: \n{guidance_data}");
                Console.WriteLine();
            }
        }

        // Create generator
        using Generator generator = new Generator(model, generatorParams);
        if (verbose) Console.WriteLine("Generator created");

        // Apply chat template
        string prompt = "";
        try
        {
            string messages = JsonSerializer.Serialize(input_list);
            prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: true, tools);
        }
        catch
        {
            prompt = text;
        }
        if (verbose) Console.WriteLine($"Prompt: {prompt}");

        // Encode combined system + user prompt and append inputs to model
        using var inputTensors = processor.ProcessImagesAndAudios(prompt, images, audios);
        generator.SetInputs(inputTensors);

        // Run generation loop
        if (verbose) Console.WriteLine("Running generation loop...\n");
        Console.Write("Output: ");
        var watch = System.Diagnostics.Stopwatch.StartNew();
        while (true)
        {
            generator.GenerateNextToken();
            if (generator.IsDone())
            {
                break;
            }
            // Decode and print the next token
            Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
        }
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;

        // Remove user message from list of messages
        input_list.RemoveAt(input_list.Count - 1);

        // Display output and timings
        var totalTokens = (int)generator.TokenCount();
        Console.WriteLine();
        Console.WriteLine($"Streaming Tokens: {totalTokens}, Time: {runTimeInSeconds:0.00}, Tokens per second: {totalTokens / runTimeInSeconds:0.00}");
        Console.WriteLine();

    } while (interactive);
}

Usage Examples

Process Images

# Analyze a single image
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image.jpg \
  --user_prompt "What do you see in this image?"

# Process multiple images
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image1.jpg /path/to/image2.jpg \
  --user_prompt "Compare these two images"

Process Audio

# Transcribe or analyze audio
dotnet run --project ModelMM -- -m /path/to/model \
  --audio_paths /path/to/audio.wav \
  --user_prompt "Transcribe this audio"

Combined Inputs

# Process both images and audio
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image.jpg \
  --audio_paths /path/to/audio.wav \
  --user_prompt "Describe the image and transcribe the audio"

Interactive Mode

# Interactive mode with prompted inputs
dotnet run --project ModelMM -- -m /path/to/model

How It Works

1. Load Media Inputs

The example loads images and audio files using the GenAI API:
// Load images
(images, num_images) = Common.GetUserImages(imagePaths, interactive);

// Load audios
(audios, num_audios) = Common.GetUserAudios(audioPaths, interactive);

2. Format Content for Model Type

Different models require different input formatting. The example automatically adapts:
var user_content = Common.GetUserContent(
    model.GetModelType(), 
    num_images, 
    num_audios, 
    text
);

Supported Model Types

Phi-3 Vision / Phi-3.5 Vision:
<|image_1|>
<|image_2|>
User prompt text
Phi-4 Multimodal:
<|image_1|>
<|image_2|>
<|audio_1|>
User prompt text
Qwen-2.5 VL / Fara:
<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|>User prompt text
Gemma-3 (Structured Content):
[
  {"type": "image"},
  {"type": "image"},
  {"type": "text", "text": "User prompt text"}
]
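
The formats above can be sketched as a single switch on the model type. This is a hypothetical reconstruction of the role `Common.GetUserContent` plays in the example: the model-type keys (`"phi3v"`, `"phi4mm"`, `"qwen2_5vl"`) are assumed names rather than the library's actual identifiers, and Gemma-3's structured JSON content is omitted for brevity.

```csharp
using System.Text;

// Hypothetical sketch of model-type-specific content formatting. The tag
// strings follow the formats shown above; real identifiers may differ.
static string BuildUserContent(string modelType, int numImages, int numAudios, string text)
{
    var sb = new StringBuilder();
    switch (modelType)
    {
        case "phi3v":      // Phi-3 / Phi-3.5 Vision: numbered image tags
            for (int i = 1; i <= numImages; i++) sb.Append($"<|image_{i}|>\n");
            break;
        case "phi4mm":     // Phi-4 Multimodal: numbered image and audio tags
            for (int i = 1; i <= numImages; i++) sb.Append($"<|image_{i}|>\n");
            for (int i = 1; i <= numAudios; i++) sb.Append($"<|audio_{i}|>\n");
            break;
        case "qwen2_5vl":  // Qwen-2.5 VL / Fara: one pad block per image
            for (int i = 0; i < numImages; i++)
                sb.Append("<|vision_start|><|image_pad|><|vision_end|>");
            break;
    }
    sb.Append(text);
    return sb.ToString();
}
```

For example, `BuildUserContent("phi3v", 2, 0, "What do you see?")` yields the two numbered image tags followed by the prompt text, matching the Phi-3 Vision layout shown above.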

3. Process Multimodal Inputs

The MultiModalProcessor handles encoding of all input types:
using MultiModalProcessor processor = new MultiModalProcessor(model);
using var inputTensors = processor.ProcessImagesAndAudios(prompt, images, audios);
generator.SetInputs(inputTensors);

4. Generate Response

Once inputs are set, token generation works the same as text-only models:
while (true)
{
    generator.GenerateNextToken();
    if (generator.IsDone())
    {
        break;
    }
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}

Key Components

MultiModalProcessor

Processes images and audio into tensors that the model can consume:
using MultiModalProcessor processor = new MultiModalProcessor(model);

Images Class

Loads and manages image inputs:
var images = Images.Load(imagePaths.ToArray());

Audios Class

Loads and manages audio inputs:
var audios = Audios.Load(audioPaths.ToArray());

Media Input Methods

Interactive Mode

When running in interactive mode, the example prompts for file paths:
Image Path (comma separated; leave empty if no image): /path/to/img1.jpg, /path/to/img2.jpg
Audio Path (comma separated; leave empty if no audio): /path/to/audio.wav
Prompt (Use quit() to exit): What's in these files?

Command-Line Arguments

Specify paths directly via command-line arguments:
--image_paths /path/to/image1.jpg /path/to/image2.jpg
--audio_paths /path/to/audio1.wav /path/to/audio2.wav

File Path Formatting

  • Paths can be absolute or relative
  • Multiple paths separated by commas or spaces
  • Paths with spaces should be quoted
  • Files are validated before processing
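
The parsing and validation rules above can be sketched as follows. These helpers are illustrative, not the actual `Common` implementations: one splits a comma-separated input line into trimmed paths (allowing quoted paths with spaces), the other drops any path that does not point at an existing file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Split a comma-separated input line into trimmed paths, stripping
// surrounding quotes so paths containing spaces survive intact.
static List<string> ParseMediaPaths(string input) =>
    input.Split(',', StringSplitOptions.RemoveeEmptyEntries)
         .Select(p => p.Trim().Trim('"'))
         .Where(p => p.Length > 0)
         .ToList();

// Keep only paths that point at existing files, warning on the rest.
static List<string> ValidatePaths(IEnumerable<string> paths)
{
    var valid = new List<string>();
    foreach (var p in paths)
    {
        if (File.Exists(p)) valid.Add(p);
        else Console.Error.WriteLine($"Skipping missing file: {p}");
    }
    return valid;
}
```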

Command-Line Options

Option                 Alias  Description
--model_path           -m     Path to the model directory
--image_paths                 Comma-separated image file paths
--audio_paths                 Comma-separated audio file paths
--execution_provider   -e     Execution provider (cpu, cuda, etc.)
--system_prompt        -sp    System prompt for the conversation
--user_prompt          -up    User prompt text
--verbose              -v     Enable verbose logging
--non_interactive             Run once without interactive loop
--temperature          -t     Sampling temperature
--top_p                -p     Nucleus sampling probability
--max_length           -l     Maximum generation length

Supported Models

The example works with various multimodal models:
  • Phi-3 Vision - Image understanding
  • Phi-3.5 Vision - Enhanced image processing
  • Phi-4 Multimodal - Images and audio
  • Qwen-2.5 VL - Vision-language tasks
  • Fara - Multimodal understanding
  • Gemma-3 - Structured multimodal inputs

Error Handling

The example includes robust error handling:
  • File validation: Checks that media files exist before processing
  • Format detection: Automatically detects and handles different model types
  • Graceful fallbacks: Falls back to text-only if media loading fails
  • User feedback: Clear error messages for invalid inputs

Performance Considerations

  • Media loading: Images and audio are loaded on-demand per query
  • Memory management: Uses using statements for proper disposal
  • Streaming output: Displays tokens as generated for responsive UX
  • Timing metrics: Reports generation speed in tokens per second
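
The throughput figure printed after each generation loop is simply the token count divided by elapsed wall-clock seconds. A minimal sketch of that formatting, matching the line emitted by the `ModelMM` function:

```csharp
// Minimal sketch of the timing line printed after each generation loop:
// total tokens over elapsed wall-clock seconds, two decimal places.
static string FormatThroughput(int totalTokens, double runTimeInSeconds) =>
    $"Streaming Tokens: {totalTokens}, Time: {runTimeInSeconds:0.00}, " +
    $"Tokens per second: {totalTokens / runTimeInSeconds:0.00}";
```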
