Overview

This example demonstrates how to build a multimodal AI application using ONNX Runtime GenAI in C#. The ModelMM example shows how to process images and audio inputs alongside text prompts, enabling vision-language and audio-language model interactions.

Key Features

  • Multimodal input processing: Handle images, audio, and text
  • Streaming responses: Real-time token generation and display
  • Multiple input formats: Support for various image and audio formats
  • Model-specific formatting: Automatic adaptation to different model types (Phi-3, Phi-4, Qwen, Gemma)

Complete Implementation

The following code shows the complete ModelMM function for multimodal interactions:
Program.cs
void ModelMM(
    Model model,
    Tokenizer tokenizer,
    TokenizerStream tokenizerStream,
    MultiModalProcessor processor,
    GeneratorParamsArgs generatorParamsArgs,
    GuidanceArgs guidanceArgs,
    List<string> imagePaths,
    List<string> audioPaths,
    string modelPath,
    string systemPrompt,
    string userPrompt,
    bool interactive,
    bool verbose
)
{
    // Creating running list of messages
    var system_message = new Dictionary<string, string>
    {
        { "role", "system" },
        { "content", systemPrompt }
    };
    var input_list = new List<Dictionary<string, string>>() { system_message };

    // Get and set guidance info if requested
    string guidance_type = "";
    string guidance_data = "";
    string tools = "";
    if (!string.IsNullOrEmpty(guidanceArgs.response_format))
    {
        Console.WriteLine("Make sure your tool call start id and tool call end id are marked as special in tokenizer.json");
        (guidance_type, guidance_data, tools) = Common.GetGuidance(
            response_format: guidanceArgs.response_format,
            filepath: guidanceArgs.tools_file,
            text_output: guidanceArgs.text_output,
            tool_output: guidanceArgs.tool_output,
            tool_call_start: guidanceArgs.tool_call_start,
            tool_call_end: guidanceArgs.tool_call_end
        );
        input_list[0]["tools"] = tools;
    }

    // Streaming Q&A
    do
    {
        // Get images
        Images? images;
        int num_images;
        (images, num_images) = Common.GetUserImages(imagePaths, interactive);

        // Get audios
        Audios? audios;
        int num_audios;
        (audios, num_audios) = Common.GetUserAudios(audioPaths, interactive);

        // Get user prompt
        string text = Common.GetUserPrompt(userPrompt, interactive);
        if (string.Compare(text, "quit()", StringComparison.OrdinalIgnoreCase) == 0)
        {
            break;
        }

        // Construct user content based on inputs
        var user_content = Common.GetUserContent(model.GetModelType(), num_images, num_audios, text);

        // Add user message to list of messages
        var user_message = new Dictionary<string, string>
        {
            { "role", "user" },
            { "content", user_content }
        };
        input_list.Add(user_message);

        // Set search options for generator params
        using GeneratorParams generatorParams = new GeneratorParams(model);
        Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);

        // Initialize guidance if requested
        if (!string.IsNullOrEmpty(guidance_type) && !string.IsNullOrEmpty(guidance_data))
        {
            generatorParams.SetGuidance(guidance_type, guidance_data);
            if (verbose)
            {
                Console.WriteLine();
                Console.WriteLine($"Guidance type is: {guidance_type}");
                Console.WriteLine($"Guidance data is: \n{guidance_data}");
                Console.WriteLine();
            }
        }

        // Create generator
        using Generator generator = new Generator(model, generatorParams);
        if (verbose) Console.WriteLine("Generator created");

        // Apply chat template
        string prompt = "";
        try
        {
            string messages = JsonSerializer.Serialize(input_list);
            prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: true, tools);
        }
        catch
        {
            prompt = text;
        }
        if (verbose) Console.WriteLine($"Prompt: {prompt}");

        // Encode combined system + user prompt and append inputs to model
        using var inputTensors = processor.ProcessImagesAndAudios(prompt, images, audios);
        generator.SetInputs(inputTensors);

        // Run generation loop
        if (verbose) Console.WriteLine("Running generation loop...\n");
        Console.Write("Output: ");
        var watch = System.Diagnostics.Stopwatch.StartNew();
        while (true)
        {
            generator.GenerateNextToken();
            if (generator.IsDone())
            {
                break;
            }
            // Decode and print the next token
            Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
        }
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;

        // Remove user message from list of messages
        input_list.RemoveAt(input_list.Count - 1);

        // Display output and timings
        var totalTokens = (int)generator.TokenCount();
        Console.WriteLine();
        Console.WriteLine($"Streaming Tokens: {totalTokens}, Time: {runTimeInSeconds:0.00}, Tokens per second: {totalTokens / runTimeInSeconds:0.00}");
        Console.WriteLine();

    } while (interactive);
}

Usage Examples

Process Images

# Analyze a single image
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image.jpg \
  --user_prompt "What do you see in this image?"

# Process multiple images
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image1.jpg /path/to/image2.jpg \
  --user_prompt "Compare these two images"

Process Audio

# Transcribe or analyze audio
dotnet run --project ModelMM -- -m /path/to/model \
  --audio_paths /path/to/audio.wav \
  --user_prompt "Transcribe this audio"

Combined Inputs

# Process both images and audio
dotnet run --project ModelMM -- -m /path/to/model \
  --image_paths /path/to/image.jpg \
  --audio_paths /path/to/audio.wav \
  --user_prompt "Describe the image and transcribe the audio"

Interactive Mode

# Interactive mode with prompted inputs
dotnet run --project ModelMM -- -m /path/to/model

How It Works

1. Load Media Inputs

The example loads images and audio files using the GenAI API:
// Load images
(images, num_images) = Common.GetUserImages(imagePaths, interactive);

// Load audios
(audios, num_audios) = Common.GetUserAudios(audioPaths, interactive);

2. Format Content for Model Type

Different models require different input formatting. The example automatically adapts:
var user_content = Common.GetUserContent(
    model.GetModelType(), 
    num_images, 
    num_audios, 
    text
);

Supported Model Types

Phi-3 Vision / Phi-3.5 Vision:
<|image_1|>
<|image_2|>
User prompt text
Phi-4 Multimodal:
<|image_1|>
<|image_2|>
<|audio_1|>
User prompt text
Qwen-2.5 VL / Fara:
<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|>User prompt text
Gemma-3 (Structured Content):
[
  {"type": "image"},
  {"type": "image"},
  {"type": "text", "text": "User prompt text"}
]
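
The formats above can be sketched as a single switch on the model type. This is a hypothetical reconstruction of the role `Common.GetUserContent` plays in the example: the model-type keys (`"phi3v"`, `"phi4mm"`, `"qwen2_5vl"`) are assumed names rather than the library's actual identifiers, and Gemma-3's structured JSON content is omitted for brevity.

```csharp
using System.Text;

// Hypothetical sketch of model-type-specific content formatting. The tag
// strings follow the formats shown above; real identifiers may differ.
static string BuildUserContent(string modelType, int numImages, int numAudios, string text)
{
    var sb = new StringBuilder();
    switch (modelType)
    {
        case "phi3v":      // Phi-3 / Phi-3.5 Vision: numbered image tags
            for (int i = 1; i <= numImages; i++) sb.Append($"<|image_{i}|>\n");
            break;
        case "phi4mm":     // Phi-4 Multimodal: numbered image and audio tags
            for (int i = 1; i <= numImages; i++) sb.Append($"<|image_{i}|>\n");
            for (int i = 1; i <= numAudios; i++) sb.Append($"<|audio_{i}|>\n");
            break;
        case "qwen2_5vl":  // Qwen-2.5 VL / Fara: one pad block per image
            for (int i = 0; i < numImages; i++)
                sb.Append("<|vision_start|><|image_pad|><|vision_end|>");
            break;
    }
    sb.Append(text);
    return sb.ToString();
}
```

For example, `BuildUserContent("phi3v", 2, 0, "What do you see?")` yields the two numbered image tags followed by the prompt text, matching the Phi-3 Vision layout shown above.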

3. Process Multimodal Inputs

The MultiModalProcessor handles encoding of all input types:
using MultiModalProcessor processor = new MultiModalProcessor(model);
using var inputTensors = processor.ProcessImagesAndAudios(prompt, images, audios);
generator.SetInputs(inputTensors);

4. Generate Response

Once inputs are set, token generation works the same as text-only models:
while (true)
{
    generator.GenerateNextToken();
    if (generator.IsDone())
    {
        break;
    }
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}

Key Components

MultiModalProcessor

Processes images and audio into tensors that the model can consume:
using MultiModalProcessor processor = new MultiModalProcessor(model);

Images Class

Loads and manages image inputs:
var images = Images.Load(imagePaths.ToArray());

Audios Class

Loads and manages audio inputs:
var audios = Audios.Load(audioPaths.ToArray());

Media Input Methods

Interactive Mode

When running in interactive mode, the example prompts for file paths:
Image Path (comma separated; leave empty if no image): /path/to/img1.jpg, /path/to/img2.jpg
Audio Path (comma separated; leave empty if no audio): /path/to/audio.wav
Prompt (Use quit() to exit): What's in these files?

Command-Line Arguments

Specify paths directly via command-line arguments:
--image_paths /path/to/image1.jpg /path/to/image2.jpg
--audio_paths /path/to/audio1.wav /path/to/audio2.wav

File Path Formatting

  • Paths can be absolute or relative
  • Multiple paths separated by commas or spaces
  • Paths with spaces should be quoted
  • Files are validated before processing
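
The parsing and validation rules above can be sketched as follows. These helpers are illustrative, not the actual `Common` implementations: one splits a comma-separated input line into trimmed paths (allowing quoted paths with spaces), the other drops any path that does not point at an existing file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Split a comma-separated input line into trimmed paths, stripping
// surrounding quotes so paths containing spaces survive intact.
static List<string> ParseMediaPaths(string input) =>
    input.Split(',', StringSplitOptions.RemoveeEmptyEntries)
         .Select(p => p.Trim().Trim('"'))
         .Where(p => p.Length > 0)
         .ToList();

// Keep only paths that point at existing files, warning on the rest.
static List<string> ValidatePaths(IEnumerable<string> paths)
{
    var valid = new List<string>();
    foreach (var p in paths)
    {
        if (File.Exists(p)) valid.Add(p);
        else Console.Error.WriteLine($"Skipping missing file: {p}");
    }
    return valid;
}
```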

Command-Line Options

Option                 Alias  Description
--model_path           -m     Path to the model directory
--image_paths                 Comma-separated image file paths
--audio_paths                 Comma-separated audio file paths
--execution_provider   -e     Execution provider (cpu, cuda, etc.)
--system_prompt        -sp    System prompt for the conversation
--user_prompt          -up    User prompt text
--verbose              -v     Enable verbose logging
--non_interactive             Run once without interactive loop
--temperature          -t     Sampling temperature
--top_p                -p     Nucleus sampling probability
--max_length           -l     Maximum generation length

Supported Models

The example works with various multimodal models:
  • Phi-3 Vision - Image understanding
  • Phi-3.5 Vision - Enhanced image processing
  • Phi-4 Multimodal - Images and audio
  • Qwen-2.5 VL - Vision-language tasks
  • Fara - Multimodal understanding
  • Gemma-3 - Structured multimodal inputs

Error Handling

The example includes robust error handling:
  • File validation: Checks that media files exist before processing
  • Format detection: Automatically detects and handles different model types
  • Graceful fallbacks: Falls back to text-only if media loading fails
  • User feedback: Clear error messages for invalid inputs

Performance Considerations

  • Media loading: Images and audio are loaded on-demand per query
  • Memory management: Uses using statements for proper disposal
  • Streaming output: Displays tokens as generated for responsive UX
  • Timing metrics: Reports generation speed in tokens per second
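
The throughput figure printed after each generation loop is simply the token count divided by elapsed wall-clock seconds. A minimal sketch of that formatting, matching the line emitted by the `ModelMM` function:

```csharp
// Minimal sketch of the timing line printed after each generation loop:
// total tokens over elapsed wall-clock seconds, two decimal places.
static string FormatThroughput(int totalTokens, double runTimeInSeconds) =>
    $"Streaming Tokens: {totalTokens}, Time: {runTimeInSeconds:0.00}, " +
    $"Tokens per second: {totalTokens / runTimeInSeconds:0.00}";
```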
