
Overview

This example demonstrates how to implement a streaming chat application using ONNX Runtime GenAI in C#. The ModelChat function builds an interactive conversational AI that maintains context across multiple turns and streams responses in real time.

Key Features

  • Streaming responses: Tokens are generated and displayed in real time
  • Conversation history: Maintains chat context across multiple turns
  • Rewind capability: Option to reset to the system prompt after each exchange
  • Guided generation: Support for JSON schema and grammar-based output formatting

Complete Implementation

The following code shows the complete ModelChat function that handles streaming chat interactions:
Program.cs
void ModelChat(
    Model model,
    Tokenizer tokenizer,
    TokenizerStream tokenizerStream,
    GeneratorParamsArgs generatorParamsArgs,
    GuidanceArgs guidanceArgs,
    string modelPath,
    string systemPrompt,
    string userPrompt,
    bool interactive,
    bool rewind,
    bool verbose
)
{
    // Set search options for generator params
    using GeneratorParams generatorParams = new GeneratorParams(model);
    Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);

    // Create system message
    var system_message = new Dictionary<string, string>
    {
        { "role", "system" },
        { "content", systemPrompt }
    };

    // Get and set guidance info if requested
    string tools = "";
    if (!string.IsNullOrEmpty(guidanceArgs.response_format))
    {
        Console.WriteLine("Make sure your tool call start id and tool call end id are marked as special in tokenizer.json");
        string guidance_type = "";
        string guidance_data = "";
        (guidance_type, guidance_data, tools) = Common.GetGuidance(
            response_format: guidanceArgs.response_format,
            filepath: guidanceArgs.tools_file,
            text_output: guidanceArgs.text_output,
            tool_output: guidanceArgs.tool_output,
            tool_call_start: guidanceArgs.tool_call_start,
            tool_call_end: guidanceArgs.tool_call_end
        );
        system_message["tools"] = tools;

        generatorParams.SetGuidance(guidance_type, guidance_data);
        if (verbose)
        {
            Console.WriteLine();
            Console.WriteLine($"Guidance type is: {guidance_type}");
            Console.WriteLine($"Guidance data is: \n{guidance_data}");
            Console.WriteLine();
        }
    }

    // Create generator
    using Generator generator = new Generator(model, generatorParams);
    if (verbose) Console.WriteLine("Generator created");

    // Apply chat template
    string prompt = "";
    try
    {
        string messages = JsonSerializer.Serialize(new List<Dictionary<string, string>> { system_message });
        prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: false, tools);
    }
    catch
    {
        prompt = systemPrompt;
    }
    if (verbose) Console.WriteLine($"System prompt: {prompt}\n");

    // Encode system prompt and append tokens to model
    var sequences = tokenizer.Encode(prompt);
    generator.AppendTokenSequences(sequences);
    var system_prompt_length = (int)generator.TokenCount();

    // Streaming Chat
    var prevTotalTokens = 0;
    do
    {
        // Get user prompt
        string user_prompt = Common.GetUserPrompt(userPrompt, interactive);
        if (string.Compare(user_prompt, "quit()", StringComparison.OrdinalIgnoreCase) == 0)
        {
            break;
        }

        // Create user message
        var user_message = new Dictionary<string, string>
        {
            { "role", "user" },
            { "content", user_prompt }
        };

        // Apply chat template
        prompt = "";
        try
        {
            string messages = JsonSerializer.Serialize(new List<Dictionary<string, string>> { user_message });
            prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: true);
        }
        catch
        {
            // Fall back to the raw user input if the chat template cannot be applied
            prompt = user_prompt;
        }
        if (verbose) Console.WriteLine($"User prompt: {prompt}");

        // Encode user prompt and append tokens to model
        sequences = tokenizer.Encode(prompt);
        generator.AppendTokenSequences(sequences);

        // Run generation loop
        if (verbose) Console.WriteLine("Running generation loop...\n");
        Console.Write("Output: ");
        var watch = System.Diagnostics.Stopwatch.StartNew();
        while (true)
        {
            generator.GenerateNextToken();
            if (generator.IsDone())
            {
                break;  // Final token is typically EOS; skip decoding it
            }
            Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
        }
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;

        // Display output and timings
        var totalNewTokens = (int)generator.TokenCount() - prevTotalTokens;
        prevTotalTokens = (int)generator.TokenCount();
        Console.WriteLine();
        Console.WriteLine($"Streaming Tokens: {totalNewTokens}, Time: {runTimeInSeconds:0.00}s, Tokens per second: {totalNewTokens / runTimeInSeconds:0.00}");
        Console.WriteLine();

        if (rewind)
        {
            generator.RewindTo((ulong)system_prompt_length);
            prevTotalTokens = system_prompt_length;  // Token count is back at the system prompt
        }

    } while (interactive);
}

Usage Example

Here’s how to run the ModelChat example:
# Basic usage
dotnet run --project ModelChat -- -m /path/to/model

# With custom prompts
dotnet run --project ModelChat -- -m /path/to/model \
  --system_prompt "You are a helpful AI assistant." \
  --user_prompt "What color is the sky?"

# Enable rewind mode (reset context after each turn)
dotnet run --project ModelChat -- -m /path/to/model --rewind

# Use specific execution provider
dotnet run --project ModelChat -- -m /path/to/model \
  -e cuda --ep_path /path/to/onnxruntime_providers_cuda.dll

# With guidance for structured output
dotnet run --project ModelChat -- -m /path/to/model \
  --response_format json_schema \
  --tools_file tools.json \
  --tool_output

How It Works

1. Initialize Generator

The function creates a Generator object with the specified model and parameters:
using GeneratorParams generatorParams = new GeneratorParams(model);
Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);
using Generator generator = new Generator(model, generatorParams);

2. Process System Prompt

The system prompt is encoded and added to the generator once at the start:
var sequences = tokenizer.Encode(prompt);
generator.AppendTokenSequences(sequences);
var system_prompt_length = (int)generator.TokenCount();

3. Chat Loop

For each user message:
  1. Get user input
  2. Apply chat template
  3. Encode and append to generator
  4. Generate tokens one at a time
  5. Stream decoded tokens to console
  6. Optionally rewind to system prompt

4. Streaming Output

Tokens are decoded and displayed as they’re generated:
while (true)
{
    generator.GenerateNextToken();
    if (generator.IsDone())
    {
        break;
    }
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}

Key Components

GeneratorParams

Controls generation behavior:
  • max_length: Maximum sequence length
  • temperature: Sampling temperature
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • do_sample: Enable random sampling
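In the C# bindings these options are applied through SetSearchOption before the generator is created. A minimal sketch with illustrative values (the model path is a placeholder; tune the numbers for your model):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// "/path/to/model" is a placeholder; values below are illustrative.
using var model = new Model("/path/to/model");
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.7);
generatorParams.SetSearchOption("top_p", 0.9);
generatorParams.SetSearchOption("top_k", 50);
generatorParams.SetSearchOption("do_sample", true);  // bool overload
using var generator = new Generator(model, generatorParams);
```

With do_sample left false, generation is greedy and temperature, top_p, and top_k have no effect.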

TokenizerStream

Handles streaming decoding of tokens as they’re generated, enabling real-time output display.
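A stream is created from the tokenizer (CreateStream in the C# bindings) and fed one token at a time; it buffers partial output so characters spanning multiple tokens are only emitted once they decode to valid text. A minimal sketch (the model path is a placeholder):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model("/path/to/model");  // placeholder path
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 256);
using var generator = new Generator(model, generatorParams);

generator.AppendTokenSequences(tokenizer.Encode("Hello"));
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode may return "" until a complete displayable chunk is available.
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}
```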

Guidance Support

The example supports structured output through guidance:
  • JSON Schema: Enforce JSON structure in responses
  • LARK Grammar: Use grammar rules for output formatting
  • Tool Calling: Generate function calls in specific formats
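In the example these modes are wired up through Common.GetGuidance and GeneratorParams.SetGuidance. A minimal sketch of the JSON-schema case, given a generatorParams created as in step 1 — the "json_schema" type string and the schema itself are illustrative assumptions, not taken from the example's sources:

```csharp
// Illustrative: constrain output to {"answer": "..."} via a JSON schema.
// The "json_schema" type string and the schema below are assumptions;
// check Common.GetGuidance for the formats the example actually emits.
string guidanceData = """
    {
      "type": "object",
      "properties": { "answer": { "type": "string" } },
      "required": ["answer"]
    }
    """;
generatorParams.SetGuidance("json_schema", guidanceData);
```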

Command-Line Options

Option                  Alias  Description
--model_path            -m     Path to the model directory
--execution_provider    -e     Execution provider (cpu, cuda, etc.)
--system_prompt         -sp    System prompt for the conversation
--user_prompt           -up    Initial user prompt (non-interactive mode)
--rewind                -rw    Reset to system prompt after each turn
--verbose               -v     Enable verbose logging
--non_interactive              Run once without interactive loop
--temperature           -t     Sampling temperature
--top_p                 -p     Nucleus sampling probability
--top_k                 -k     Top-k sampling parameter
--max_length            -l     Maximum generation length
