
Overview

This example demonstrates how to implement a streaming chat application using ONNX Runtime GenAI in C#. The ModelChat function builds an interactive conversational AI that maintains context across multiple turns and streams responses in real time.

Key Features

  • Streaming responses: Tokens are generated and displayed in real time
  • Conversation history: Maintains chat context across multiple turns
  • Rewind capability: Option to reset to the system prompt after each exchange
  • Guided generation: Support for JSON schema and grammar-based output formatting

Complete Implementation

The following code shows the complete ModelChat function that handles streaming chat interactions:
Program.cs
void ModelChat(
    Model model,
    Tokenizer tokenizer,
    TokenizerStream tokenizerStream,
    GeneratorParamsArgs generatorParamsArgs,
    GuidanceArgs guidanceArgs,
    string modelPath,
    string systemPrompt,
    string userPrompt,
    bool interactive,
    bool rewind,
    bool verbose
)
{
    // Set search options for generator params
    using GeneratorParams generatorParams = new GeneratorParams(model);
    Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);

    // Create system message
    var system_message = new Dictionary<string, string>
    {
        { "role", "system" },
        { "content", systemPrompt }
    };

    // Get and set guidance info if requested
    string tools = "";
    if (!string.IsNullOrEmpty(guidanceArgs.response_format))
    {
        Console.WriteLine("Make sure your tool call start id and tool call end id are marked as special in tokenizer.json");
        string guidance_type = "";
        string guidance_data = "";
        (guidance_type, guidance_data, tools) = Common.GetGuidance(
            response_format: guidanceArgs.response_format,
            filepath: guidanceArgs.tools_file,
            text_output: guidanceArgs.text_output,
            tool_output: guidanceArgs.tool_output,
            tool_call_start: guidanceArgs.tool_call_start,
            tool_call_end: guidanceArgs.tool_call_end
        );
        system_message["tools"] = tools;

        generatorParams.SetGuidance(guidance_type, guidance_data);
        if (verbose)
        {
            Console.WriteLine();
            Console.WriteLine($"Guidance type is: {guidance_type}");
            Console.WriteLine($"Guidance data is: \n{guidance_data}");
            Console.WriteLine();
        }
    }

    // Create generator
    using Generator generator = new Generator(model, generatorParams);
    if (verbose) Console.WriteLine("Generator created");

    // Apply chat template
    string prompt = "";
    try
    {
        string messages = JsonSerializer.Serialize(new List<Dictionary<string, string>> { system_message });
        prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: false, tools);
    }
    catch
    {
        prompt = systemPrompt;
    }
    if (verbose) Console.WriteLine($"System prompt: {prompt}\n");

    // Encode system prompt and append tokens to model
    var sequences = tokenizer.Encode(prompt);
    generator.AppendTokenSequences(sequences);
    var system_prompt_length = (int)generator.TokenCount();

    // Streaming Chat
    var prevTotalTokens = 0;
    do
    {
        // Get user prompt
        string user_prompt = Common.GetUserPrompt(userPrompt, interactive);
        if (string.Compare(user_prompt, "quit()", StringComparison.OrdinalIgnoreCase) == 0)
        {
            break;
        }

        // Create user message
        var user_message = new Dictionary<string, string>
        {
            { "role", "user" },
            { "content", user_prompt }
        };

        // Apply chat template
        prompt = "";
        try
        {
            string messages = JsonSerializer.Serialize(new List<Dictionary<string, string>> { user_message });
            prompt = Common.ApplyChatTemplate(modelPath, tokenizer, messages, add_generation_prompt: true);
        }
        catch
        {
            // Fall back to the raw user input if the chat template cannot be applied
            prompt = user_prompt;
        }
        if (verbose) Console.WriteLine($"User prompt: {prompt}");

        // Encode user prompt and append tokens to model
        sequences = tokenizer.Encode(prompt);
        generator.AppendTokenSequences(sequences);

        // Run generation loop
        if (verbose) Console.WriteLine("Running generation loop...\n");
        Console.Write("Output: ");
        var watch = System.Diagnostics.Stopwatch.StartNew();
        while (true)
        {
            generator.GenerateNextToken();
            if (generator.IsDone())
            {
                break;  // Final token is typically EOS; skip decoding it
            }
            Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
        }
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;

        // Display output and timings
        var totalNewTokens = (int)generator.TokenCount() - prevTotalTokens;
        prevTotalTokens = (int)generator.TokenCount();
        Console.WriteLine();
        Console.WriteLine($"Streaming Tokens: {totalNewTokens}, Time: {runTimeInSeconds:0.00}s, Tokens per second: {totalNewTokens / runTimeInSeconds:0.00}");
        Console.WriteLine();

        if (rewind)
        {
            generator.RewindTo((ulong)system_prompt_length);
            prevTotalTokens = system_prompt_length;  // Token count is back at the system prompt
        }

    } while (interactive);
}

Usage Example

Here’s how to run the ModelChat example:
# Basic usage
dotnet run --project ModelChat -- -m /path/to/model

# With custom prompts
dotnet run --project ModelChat -- -m /path/to/model \
  --system_prompt "You are a helpful AI assistant." \
  --user_prompt "What color is the sky?"

# Enable rewind mode (reset context after each turn)
dotnet run --project ModelChat -- -m /path/to/model --rewind

# Use specific execution provider
dotnet run --project ModelChat -- -m /path/to/model \
  -e cuda --ep_path /path/to/onnxruntime_providers_cuda.dll

# With guidance for structured output
dotnet run --project ModelChat -- -m /path/to/model \
  --response_format json_schema \
  --tools_file tools.json \
  --tool_output

How It Works

1. Initialize Generator

The function creates a Generator object with the specified model and parameters:
using GeneratorParams generatorParams = new GeneratorParams(model);
Common.SetSearchOptions(generatorParams, generatorParamsArgs, verbose);
using Generator generator = new Generator(model, generatorParams);

2. Process System Prompt

The system prompt is encoded and added to the generator once at the start:
var sequences = tokenizer.Encode(prompt);
generator.AppendTokenSequences(sequences);
var system_prompt_length = (int)generator.TokenCount();

3. Chat Loop

For each user message:
  1. Get user input
  2. Apply chat template
  3. Encode and append to generator
  4. Generate tokens one at a time
  5. Stream decoded tokens to console
  6. Optionally rewind to system prompt

4. Streaming Output

Tokens are decoded and displayed as they’re generated:
while (true)
{
    generator.GenerateNextToken();
    if (generator.IsDone())
    {
        break;
    }
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}

Key Components

GeneratorParams

Controls generation behavior:
  • max_length: Maximum sequence length
  • temperature: Sampling temperature
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • do_sample: Enable random sampling
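In the C# bindings these options are applied through SetSearchOption before the generator is created. A minimal sketch with illustrative values (the model path is a placeholder; tune the numbers for your model):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// "/path/to/model" is a placeholder; values below are illustrative.
using var model = new Model("/path/to/model");
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.7);
generatorParams.SetSearchOption("top_p", 0.9);
generatorParams.SetSearchOption("top_k", 50);
generatorParams.SetSearchOption("do_sample", true);  // bool overload
using var generator = new Generator(model, generatorParams);
```

With do_sample left false, generation is greedy and temperature, top_p, and top_k have no effect.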

TokenizerStream

Handles streaming decoding of tokens as they’re generated, enabling real-time output display.
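A stream is created from the tokenizer (CreateStream in the C# bindings) and fed one token at a time; it buffers partial output so characters spanning multiple tokens are only emitted once they decode to valid text. A minimal sketch (the model path is a placeholder):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model("/path/to/model");  // placeholder path
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 256);
using var generator = new Generator(model, generatorParams);

generator.AppendTokenSequences(tokenizer.Encode("Hello"));
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode may return "" until a complete displayable chunk is available.
    Console.Write(tokenizerStream.Decode(generator.GetNextTokens()[0]));
}
```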

Guidance Support

The example supports structured output through guidance:
  • JSON Schema: Enforce JSON structure in responses
  • LARK Grammar: Use grammar rules for output formatting
  • Tool Calling: Generate function calls in specific formats
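In the example these modes are wired up through Common.GetGuidance and GeneratorParams.SetGuidance. A minimal sketch of the JSON-schema case, given a generatorParams created as in step 1 — the "json_schema" type string and the schema itself are illustrative assumptions, not taken from the example's sources:

```csharp
// Illustrative: constrain output to {"answer": "..."} via a JSON schema.
// The "json_schema" type string and the schema below are assumptions;
// check Common.GetGuidance for the formats the example actually emits.
string guidanceData = """
    {
      "type": "object",
      "properties": { "answer": { "type": "string" } },
      "required": ["answer"]
    }
    """;
generatorParams.SetGuidance("json_schema", guidanceData);
```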

Command-Line Options

Option                  Alias  Description
--model_path            -m     Path to the model directory
--execution_provider    -e     Execution provider (cpu, cuda, etc.)
--system_prompt         -sp    System prompt for the conversation
--user_prompt           -up    Initial user prompt (non-interactive mode)
--rewind                -rw    Reset to system prompt after each turn
--verbose               -v     Enable verbose logging
--non_interactive              Run once without interactive loop
--temperature           -t     Sampling temperature
--top_p                 -p     Nucleus sampling probability
--top_k                 -k     Top-k sampling parameter
--max_length            -l     Maximum generation length
