Overview
This example demonstrates how to implement a streaming chat application using ONNX Runtime GenAI in C#. The ModelChat example shows how to build an interactive conversational AI that maintains context across multiple turns and streams responses in real time.
Key Features
- Streaming responses: Tokens are generated and displayed in real-time
- Conversation history: Maintains chat context across multiple turns
- Rewind capability: Option to reset to the system prompt after each exchange
- Guided generation: Support for JSON schema and grammar-based output formatting
Complete Implementation
The following code shows the complete ModelChat function that handles streaming chat interactions:
Program.cs
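The original Program.cs listing is not reproduced here. As an illustrative sketch only, based on the public Microsoft.ML.OnnxRuntimeGenAI C# API — `modelPath`, the option values, and the plain-text input handling are placeholder assumptions, and the real example additionally applies the model's chat template and wires in command-line options — the core of such a function might look like:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

static void ModelChat(string modelPath, string systemPrompt)
{
    using var model = new Model(modelPath);
    using var tokenizer = new Tokenizer(model);
    using var tokenizerStream = tokenizer.CreateStream();

    using var generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 2048);
    generatorParams.SetSearchOption("temperature", 0.7);
    generatorParams.SetSearchOption("top_p", 0.9);
    generatorParams.SetSearchOption("do_sample", true);

    using var generator = new Generator(model, generatorParams);

    // Encode the system prompt once; it stays in the generator's state.
    using var systemTokens = tokenizer.Encode(systemPrompt);
    generator.AppendTokenSequences(systemTokens);
    var systemPromptLength = generator.GetSequence(0).Length;

    while (true)
    {
        Console.Write("User: ");
        var userInput = Console.ReadLine();
        if (string.IsNullOrWhiteSpace(userInput)) break;

        // The real example wraps the input with the model's chat template
        // before encoding; plain encoding is shown here for brevity.
        using var userTokens = tokenizer.Encode(userInput);
        generator.AppendTokenSequences(userTokens);

        // Generate and stream tokens one at a time.
        while (!generator.IsDone())
        {
            generator.GenerateNextToken();
            var seq = generator.GetSequence(0);
            Console.Write(tokenizerStream.Decode(seq[seq.Length - 1]));
        }
        Console.WriteLine();

        // With --rewind enabled, the generator state would be reset here to
        // just the system prompt (e.g. a rewind call with systemPromptLength).
    }
}
```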
Usage Example
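A hypothetical invocation from the example's project directory — the model path is a placeholder, and the flags correspond to the Command-Line Options table below:

```shell
dotnet run -- --model_path ./models/my-model --execution_provider cpu \
    --system_prompt "You are a helpful assistant."
```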
The example is run from the command line, using the options listed under Command-Line Options below.
How It Works
1. Initialize Generator
The function creates a Generator object with the specified model and parameters:
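A minimal sketch of this step, assuming `modelPath` points at a local model directory (the option values are illustrative):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model(modelPath);                  // modelPath: placeholder
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);     // from --max_length
generatorParams.SetSearchOption("temperature", 0.7);     // from --temperature
generatorParams.SetSearchOption("do_sample", true);
using var generator = new Generator(model, generatorParams);
```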
2. Process System Prompt
The system prompt is encoded and added to the generator once at the start.
3. Chat Loop
For each user message:
- Get user input
- Apply chat template
- Encode and append to generator
- Generate tokens one at a time
- Stream decoded tokens to console
- Optionally rewind to system prompt
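The steps above can be sketched as follows, assuming `tokenizer`, `tokenizerStream`, and `generator` were created during initialization. Chat-template handling is simplified to a comment, since the exact ApplyChatTemplate signature varies between package versions:

```csharp
while (true)
{
    Console.Write("User: ");
    var userInput = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userInput)) break;

    // The real example applies the model's chat template to the input
    // before encoding; plain encoding is shown here for brevity.
    using var tokens = tokenizer.Encode(userInput);
    generator.AppendTokenSequences(tokens);

    // Generate tokens one at a time and stream each decoded piece.
    while (!generator.IsDone())
    {
        generator.GenerateNextToken();
        var seq = generator.GetSequence(0);
        Console.Write(tokenizerStream.Decode(seq[seq.Length - 1]));
    }
    Console.WriteLine();

    // With --rewind, the generator state is reset to just the system prompt
    // before the next turn.
}
```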
4. Streaming Output
Tokens are decoded and displayed as they’re generated.
Key Components
GeneratorParams
Controls generation behavior:
- max_length: Maximum sequence length
- temperature: Sampling temperature
- top_p: Nucleus sampling parameter
- top_k: Top-k sampling parameter
- do_sample: Enable random sampling
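Each of these maps to a SetSearchOption call on a GeneratorParams instance (here assumed to be `generatorParams`; the values are illustrative):

```csharp
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.8);
generatorParams.SetSearchOption("top_p", 0.95);
generatorParams.SetSearchOption("top_k", 40);
generatorParams.SetSearchOption("do_sample", true);
```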
TokenizerStream
Handles streaming decoding of tokens as they’re generated, enabling real-time output display.
Guidance Support
The example supports structured output through guidance:
- JSON Schema: Enforce JSON structure in responses
- LARK Grammar: Use grammar rules for output formatting
- Tool Calling: Generate function calls in specific formats
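Guidance is configured on the generator parameters. The method name below (`SetGuidance`) follows the GenAI C API's `OgaGeneratorParamsSetGuidance` and is an assumption about the C# binding in your package version; the schema string is illustrative:

```csharp
// Constrain responses to a JSON object with a string "answer" field.
var schema = "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}";
generatorParams.SetGuidance("json_schema", schema);

// Or constrain output with a LARK grammar instead:
// generatorParams.SetGuidance("lark_grammar", grammarText);
```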
Command-Line Options
| Option | Alias | Description |
|---|---|---|
| --model_path | -m | Path to the model directory |
| --execution_provider | -e | Execution provider (cpu, cuda, etc.) |
| --system_prompt | -sp | System prompt for the conversation |
| --user_prompt | -up | Initial user prompt (non-interactive mode) |
| --rewind | -rw | Reset to system prompt after each turn |
| --verbose | -v | Enable verbose logging |
| --non_interactive | | Run once without interactive loop |
| --temperature | -t | Sampling temperature |
| --top_p | -p | Nucleus sampling probability |
| --top_k | -k | Top-k sampling parameter |
| --max_length | -l | Maximum generation length |
See Also
- C# Multimodal Example - Process images and audio
- Generator API - Complete API reference
- Tokenizer API - Tokenization reference