Introduction
ONNX Runtime GenAI is a library for running generative AI models with ONNX Runtime. It implements the complete generative AI loop, including:
- Pre- and post-processing
- Inference with ONNX Runtime
- Logits processing
- Search and sampling
- KV cache management
- Grammar specification for tool calling
Key Components
The ONNX Runtime GenAI architecture consists of four primary components that work together to execute the generative AI loop:
Model
Manages the ONNX model, session options, and device configuration
Tokenizer
Handles text encoding/decoding and token stream processing
Generator
Orchestrates the generation loop and manages state
GeneratorParams
Configures search strategies and generation parameters
Model
The Model class (defined in src/models/model.h:145) is responsible for:
- Loading and managing ONNX models from disk or memory
- Creating and configuring ORT sessions with appropriate execution providers
- Managing device allocation (CPU, CUDA, DirectML, etc.)
- Providing tokenizer and processor creation
Tokenizer
The Tokenizer class (defined in src/models/model.h:82) handles:
- Text encoding to token IDs
- Token decoding to text
- Batch encoding/decoding
- Chat template application
- Streaming token decode with TokenizerStream
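A streaming decoder exists because one character can span multiple tokens; decoding each token independently would emit garbage at the boundaries. Here is a toy illustration of the buffering idea (this is not the real TokenizerStream, which operates on vocabulary tokens rather than raw bytes):

```python
class ToyTokenizerStream:
    """Toy illustration of streaming decode: bytes are buffered until they
    form valid UTF-8, mirroring the role TokenizerStream plays for tokens."""

    def __init__(self):
        self._buffer = b""

    def decode(self, token_bytes: bytes) -> str:
        """Append one token's bytes; return text once the buffer is valid UTF-8."""
        self._buffer += token_bytes
        try:
            text = self._buffer.decode("utf-8")
        except UnicodeDecodeError:
            return ""          # incomplete multi-byte sequence: hold until next token
        self._buffer = b""
        return text

stream = ToyTokenizerStream()
# "é" is 0xC3 0xA9 in UTF-8; suppose it is split across two tokens.
pieces = [b"caf", b"\xc3", b"\xa9"]
print("".join(stream.decode(p) for p in pieces))  # café
```

Without the buffering step, the lone `0xC3` byte would have been emitted as a replacement character instead of being held until its continuation byte arrived.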
Generator
The Generator class (defined in src/generators.h:99) is the central orchestrator that:
- Manages the generation state
- Executes the generation loop
- Coordinates between search strategy and model inference
- Handles token appending and sequence management
GeneratorParams
The GeneratorParams class (defined in src/generators.h:71) configures:
- Search parameters (beam size, max length, temperature, etc.)
- Sampling options (top-k, top-p, temperature)
- Device configuration
- Guidance for constrained decoding
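The sampling options can be made concrete with a generic sketch of temperature scaling plus top-k/top-p filtering over a raw logits vector. This is the textbook algorithm, not the library's implementation:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token id from raw logits using temperature, top-k, and top-p.
    A generic sketch of the standard algorithms, not library code."""
    rng = rng or random.Random(0)
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 means "disabled").
    if top_k > 0:
        order = order[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if acc >= r:
            return i
    return kept[-1]

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2, top_p=0.9))
```

Note how the knobs compose: temperature reshapes the distribution first, then top-k and top-p prune it, and the draw happens over the renormalized remainder. With a very low temperature the draw degenerates to greedy selection.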
The Generative AI Loop
The core generation loop follows this pattern:
Initialization
Create Model, GeneratorParams, and Generator instances. Encode the input prompt using the Tokenizer.
Append Input Tokens
Feed the encoded prompt tokens to the Generator using AppendTokens() or AppendTokenSequences().
Generate Loop
Repeatedly call GenerateNextToken() until IsDone() returns true:
- Run Inference: The State executes the model with current tokens
- Get Logits: Extract output logits from the model
- Apply Constraints: Process logits (min length, repetition penalty, guidance)
- Search Strategy: Select next token(s) based on search method
- Update State: Append selected token(s) and update KV cache
- Check Termination: Test for EOS tokens or max length
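The steps above can be sketched as a toy greedy loop. The stub run_model and the simple divide-by-penalty rule are illustrative assumptions, not the library's internals:

```python
def toy_generate(prompt_tokens, run_model, eos_token, max_length,
                 repetition_penalty=1.2):
    """Toy greedy decode mirroring the loop steps above.

    run_model(tokens) returns one logit per vocabulary entry; the stub
    below and the divide-by-penalty rule are illustrative only."""
    tokens = list(prompt_tokens)
    while True:
        # Run Inference / Get Logits.
        logits = run_model(tokens)
        # Apply Constraints: penalize tokens that have already appeared.
        logits = [l / repetition_penalty if i in tokens else l
                  for i, l in enumerate(logits)]
        # Search Strategy: greedy argmax stands in for the search method.
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        # Update State: append the selected token (a real State would
        # also advance the KV cache here).
        tokens.append(next_token)
        # Check Termination: EOS token or max length.
        if next_token == eos_token or len(tokens) >= max_length:
            return tokens

# Stub "model" over a 4-token vocabulary; token 0 plays the role of EOS.
out = toy_generate([1], lambda toks: [0.5, 1.0, 2.0, 3.0],
                   eos_token=0, max_length=5, repetition_penalty=2.0)
print(out)  # [1, 3, 2, 3, 3]
```

The trace shows the constraint step at work: once token 3 has been emitted, its penalized logit drops below token 2's, so the greedy choice alternates instead of repeating forever.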
Component Relationships
The components interact in a hierarchical manner:
- Model creates State instances that manage model execution
- GeneratorParams configures both Generator and Search behavior
- Generator owns State (model execution) and Search (token selection)
- Search manages Sequences (token history for all beams/batches)
- State manages KeyValueCache for efficient autoregressive generation
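The point of the KeyValueCache is that keys and values computed for earlier tokens are stored rather than recomputed, so each decode step only processes the newest token. A toy per-layer sketch (illustrative names, not the library's data structures):

```python
class ToyKeyValueCache:
    """Toy KV cache: one growing list of (key, value) entries per layer.

    Caching lets each decode step compute keys/values only for the newest
    token instead of re-running the projections for the whole sequence."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, key, value):
        self.keys[layer].append(key)
        self.values[layer].append(value)

    def seq_len(self):
        return len(self.keys[0])

cache = ToyKeyValueCache(num_layers=2)
for step in range(3):                      # three decode steps
    for layer in range(2):
        # In a real model these would be tensors computed from the newest token.
        cache.append(layer, key=f"k{step}", value=f"v{step}")
print(cache.seq_len())  # 3
```

Each generation step appends exactly one entry per layer, which is why the cache's sequence length tracks the number of tokens processed so far.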
Code Example
Here's a complete example showing how the components interact:
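A minimal sketch of the interaction, using Python stand-ins whose method names mirror those documented above (AppendTokens(), GenerateNextToken(), IsDone()); the classes are stubs that only imitate the shape of the API, not the real bindings or real inference:

```python
EOS = 0

class Model:
    """Stand-in for Model: would own the ONNX session; here, a stub."""
    def run(self, tokens):
        nxt = tokens[-1] + 1           # pretend inference: count upward
        return EOS if nxt > 5 else nxt

class Tokenizer:
    """Stand-in for Tokenizer: one toy token per word."""
    def encode(self, text):   return [len(word) for word in text.split()]
    def decode(self, tokens): return " ".join(f"<{t}>" for t in tokens)

class GeneratorParams:
    """Stand-in for GeneratorParams: search/generation configuration."""
    def __init__(self, max_length=16):
        self.max_length = max_length

class Generator:
    """Stand-in for Generator: orchestrates the loop and owns token state."""
    def __init__(self, model, params):
        self.model, self.params = model, params
        self.tokens, self.done = [], False

    def AppendTokens(self, tokens):
        self.tokens.extend(tokens)

    def GenerateNextToken(self):
        token = self.model.run(self.tokens)
        self.tokens.append(token)
        if token == EOS or len(self.tokens) >= self.params.max_length:
            self.done = True

    def IsDone(self):
        return self.done

model = Model()
tokenizer = Tokenizer()
params = GeneratorParams(max_length=10)
generator = Generator(model, params)

generator.AppendTokens(tokenizer.encode("hi"))   # [2]
while not generator.IsDone():
    generator.GenerateNextToken()
print(tokenizer.decode(generator.tokens))        # <2> <3> <4> <5> <0>
```

The wiring matches the hierarchy described above: the Generator is constructed from a Model and GeneratorParams, receives prompt tokens from the Tokenizer, and drives the loop until EOS or max length.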
Device Management
ONNX Runtime GenAI supports multiple device types for different components:
- p_device: Primary computation device (CPU, CUDA, DirectML, etc.)
- p_device_inputs: Device for model inputs (may differ from primary for some EPs)
- p_device_kvcache: Device for KV cache storage (typically matches primary device)
Next Steps
Models
Learn about supported model architectures and configuration
Generation
Explore search strategies and generation parameters
KV Cache
Understand KV cache management and optimization
API Reference
Browse the complete API documentation