
Introduction

ONNX Runtime GenAI is a library designed to run generative AI models with ONNX Runtime. It implements the complete generative AI loop, including:
  • Pre- and post-processing
  • Inference with ONNX Runtime
  • Logits processing
  • Search and sampling
  • KV cache management
  • Grammar specification for tool calling
The library provides a high-level API that abstracts away the complexity of running generative models while maintaining flexibility and performance.

Key Components

The ONNX Runtime GenAI architecture consists of four primary components that work together to execute the generative AI loop:

  • Model: Manages the ONNX model, session options, and device configuration
  • Tokenizer: Handles text encoding/decoding and token stream processing
  • Generator: Orchestrates the generation loop and manages state
  • GeneratorParams: Configures search strategies and generation parameters

Model

The Model class (defined in src/models/model.h:145) is responsible for:
  • Loading and managing ONNX models from disk or memory
  • Creating and configuring ORT sessions with appropriate execution providers
  • Managing device allocation (CPU, CUDA, DirectML, etc.)
  • Providing tokenizer and processor creation
struct Model : std::enable_shared_from_this<Model> {
  std::unique_ptr<Config> config_;
  DeviceInterface* p_device_;          // Primary device
  DeviceInterface* p_device_inputs_;   // Device for inputs
  DeviceInterface* p_device_kvcache_;  // Device for KV cache
  SessionInfo session_info_;
};
The Model supports various execution providers including CPU, CUDA, DirectML, TensorRT, OpenVINO, QNN, and WebGPU.

Tokenizer

The Tokenizer class (defined in src/models/model.h:82) handles:
  • Text encoding to token IDs
  • Token decoding to text
  • Batch encoding/decoding
  • Chat template application
  • Streaming token decode with TokenizerStream
struct Tokenizer {
  std::vector<int32_t> Encode(const char* text) const;
  std::string Decode(std::span<const int32_t> tokens) const;
  std::unique_ptr<TokenizerStream> CreateStream() const;
};
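The streaming decoder exists because a single token is often not a complete piece of text (multi-byte characters can span tokens). The buffering pattern behind TokenizerStream can be sketched in pure Python; the vocabulary and the "complete text" rule below are toy stand-ins, not the real tokenizer:

```python
# Toy sketch of the TokenizerStream pattern: buffer decoded token pieces and
# only emit text once a complete unit is formed. Here the (hypothetical)
# completion rule is "piece ends in a space or period".

VOCAB = {0: "Par", 1: "is ", 2: "is the ", 3: "capital ", 4: "of ", 5: "France."}

class ToyTokenizerStream:
    def __init__(self):
        self.pending = ""  # pieces not yet emitted as complete text

    def decode(self, token_id: int) -> str:
        """Return newly completed text for this token, or '' while buffering."""
        self.pending += VOCAB[token_id]
        if self.pending.endswith((" ", ".")):  # toy completion rule
            out, self.pending = self.pending, ""
            return out
        return ""

stream = ToyTokenizerStream()
chunks = [stream.decode(t) for t in [0, 1, 3, 4, 5]]
print("".join(chunks))  # Paris capital of France.
```

Note that the first call returns an empty string: "Par" is held back until the following token completes the word.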

Generator

The Generator class (defined in src/generators.h:99) is the central orchestrator that:
  • Manages the generation state
  • Executes the generation loop
  • Coordinates between search strategy and model inference
  • Handles token appending and sequence management
struct Generator {
  std::shared_ptr<const Model> model_;
  std::unique_ptr<State> state_;        // Model state and inference
  std::unique_ptr<Search> search_;      // Search strategy (greedy/beam)
  
  void AppendTokens(cpu_span<const int32_t> input_ids);
  void GenerateNextToken();
  bool IsDone();
  DeviceSpan<int32_t> GetSequence(size_t index) const;
};
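The division of labor in this struct (State runs the model, Search picks tokens, Generator tracks the sequence and termination) can be mirrored with a toy in pure Python. The model and search below are stand-ins: a fake logits function and greedy selection, not the library's implementation.

```python
# Toy mirror of the Generator interface: a stand-in "model" produces logits,
# greedy "search" picks the next token, and the generator owns the sequence
# and the termination check. All names and values are illustrative.

EOS = 0

def toy_model_logits(sequence):
    # Fake model: prefers the token one below the last one, until EOS.
    logits = [0.0] * 10
    logits[max(sequence[-1] - 1, EOS)] = 1.0
    return logits

class ToyGenerator:
    def __init__(self, max_length=16):
        self.sequence = []
        self.max_length = max_length

    def append_tokens(self, token_ids):
        self.sequence.extend(token_ids)

    def generate_next_token(self):
        logits = toy_model_logits(self.sequence)  # "State" runs inference
        next_token = logits.index(max(logits))    # "Search" picks greedily
        self.sequence.append(next_token)          # update sequence state

    def is_done(self):
        return self.sequence[-1] == EOS or len(self.sequence) >= self.max_length

    def get_sequence(self):
        return list(self.sequence)

gen = ToyGenerator()
gen.append_tokens([5])
while not gen.is_done():
    gen.generate_next_token()
print(gen.get_sequence())  # [5, 4, 3, 2, 1, 0]
```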

GeneratorParams

The GeneratorParams class (defined in src/generators.h:71) configures:
  • Search parameters (beam size, max length, temperature, etc.)
  • Sampling options (top-k, top-p, temperature)
  • Device configuration
  • Guidance for constrained decoding
struct GeneratorParams {
  Config::Search search;  // Search configuration
  int max_batch_size;
  bool use_graph_capture;
  std::string guidance_type;   // e.g., json_schema or regex
  std::string guidance_data;
};
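How the sampling options interact can be shown with a self-contained sketch: temperature rescales the logits, top-k truncates to the k most likely tokens, and top-p keeps the smallest set whose cumulative probability reaches p. This is a generic illustration of those standard techniques, not the library's on-device implementation.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sketch of temperature + top-k + top-p filtering over raw logits."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    # Temperature: divide logits (lower temperature -> sharper distribution).
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities, ranked most-likely first.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(((i, e / total) for i, e in enumerate(exps)),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix reaching cumulative mass p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving mass and sample from it.
    norm = sum(p for _, p in kept)
    r = rng.random() * norm
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_k=2))  # only token 0 or 1 can win
```

With top_k=2, tokens 2 and 3 can never be sampled no matter how the random draw falls; temperature only changes how the remaining mass is split.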

The Generative AI Loop

The core generation loop follows this pattern:
1. Initialization: Create Model, GeneratorParams, and Generator instances. Encode the input prompt using the Tokenizer.
2. Append Input Tokens: Feed the encoded prompt tokens to the Generator using AppendTokens() or AppendTokenSequences().
3. Generate Loop: Repeatedly call GenerateNextToken() until IsDone() returns true:
  1. Run Inference: The State executes the model with current tokens
  2. Get Logits: Extract output logits from the model
  3. Apply Constraints: Process logits (min length, repetition penalty, guidance)
  4. Search Strategy: Select next token(s) based on search method
  5. Update State: Append selected token(s) and update KV cache
  6. Check Termination: Test for EOS tokens or max length
4. Retrieve Output: Get the generated sequence(s) using GetSequence() and decode with the Tokenizer.
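The per-step pipeline inside the Generate Loop can be sketched in pure Python. The logits, penalty value, and EOS token below are toy choices chosen so the loop terminates quickly; the real stages run on-device inside the library.

```python
# Sketch of one GenerateNextToken() step: toy logits, a repetition penalty
# as the "constraints" stage, greedy selection as the "search" stage, and
# an EOS/max-length termination check. All values are illustrative.

EOS_TOKEN = 3
MAX_LENGTH = 8

def run_inference(sequence):
    # Stand-in for the ONNX model: a fixed preference order over 4 tokens.
    return [3.0, 2.5, 2.0, 1.5]

def apply_repetition_penalty(logits, sequence, penalty=3.0):
    out = list(logits)
    for tok in set(sequence):
        out[tok] /= penalty  # discourage tokens already in the sequence
    return out

sequence = [0]  # the "prompt"
while True:
    logits = run_inference(sequence)                     # 1-2. inference + logits
    logits = apply_repetition_penalty(logits, sequence)  # 3. constraints
    next_token = logits.index(max(logits))               # 4. greedy search
    sequence.append(next_token)                          # 5. update state
    if next_token == EOS_TOKEN or len(sequence) >= MAX_LENGTH:
        break                                            # 6. termination
print(sequence)  # [0, 1, 2, 3]
```

Because the penalty keeps knocking down already-used tokens, the greedy argmax walks through fresh tokens until it reaches EOS.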

Component Relationships

The components interact in a hierarchical manner:
  • Model creates State instances that manage model execution
  • GeneratorParams configures both Generator and Search behavior
  • Generator owns State (model execution) and Search (token selection)
  • Search manages Sequences (token history for all beams/batches)
  • State manages KeyValueCache for efficient autoregressive generation
// From src/generators.cpp - Generator creation
Generator::Generator(const Model& model, const GeneratorParams& params)
  : model_{model.shared_from_this()},
    state_{model.CreateState(sequence_lengths, params)},
    search_{CreateSearch(params)} {
  // ...
}
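The KeyValueCache's role in autoregressive generation can be illustrated with a toy cache that grows by one position per step, so each step only has to produce key/value entries for the newest token rather than re-running attention inputs for the whole sequence. Shapes and values here are assumptions for illustration, not the real tensor layout.

```python
# Toy KV cache: per layer, keep lists of past "key"/"value" vectors so each
# generation step only appends entries for the newest token.
NUM_LAYERS = 2

class ToyKVCache:
    def __init__(self):
        self.keys = [[] for _ in range(NUM_LAYERS)]
        self.values = [[] for _ in range(NUM_LAYERS)]

    def append(self, layer, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self):
        return len(self.keys[0])

cache = ToyKVCache()
for step in range(4):  # four generation steps
    for layer in range(NUM_LAYERS):
        # Stand-in for the real per-token key/value tensors.
        cache.append(layer, k=[float(step)], v=[float(step)])

print(cache.seq_len())  # one cached position per generated token: 4
```

This is why cached generation is cheap per step: the work at step n touches one new position instead of all n.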

Code Example

Here’s a complete example showing the component interaction:
import onnxruntime_genai as og

# Create Model
model = og.Model('model_path')

# Create Tokenizer
tokenizer = og.Tokenizer(model)

# Encode input
prompt = "What is the capital of France?"
input_tokens = tokenizer.encode(prompt)

# Configure generation
params = og.GeneratorParams(model)
params.set_search_options(max_length=100, top_k=50, temperature=0.7)

# Create Generator
generator = og.Generator(model, params)

# Run generation loop, streaming tokens as they are generated
generator.append_tokens(input_tokens)
tokenizer_stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

# Get result
output_tokens = generator.get_sequence(0)
output_text = tokenizer.decode(output_tokens)
print(output_text)

Device Management

ONNX Runtime GenAI supports multiple device types for different components:
  • p_device: Primary computation device (CPU, CUDA, DirectML, etc.)
  • p_device_inputs: Device for model inputs (may differ from the primary device for some execution providers)
  • p_device_kvcache: Device for KV cache storage (typically matches primary device)
The library automatically manages memory allocation and transfers between devices based on the execution provider configuration.

Next Steps

  • Models: Learn about supported model architectures and configuration
  • Generation: Explore search strategies and generation parameters
  • KV Cache: Understand KV cache management and optimization
  • API Reference: Browse the complete API documentation
