
Introduction

ONNX Runtime GenAI is a library designed to run generative AI models with ONNX Runtime. It implements the complete generative AI loop, including:
  • Pre- and post-processing
  • Inference with ONNX Runtime
  • Logits processing
  • Search and sampling
  • KV cache management
  • Grammar specification for tool calling
The library provides a high-level API that abstracts away the complexity of running generative models while maintaining flexibility and performance.

Key Components

The ONNX Runtime GenAI architecture consists of four primary components that work together to execute the generative AI loop:

  • Model: Manages the ONNX model, session options, and device configuration
  • Tokenizer: Handles text encoding/decoding and token stream processing
  • Generator: Orchestrates the generation loop and manages state
  • GeneratorParams: Configures search strategies and generation parameters

Model

The Model class (defined in src/models/model.h:145) is responsible for:
  • Loading and managing ONNX models from disk or memory
  • Creating and configuring ORT sessions with appropriate execution providers
  • Managing device allocation (CPU, CUDA, DirectML, etc.)
  • Providing tokenizer and processor creation
struct Model : std::enable_shared_from_this<Model> {
  std::unique_ptr<Config> config_;
  DeviceInterface* p_device_;          // Primary device
  DeviceInterface* p_device_inputs_;   // Device for inputs
  DeviceInterface* p_device_kvcache_;  // Device for KV cache
  SessionInfo session_info_;
};
The Model supports various execution providers including CPU, CUDA, DirectML, TensorRT, OpenVINO, QNN, and WebGPU.

Tokenizer

The Tokenizer class (defined in src/models/model.h:82) handles:
  • Text encoding to token IDs
  • Token decoding to text
  • Batch encoding/decoding
  • Chat template application
  • Streaming token decode with TokenizerStream
struct Tokenizer {
  std::vector<int32_t> Encode(const char* text) const;
  std::string Decode(std::span<const int32_t> tokens) const;
  std::unique_ptr<TokenizerStream> CreateStream() const;
};
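The streaming decoder exists because a single token is often not a complete piece of text (multi-byte characters can span tokens). The buffering pattern behind TokenizerStream can be sketched in pure Python; the vocabulary and the "complete text" rule below are toy stand-ins, not the real tokenizer:

```python
# Toy sketch of the TokenizerStream pattern: buffer decoded token pieces and
# only emit text once a complete unit is formed. Here the (hypothetical)
# completion rule is "piece ends in a space or period".

VOCAB = {0: "Par", 1: "is ", 2: "is the ", 3: "capital ", 4: "of ", 5: "France."}

class ToyTokenizerStream:
    def __init__(self):
        self.pending = ""  # pieces not yet emitted as complete text

    def decode(self, token_id: int) -> str:
        """Return newly completed text for this token, or '' while buffering."""
        self.pending += VOCAB[token_id]
        if self.pending.endswith((" ", ".")):  # toy completion rule
            out, self.pending = self.pending, ""
            return out
        return ""

stream = ToyTokenizerStream()
chunks = [stream.decode(t) for t in [0, 1, 3, 4, 5]]
print("".join(chunks))  # Paris capital of France.
```

Note that the first call returns an empty string: "Par" is held back until the following token completes the word.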

Generator

The Generator class (defined in src/generators.h:99) is the central orchestrator that:
  • Manages the generation state
  • Executes the generation loop
  • Coordinates between search strategy and model inference
  • Handles token appending and sequence management
struct Generator {
  std::shared_ptr<const Model> model_;
  std::unique_ptr<State> state_;        // Model state and inference
  std::unique_ptr<Search> search_;      // Search strategy (greedy/beam)
  
  void AppendTokens(cpu_span<const int32_t> input_ids);
  void GenerateNextToken();
  bool IsDone();
  DeviceSpan<int32_t> GetSequence(size_t index) const;
};
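The division of labor in this struct (State runs the model, Search picks tokens, Generator tracks the sequence and termination) can be mirrored with a toy in pure Python. The model and search below are stand-ins: a fake logits function and greedy selection, not the library's implementation.

```python
# Toy mirror of the Generator interface: a stand-in "model" produces logits,
# greedy "search" picks the next token, and the generator owns the sequence
# and the termination check. All names and values are illustrative.

EOS = 0

def toy_model_logits(sequence):
    # Fake model: prefers the token one below the last one, until EOS.
    logits = [0.0] * 10
    logits[max(sequence[-1] - 1, EOS)] = 1.0
    return logits

class ToyGenerator:
    def __init__(self, max_length=16):
        self.sequence = []
        self.max_length = max_length

    def append_tokens(self, token_ids):
        self.sequence.extend(token_ids)

    def generate_next_token(self):
        logits = toy_model_logits(self.sequence)  # "State" runs inference
        next_token = logits.index(max(logits))    # "Search" picks greedily
        self.sequence.append(next_token)          # update sequence state

    def is_done(self):
        return self.sequence[-1] == EOS or len(self.sequence) >= self.max_length

    def get_sequence(self):
        return list(self.sequence)

gen = ToyGenerator()
gen.append_tokens([5])
while not gen.is_done():
    gen.generate_next_token()
print(gen.get_sequence())  # [5, 4, 3, 2, 1, 0]
```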

GeneratorParams

The GeneratorParams class (defined in src/generators.h:71) configures:
  • Search parameters (beam size, max length, temperature, etc.)
  • Sampling options (top-k, top-p, temperature)
  • Device configuration
  • Guidance for constrained decoding
struct GeneratorParams {
  Config::Search search;  // Search configuration
  int max_batch_size;
  bool use_graph_capture;
  std::string guidance_type;   // e.g., json_schema or regex
  std::string guidance_data;
};
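How the sampling options interact can be shown with a self-contained sketch: temperature rescales the logits, top-k truncates to the k most likely tokens, and top-p keeps the smallest set whose cumulative probability reaches p. This is a generic illustration of those standard techniques, not the library's on-device implementation.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sketch of temperature + top-k + top-p filtering over raw logits."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    # Temperature: divide logits (lower temperature -> sharper distribution).
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities, ranked most-likely first.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(((i, e / total) for i, e in enumerate(exps)),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the smallest prefix reaching cumulative mass p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving mass and sample from it.
    norm = sum(p for _, p in kept)
    r = rng.random() * norm
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_k=2))  # only token 0 or 1 can win
```

With top_k=2, tokens 2 and 3 can never be sampled no matter how the random draw falls; temperature only changes how the remaining mass is split.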

The Generative AI Loop

The core generation loop follows this pattern:
1. Initialization: Create Model, GeneratorParams, and Generator instances. Encode the input prompt using the Tokenizer.
2. Append Input Tokens: Feed the encoded prompt tokens to the Generator using AppendTokens() or AppendTokenSequences().
3. Generate Loop: Repeatedly call GenerateNextToken() until IsDone() returns true:
  1. Run Inference: The State executes the model with current tokens
  2. Get Logits: Extract output logits from the model
  3. Apply Constraints: Process logits (min length, repetition penalty, guidance)
  4. Search Strategy: Select next token(s) based on search method
  5. Update State: Append selected token(s) and update KV cache
  6. Check Termination: Test for EOS tokens or max length
4. Retrieve Output: Get the generated sequence(s) using GetSequence() and decode with the Tokenizer.
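The per-step pipeline inside the Generate Loop can be sketched in pure Python. The logits, penalty value, and EOS token below are toy choices chosen so the loop terminates quickly; the real stages run on-device inside the library.

```python
# Sketch of one GenerateNextToken() step: toy logits, a repetition penalty
# as the "constraints" stage, greedy selection as the "search" stage, and
# an EOS/max-length termination check. All values are illustrative.

EOS_TOKEN = 3
MAX_LENGTH = 8

def run_inference(sequence):
    # Stand-in for the ONNX model: a fixed preference order over 4 tokens.
    return [3.0, 2.5, 2.0, 1.5]

def apply_repetition_penalty(logits, sequence, penalty=3.0):
    out = list(logits)
    for tok in set(sequence):
        out[tok] /= penalty  # discourage tokens already in the sequence
    return out

sequence = [0]  # the "prompt"
while True:
    logits = run_inference(sequence)                     # 1-2. inference + logits
    logits = apply_repetition_penalty(logits, sequence)  # 3. constraints
    next_token = logits.index(max(logits))               # 4. greedy search
    sequence.append(next_token)                          # 5. update state
    if next_token == EOS_TOKEN or len(sequence) >= MAX_LENGTH:
        break                                            # 6. termination
print(sequence)  # [0, 1, 2, 3]
```

Because the penalty keeps knocking down already-used tokens, the greedy argmax walks through fresh tokens until it reaches EOS.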

Component Relationships

The components interact in a hierarchical manner:
  • Model creates State instances that manage model execution
  • GeneratorParams configures both Generator and Search behavior
  • Generator owns State (model execution) and Search (token selection)
  • Search manages Sequences (token history for all beams/batches)
  • State manages KeyValueCache for efficient autoregressive generation
// From src/generators.cpp - Generator creation
Generator::Generator(const Model& model, const GeneratorParams& params)
  : model_{model.shared_from_this()},
    state_{model.CreateState(sequence_lengths, params)},
    search_{CreateSearch(params)} {
  // ...
}
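The KeyValueCache's role in autoregressive generation can be illustrated with a toy cache that grows by one position per step, so each step only has to produce key/value entries for the newest token rather than re-running attention inputs for the whole sequence. Shapes and values here are assumptions for illustration, not the real tensor layout.

```python
# Toy KV cache: per layer, keep lists of past "key"/"value" vectors so each
# generation step only appends entries for the newest token.
NUM_LAYERS = 2

class ToyKVCache:
    def __init__(self):
        self.keys = [[] for _ in range(NUM_LAYERS)]
        self.values = [[] for _ in range(NUM_LAYERS)]

    def append(self, layer, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self):
        return len(self.keys[0])

cache = ToyKVCache()
for step in range(4):  # four generation steps
    for layer in range(NUM_LAYERS):
        # Stand-in for the real per-token key/value tensors.
        cache.append(layer, k=[float(step)], v=[float(step)])

print(cache.seq_len())  # one cached position per generated token: 4
```

This is why cached generation is cheap per step: the work at step n touches one new position instead of all n.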

Code Example

Here’s a complete example showing the component interaction:
import onnxruntime_genai as og

# Create Model
model = og.Model('model_path')

# Create Tokenizer
tokenizer = og.Tokenizer(model)

# Encode input
prompt = "What is the capital of France?"
input_tokens = tokenizer.encode(prompt)

# Configure generation
params = og.GeneratorParams(model)
params.set_search_options(max_length=100, top_k=50, temperature=0.7)

# Create Generator
generator = og.Generator(model, params)

# Run generation loop, streaming tokens as they are generated
generator.append_tokens(input_tokens)
tokenizer_stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)

# Get result
output_tokens = generator.get_sequence(0)
output_text = tokenizer.decode(output_tokens)
print(output_text)

Device Management

ONNX Runtime GenAI supports multiple device types for different components:
  • p_device: Primary computation device (CPU, CUDA, DirectML, etc.)
  • p_device_inputs: Device for model inputs (may differ from the primary device for some execution providers)
  • p_device_kvcache: Device for KV cache storage (typically matches primary device)
The library automatically manages memory allocation and transfers between devices based on the execution provider configuration.

Next Steps

  • Models: Learn about supported model architectures and configuration
  • Generation: Explore search strategies and generation parameters
  • KV Cache: Understand KV cache management and optimization
  • API Reference: Browse the complete API documentation
