Supported Model Architectures
ONNX Runtime GenAI supports a wide range of decoder-only and encoder-decoder transformer architectures. The library is designed to work with models exported to ONNX format with the appropriate optimizations.
Decoder-Only Models
These architectures are currently supported (as of the current source code):
Language Models
- AMD OLMo - AMD’s open language model
- ChatGLM - Bilingual conversational AI model
- DeepSeek - Advanced reasoning models
- ERNIE 4.5 - Baidu’s enhanced language model
- Fara - Emerging architecture
- Gemma - Google’s lightweight models
- gpt-oss - Open source GPT variants
- Granite - IBM’s enterprise models
- InternLM2 - InternLM series models
- Llama - Meta’s Llama family (2, 3, 3.1, etc.)
- Mistral - Mistral AI models
- Nemotron - NVIDIA’s language models
- Phi - Microsoft’s small language models
- Qwen - Alibaba’s Qwen series
- SmolLM3 - Compact language models
Vision-Language Models
- Phi-3 Vision - Microsoft’s multimodal model
- Qwen-VL / Qwen2.5-VL - Vision-language models with image understanding
Audio Models
- Whisper - OpenAI’s speech recognition model
Model Type Detection
The model type is specified in `genai_config.json` and determines which implementation is used; the supported type strings are enumerated in `src/models/model_type.h`.
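As an illustration of how the `type` string drives dispatch, here is a hypothetical sketch in Python. The type names and category sets below are examples only; the authoritative mapping lives in `src/models/model_type.h`.

```python
# Hypothetical sketch: the "model.type" string in genai_config.json selects
# which model implementation handles the graph. Sets here are illustrative.
import json

DECODER_ONLY = {"llama", "mistral", "phi", "qwen2", "gemma"}
VISION_LANGUAGE = {"phi3v", "qwen2_vl"}
AUDIO = {"whisper"}

def detect_model_kind(config_text: str) -> str:
    """Return a coarse model category for the given genai_config.json text."""
    model_type = json.loads(config_text)["model"]["type"]
    if model_type in DECODER_ONLY:
        return "decoder-only"
    if model_type in VISION_LANGUAGE:
        return "vision-language"
    if model_type in AUDIO:
        return "audio"
    raise ValueError(f"unsupported model type: {model_type}")

print(detect_model_kind('{"model": {"type": "llama"}}'))  # prints "decoder-only"
```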
Model Configuration
Models are configured through a `genai_config.json` file that specifies the model architecture, token IDs, and runtime settings.
Configuration File Structure
The configuration schema is defined in `src/config.h` and includes the sections below.
Key Configuration Sections
Model Section
From `src/config.h:102`:
- type: Model architecture identifier
- pad_token_id: Padding token ID
- eos_token_id: End-of-sequence token ID(s) - can be single value or array
- bos_token_id: Beginning-of-sequence token ID
- vocab_size: Size of the vocabulary
- context_length: Maximum sequence length the model supports
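A minimal sketch of the model section, with illustrative values (note that `eos_token_id` may be given as an array):

```json
{
  "model": {
    "type": "llama",
    "pad_token_id": 0,
    "bos_token_id": 1,
    "eos_token_id": [2, 32000],
    "vocab_size": 32064,
    "context_length": 4096
  }
}
```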
Decoder Section
From `src/config.h:217`:
- filename: Path to the ONNX model file
- num_hidden_layers: Number of transformer layers
- num_key_value_heads: Number of KV heads (for GQA/MQA)
- num_attention_heads: Number of attention heads
- head_size: Dimension of each attention head
- session_options: ORT session configuration
- inputs/outputs: Custom input/output name mappings
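A sketch of the decoder section with representative values (layer and head counts here are illustrative, not defaults):

```json
{
  "model": {
    "decoder": {
      "filename": "model.onnx",
      "head_size": 128,
      "num_attention_heads": 32,
      "num_key_value_heads": 8,
      "num_hidden_layers": 32
    }
  }
}
```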
Session Options
From `src/config.h:80`:
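A representative sketch of the `session_options` block (field names shown are common options; consult the schema in `src/config.h` for the full set):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "log_id": "onnxruntime-genai",
        "intra_op_num_threads": 4,
        "provider_options": [
          { "cuda": {} }
        ]
      }
    }
  }
}
```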
Multimodal Models
For vision-language models, additional configuration sections are required.
Model Loading and Management
Loading a Model
Models are loaded through the `Model::Create` API (defined in `src/generators.h:163`):
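The same loading path is exposed through the Python bindings. A minimal sketch, assuming the `onnxruntime_genai` package is installed and `path/to/model_dir` is a placeholder for a folder containing `genai_config.json` and the ONNX file(s):

```python
import onnxruntime_genai as og

# Loading the model reads genai_config.json and creates the ORT session(s).
model = og.Model("path/to/model_dir")
tokenizer = og.Tokenizer(model)

# Configure generation and run a simple decode loop.
params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello, world"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```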
Advanced Model Loading
For more control over model loading, see the options below.
Loading from Memory
Models can be loaded from memory buffers instead of files.
Model Input/Output Naming
ONNX Runtime GenAI uses a flexible naming system for model inputs and outputs (from `src/config.h:14`):
Default Names
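A representative subset of the default input/output names (the `%d` placeholder is expanded per layer; see `src/config.h:14` for the authoritative list):

```json
{
  "inputs": {
    "input_ids": "input_ids",
    "attention_mask": "attention_mask",
    "position_ids": "position_ids",
    "past_key_names": "past_key_values.%d.key",
    "past_value_names": "past_key_values.%d.value"
  },
  "outputs": {
    "logits": "logits",
    "present_key_names": "present.%d.key",
    "present_value_names": "present.%d.value"
  }
}
```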
Custom Naming
You can override the default names in the config.
Model Optimization and Quantization
Quantization Support
ONNX Runtime GenAI supports various quantization formats:
- INT4 - 4-bit integer quantization (RTN, AWQ)
- INT8 - 8-bit integer quantization
- FP16 - Half precision floating point
- FP32 - Full precision floating point
Quantized models can be produced with the model builder (`src/python/py/models/builder.py`).
Graph Optimizations
Set the graph optimization level in session options:
- 0 - ORT_DISABLE_ALL
- 1 - ORT_ENABLE_BASIC
- 2 - ORT_ENABLE_EXTENDED
- 99 - ORT_ENABLE_ALL (recommended)
CUDA Graph Capture
For the CUDA execution provider, enable graph capture for better performance (from `src/generators.h:96`). Requirements:
- Must be enabled in the config
- Only works with `num_beams=1` (or Whisper models)
- Requires the CUDA execution provider
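As a sketch, graph capture can be requested through the CUDA execution provider's `enable_cuda_graph` option in `session_options` (the exact placement shown here is illustrative):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "cuda": { "enable_cuda_graph": "1" } }
        ]
      }
    }
  }
}
```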
Pipeline Models
Some models use a pipeline architecture with multiple ONNX files (from `src/config.h:269`):
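A hypothetical sketch of a pipeline configuration; the stage names and filenames below are purely illustrative, and the authoritative structure is defined in `src/config.h:269`:

```json
{
  "model": {
    "decoder": {
      "pipeline": [
        { "embeddings": { "filename": "embeddings.onnx" } },
        { "transformer": { "filename": "transformer.onnx" } },
        { "lm_head": { "filename": "lm_head.onnx" } }
      ]
    }
  }
}
```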
Model State Management
The `State` class (from `src/models/model.h:24`) manages model execution:
- Maintains ORT session and I/O bindings
- Manages KV cache lifecycle
- Handles adapter switching (for LoRA)
- Supports continuous decoding (rewinding)
Next Steps
Generation
Learn about generation strategies and parameters
KV Cache
Understand KV cache management
Install
Install ONNX Runtime GenAI
Model Builder
Build and optimize your own models