
Supported Model Architectures

ONNX Runtime GenAI supports a wide range of decoder-only and encoder-decoder transformer architectures. The library is designed to work with models exported to ONNX format with the appropriate optimizations.

Decoder-Only Models

The following decoder-only architectures are currently supported:
  • AMD OLMo - AMD’s open language model
  • ChatGLM - Bilingual conversational AI model
  • DeepSeek - Advanced reasoning models
  • ERNIE 4.5 - Baidu’s enhanced language model
  • Fara - Emerging architecture
  • Gemma - Google’s lightweight models
  • gpt-oss - Open source GPT variants
  • Granite - IBM’s enterprise models
  • InternLM2 - InternLM series models
  • Llama - Meta’s Llama family (2, 3, 3.1, etc.)
  • Mistral - Mistral AI models
  • Nemotron - NVIDIA’s language models
  • Phi - Microsoft’s small language models
  • Qwen - Alibaba’s Qwen series
  • SmolLM3 - Compact language models

Multimodal and Encoder-Decoder Models

  • Phi-3 Vision - Microsoft’s multimodal model
  • Qwen-VL / Qwen2.5-VL - Vision-language models with image understanding
  • Whisper - OpenAI’s speech recognition model (encoder-decoder)

Model Type Detection

The model type is specified in genai_config.json and determines which implementation is used:
{
  "model": {
    "type": "gpt2",  // or "llama", "phi", "whisper", etc.
    // ...
  }
}
The supported types are defined in src/models/model_type.h.
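Because the type field lives in plain JSON, tooling can inspect it without loading the model. A minimal sketch (the function name is illustrative, not part of the library):

```python
import json
from pathlib import Path

def read_model_type(model_dir: str) -> str:
    """Read the architecture identifier from a model directory's genai_config.json."""
    config_path = Path(model_dir) / "genai_config.json"
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return config["model"]["type"]
```

This is useful, for example, to route a directory of exported models to architecture-specific post-processing.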

Model Configuration

Models are configured through a genai_config.json file that specifies the model architecture, token IDs, and runtime settings.

Configuration File Structure

The configuration is defined in src/config.h and includes:
{
  "model": {
    "type": "gpt2",
    "pad_token_id": 50256,
    "bos_token_id": 50256,
    "eos_token_id": 50256,
    "vocab_size": 50257,
    "context_length": 1024,
    "decoder": {
      "filename": "decoder_model.onnx",
      "num_hidden_layers": 12,
      "num_key_value_heads": 12,
      "head_size": 64
    }
  }
}

Key Configuration Sections

Model Section

From src/config.h:102:
  • type: Model architecture identifier
  • pad_token_id: Padding token ID
  • eos_token_id: End-of-sequence token ID(s) - can be single value or array
  • bos_token_id: Beginning-of-sequence token ID
  • vocab_size: Size of the vocabulary
  • context_length: Maximum sequence length the model supports
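Since eos_token_id may be stored as either a single integer or an array, code that consumes the config should normalize both forms. A small sketch (the helper name is illustrative, not part of the library):

```python
def normalize_eos_ids(eos_token_id):
    """Return eos_token_id as a list, whether the config stores one ID or several."""
    if isinstance(eos_token_id, list):
        return eos_token_id
    return [eos_token_id]
```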

Decoder Section

From src/config.h:217:
  • filename: Path to the ONNX model file
  • num_hidden_layers: Number of transformer layers
  • num_key_value_heads: Number of KV heads (for GQA/MQA)
  • num_attention_heads: Number of attention heads
  • head_size: Dimension of each attention head
  • session_options: ORT session configuration
  • inputs/outputs: Custom input/output name mappings

Session Options

From src/config.h:80:
"session_options": {
  "log_severity_level": 3,
  "enable_profiling": "profile_output.json",
  "graph_optimization_level": 99,  // ORT_ENABLE_ALL
  "provider_options": [
    {
      "cuda": {
        "device_id": "0",
        "enable_cuda_graph": "1",
        "gpu_mem_limit": "4294967296"
      }
    }
  ]
}
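Note that provider option values are strings, and provider_options is a list of single-provider objects. A sketch of programmatically editing a genai_config-style dict before writing it back to disk (the helper is hypothetical, not a library function):

```python
def set_provider_option(config: dict, provider: str, key: str, value: str) -> dict:
    """Insert or update a provider option inside model.decoder.session_options.provider_options."""
    opts = (config.setdefault("model", {})
                  .setdefault("decoder", {})
                  .setdefault("session_options", {})
                  .setdefault("provider_options", []))
    for entry in opts:
        if provider in entry:
            # Provider already present: update its option map in place.
            entry[provider][key] = value
            return config
    # Provider not present yet: append a new single-provider entry.
    opts.append({provider: {key: value}})
    return config
```

For runtime (rather than on-disk) changes, see the og.Config API shown under Advanced Model Loading below.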

Multimodal Models

For vision-language models, additional configuration sections are required:
{
  "model": {
    "type": "phi3v",
    "vision": {
      "filename": "vision_encoder.onnx",
      "inputs": {
        "pixel_values": "pixel_values",
        "image_sizes": "image_sizes"
      },
      "outputs": {
        "image_features": "image_features"
      }
    },
    "embedding": {
      "filename": "embedding_model.onnx",
      "inputs": {
        "input_ids": "input_ids",
        "image_features": "image_features"
      },
      "outputs": {
        "embeddings": "inputs_embeds"
      }
    },
    "decoder": {
      "filename": "decoder_model.onnx",
      "inputs": {
        "embeddings": "inputs_embeds"
      }
    }
  }
}

Model Loading and Management

Loading a Model

Models are loaded through the Model::Create API (defined in src/generators.h:163):
import onnxruntime_genai as og

# Load from directory containing genai_config.json
model = og.Model('path/to/model')

# Get model information
model_type = model.get_type()
device_type = model.get_device_type()

Advanced Model Loading

For more control over model loading:
import onnxruntime_genai as og

# Create custom config
config = og.Config('path/to/model')

# Modify execution providers
config.clear_providers()
config.append_provider('cuda')
config.set_provider_option('cuda', 'device_id', '0')

# Load model with config
model = og.Model(config)

Loading from Memory

Models can be loaded from memory buffers instead of files:
auto config = OgaConfig::Create("path/to/config");

// Add model data from memory
std::vector<std::byte> model_data = LoadModelFromSource();
config->AddModelData("decoder_model.onnx", model_data);

auto model = OgaModel::Create(*config);

Model Input/Output Naming

ONNX Runtime GenAI uses a flexible naming system for model inputs and outputs (from src/config.h:14):

Default Names

// Decoder inputs
"input_ids"                  // Token IDs
"attention_mask"             // Attention mask
"position_ids"               // Position IDs
"past_key_values.%d.key"     // KV cache keys
"past_key_values.%d.value"   // KV cache values

// Decoder outputs
"logits"                     // Output logits
"present.%d.key"             // Updated KV cache keys
"present.%d.value"           // Updated KV cache values
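The %d placeholder in the cache names is substituted with the layer index, so a 12-layer model binds past_key_values.0.key through past_key_values.11.key. The expansion can be sketched as:

```python
def expand_cache_names(template: str, num_layers: int) -> list:
    """Expand a templated cache name like 'past_key_values.%d.key', one per layer."""
    return [template % layer for layer in range(num_layers)]
```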

Custom Naming

You can override default names in the config:
{
  "model": {
    "decoder": {
      "inputs": {
        "input_ids": "tokens",
        "past_names": "cache_%d"  // Combined key/value cache
      },
      "outputs": {
        "logits": "scores",
        "present_names": "new_cache_%d"
      }
    }
  }
}

Model Optimization and Quantization

Quantization Support

ONNX Runtime GenAI supports various quantization formats:
  • INT4 - 4-bit integer quantization (RTN, AWQ)
  • INT8 - 8-bit integer quantization
  • FP16 - Half precision floating point
  • FP32 - Full precision floating point
Quantized models are created using the model builder tools (see src/python/py/models/builder.py).

Graph Optimizations

Set optimization level in session options:
{
  "decoder": {
    "session_options": {
      "graph_optimization_level": 99  // ORT_ENABLE_ALL
    }
  }
}
Optimization levels:
  • 1 - ORT_DISABLE_ALL
  • 2 - ORT_ENABLE_BASIC
  • 3 - ORT_ENABLE_EXTENDED
  • 99 - ORT_ENABLE_ALL (recommended)

CUDA Graph Capture

For CUDA execution provider, enable graph capture for better performance:
{
  "search": {
    "past_present_share_buffer": true
  },
  "decoder": {
    "session_options": {
      "provider_options": [
        {
          "cuda": {
            "enable_cuda_graph": "1"
          }
        }
      ]
    }
  }
}
Requirements (from src/generators.h:96):
  • enable_cuda_graph must be set in the provider options
  • num_beams must be 1 (Whisper models are the exception)
  • The CUDA execution provider must be active

Pipeline Models

Some models use a pipeline architecture with multiple ONNX files (from src/config.h:269):
{
  "model": {
    "decoder": {
      "pipeline": [
        {
          "filename": "stage1.onnx",
          "model_id": "stage1",
          "inputs": ["input_ids", "attention_mask"],
          "outputs": ["hidden_states"],
          "run_on_prompt": true,
          "run_on_token_gen": true
        },
        {
          "filename": "stage2.onnx",
          "model_id": "stage2",
          "inputs": ["hidden_states"],
          "outputs": ["logits"],
          "is_lm_head": true
        }
      ]
    }
  }
}
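The run_on_prompt and run_on_token_gen flags control which stages execute during prompt processing versus incremental token generation. A sketch of the selection logic, assuming stages default to running in both phases when the flags are omitted (an assumption; check src/config.h for the actual defaults):

```python
def stages_for_phase(pipeline: list, phase: str) -> list:
    """Return the model_ids of pipeline stages that run during the given phase
    ('prompt' or 'token_gen'). Stages are assumed to run in both phases by default."""
    key = "run_on_prompt" if phase == "prompt" else "run_on_token_gen"
    return [stage["model_id"] for stage in pipeline if stage.get(key, True)]
```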

Model State Management

The State class (from src/models/model.h:24) manages model execution:
  • Maintains ORT session and I/O bindings
  • Manages KV cache lifecycle
  • Handles adapter switching (for LoRA)
  • Supports continuous decoding (rewinding)
struct State {
  virtual DeviceSpan<float> Run(int total_length, 
                                DeviceSpan<int32_t>& next_tokens,
                                DeviceSpan<int32_t> next_indices = {}) = 0;
  virtual void RewindTo(size_t index);  // For continuous decoding
  virtual OrtValue* GetInput(const char* name);
  virtual OrtValue* GetOutput(const char* name);
};

Next Steps

Generation

Learn about generation strategies and parameters

KV Cache

Understand KV cache management

Install

Install ONNX Runtime GenAI

Model Builder

Build and optimize your own models
