Supported Model Architectures
ONNX Runtime GenAI supports a wide range of decoder-only and encoder-decoder transformer architectures. The library is designed to work with models exported to ONNX format with the appropriate optimizations.
Decoder-Only Models
These architectures are currently supported (as of the current source code):
Language Models
- AMD OLMo - AMD’s open language model
- ChatGLM - Bilingual conversational AI model
- DeepSeek - Advanced reasoning models
- ERNIE 4.5 - Baidu’s enhanced language model
- Fara - Emerging architecture
- Gemma - Google’s lightweight models
- gpt-oss - Open source GPT variants
- Granite - IBM’s enterprise models
- InternLM2 - InternLM series models
- Llama - Meta’s Llama family (2, 3, 3.1, etc.)
- Mistral - Mistral AI models
- Nemotron - NVIDIA’s language models
- Phi - Microsoft’s small language models
- Qwen - Alibaba’s Qwen series
- SmolLM3 - Compact language models
Vision-Language Models
- Phi-3 Vision - Microsoft’s multimodal model
- Qwen-VL / Qwen2.5-VL - Vision-language models with image understanding
Audio Models
- Whisper - OpenAI’s speech recognition model
Model Type Detection
The model type is specified in `genai_config.json` and determines which implementation is used; the supported type strings are enumerated in `src/models/model_type.h`.
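As an illustration of how the `type` string drives dispatch, here is a hypothetical sketch in Python. The type names and category sets below are examples only; the authoritative mapping lives in `src/models/model_type.h`.

```python
# Hypothetical sketch: the "model.type" string in genai_config.json selects
# which model implementation handles the graph. Sets here are illustrative.
import json

DECODER_ONLY = {"llama", "mistral", "phi", "qwen2", "gemma"}
VISION_LANGUAGE = {"phi3v", "qwen2_vl"}
AUDIO = {"whisper"}

def detect_model_kind(config_text: str) -> str:
    """Return a coarse model category for the given genai_config.json text."""
    model_type = json.loads(config_text)["model"]["type"]
    if model_type in DECODER_ONLY:
        return "decoder-only"
    if model_type in VISION_LANGUAGE:
        return "vision-language"
    if model_type in AUDIO:
        return "audio"
    raise ValueError(f"unsupported model type: {model_type}")

print(detect_model_kind('{"model": {"type": "llama"}}'))  # prints "decoder-only"
```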
Model Configuration
Models are configured through a `genai_config.json` file that specifies the model architecture, token IDs, and runtime settings.
Configuration File Structure
The configuration schema is defined in `src/config.h` and includes the sections below.
Key Configuration Sections
Model Section
From `src/config.h:102`:
- type: Model architecture identifier
- pad_token_id: Padding token ID
- eos_token_id: End-of-sequence token ID(s) - can be single value or array
- bos_token_id: Beginning-of-sequence token ID
- vocab_size: Size of the vocabulary
- context_length: Maximum sequence length the model supports
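A minimal sketch of the model section, with illustrative values (note that `eos_token_id` may be given as an array):

```json
{
  "model": {
    "type": "llama",
    "pad_token_id": 0,
    "bos_token_id": 1,
    "eos_token_id": [2, 32000],
    "vocab_size": 32064,
    "context_length": 4096
  }
}
```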
Decoder Section
From `src/config.h:217`:
- filename: Path to the ONNX model file
- num_hidden_layers: Number of transformer layers
- num_key_value_heads: Number of KV heads (for GQA/MQA)
- num_attention_heads: Number of attention heads
- head_size: Dimension of each attention head
- session_options: ORT session configuration
- inputs/outputs: Custom input/output name mappings
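A sketch of the decoder section with representative values (layer and head counts here are illustrative, not defaults):

```json
{
  "model": {
    "decoder": {
      "filename": "model.onnx",
      "head_size": 128,
      "num_attention_heads": 32,
      "num_key_value_heads": 8,
      "num_hidden_layers": 32
    }
  }
}
```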
Session Options
From `src/config.h:80`:
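A representative sketch of the `session_options` block (field names shown are common options; consult the schema in `src/config.h` for the full set):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "log_id": "onnxruntime-genai",
        "intra_op_num_threads": 4,
        "provider_options": [
          { "cuda": {} }
        ]
      }
    }
  }
}
```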
Multimodal Models
For vision-language models, additional configuration sections are required.
Model Loading and Management
Loading a Model
Models are loaded through the `Model::Create` API (defined in `src/generators.h:163`):
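The same loading path is exposed through the Python bindings. A minimal sketch, assuming the `onnxruntime_genai` package is installed and `path/to/model_dir` is a placeholder for a folder containing `genai_config.json` and the ONNX file(s):

```python
import onnxruntime_genai as og

# Loading the model reads genai_config.json and creates the ORT session(s).
model = og.Model("path/to/model_dir")
tokenizer = og.Tokenizer(model)

# Configure generation and run a simple decode loop.
params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello, world"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```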
Advanced Model Loading
For more control over model loading, see the options below.
Loading from Memory
Models can be loaded from memory buffers instead of files.
Model Input/Output Naming
ONNX Runtime GenAI uses a flexible naming system for model inputs and outputs (from `src/config.h:14`):
Default Names
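A representative subset of the default input/output names (the `%d` placeholder is expanded per layer; see `src/config.h:14` for the authoritative list):

```json
{
  "inputs": {
    "input_ids": "input_ids",
    "attention_mask": "attention_mask",
    "position_ids": "position_ids",
    "past_key_names": "past_key_values.%d.key",
    "past_value_names": "past_key_values.%d.value"
  },
  "outputs": {
    "logits": "logits",
    "present_key_names": "present.%d.key",
    "present_value_names": "present.%d.value"
  }
}
```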
Custom Naming
You can override the default names in the config.
Model Optimization and Quantization
Quantization Support
ONNX Runtime GenAI supports various quantization formats:
- INT4 - 4-bit integer quantization (RTN, AWQ)
- INT8 - 8-bit integer quantization
- FP16 - Half precision floating point
- FP32 - Full precision floating point
Quantized models can be produced with the model builder (`src/python/py/models/builder.py`).
Graph Optimizations
Set the graph optimization level in session options:
- 0 - ORT_DISABLE_ALL
- 1 - ORT_ENABLE_BASIC
- 2 - ORT_ENABLE_EXTENDED
- 99 - ORT_ENABLE_ALL (recommended)
CUDA Graph Capture
For the CUDA execution provider, enable graph capture for better performance (from `src/generators.h:96`). Requirements:
- Must be enabled in the config
- Only works with `num_beams=1` (or Whisper models)
- Requires the CUDA execution provider
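As a sketch, graph capture can be requested through the CUDA execution provider's `enable_cuda_graph` option in `session_options` (the exact placement shown here is illustrative):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "cuda": { "enable_cuda_graph": "1" } }
        ]
      }
    }
  }
}
```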
Pipeline Models
Some models use a pipeline architecture with multiple ONNX files (from `src/config.h:269`):
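A hypothetical sketch of a pipeline configuration; the stage names and filenames below are purely illustrative, and the authoritative structure is defined in `src/config.h:269`:

```json
{
  "model": {
    "decoder": {
      "pipeline": [
        { "embeddings": { "filename": "embeddings.onnx" } },
        { "transformer": { "filename": "transformer.onnx" } },
        { "lm_head": { "filename": "lm_head.onnx" } }
      ]
    }
  }
}
```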
Model State Management
The `State` class (from `src/models/model.h:24`) manages model execution:
- Maintains ORT session and I/O bindings
- Manages KV cache lifecycle
- Handles adapter switching (for LoRA)
- Supports continuous decoding (rewinding)
Next Steps
Generation
Learn about generation strategies and parameters
KV Cache
Understand KV cache management
Install
Install ONNX Runtime GenAI
Model Builder
Build and optimize your own models