Introduction to ONNX Runtime GenAI
ONNX Runtime GenAI provides an easy, flexible, and performant way to run generative AI models on device. It implements the complete generative AI loop for ONNX models, handling all the complexity of inference so you can focus on building applications.

What is ONNX Runtime GenAI?
ONNX Runtime GenAI is a library that runs Large Language Models (LLMs) and other generative AI models with ONNX Runtime. It provides a high-level API that abstracts away the complexities of:

- Pre- and post-processing
- Inference with ONNX Runtime
- Logits processing
- Search and sampling
- KV cache management
- Grammar specification for tool calling
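To make the abstraction concrete, here is a minimal sketch of what the library handles for you, written against the Python API. It assumes `onnxruntime-genai` is installed and that a model folder (with its `genai_config.json`) is available locally; the function name and model path are illustrative, not part of the library.

```python
def generate_text(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Minimal sketch of the generative AI loop with ONNX Runtime GenAI.

    Assumes `pip install onnxruntime-genai` and a model directory produced
    for this library (the path is an assumption for illustration).
    """
    # Deferred import so the sketch can be defined without the package present.
    import onnxruntime_genai as og

    model = og.Model(model_dir)        # loads the ONNX model and its config
    tokenizer = og.Tokenizer(model)    # pre/post-processing handled for you

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)  # search/sampling options

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))  # KV cache managed internally

    # Each step runs inference, logits processing, and sampling for you.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

With a model on disk, usage would be along the lines of `print(generate_text("./phi-3-mini", "What is ONNX?"))`.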
Why Use ONNX Runtime GenAI?
Cross-Platform
Run models on Windows, Linux, macOS, and Android with support for x86, x64, and arm64 architectures.
Hardware Acceleration
Leverage CPU, CUDA, DirectML, TensorRT, OpenVINO, QNN, and WebGPU for optimal performance.
Multiple Languages
Use Python, C#, C/C++, or Java APIs to integrate into your application.
Production Ready
Powers Microsoft products including Foundry Local, Windows ML, and Visual Studio Code AI Toolkit.
Key Capabilities
Supported Model Architectures
ONNX Runtime GenAI supports a wide range of popular model architectures:

- Language Models: Llama, Phi, Mistral, Gemma, Qwen, DeepSeek, Granite, InternLM2, SmolLM3, and more
- Vision Models: Phi-3 Vision, Qwen2-VL
- Speech Models: Whisper
- Other: ChatGLM, ERNIE, Fara, Nemotron, AMD OLMo
Advanced Features
Multi-LoRA Support
Run multiple Low-Rank Adaptation (LoRA) adapters efficiently over a single base model for fine-tuned inference.
Continuous Decoding
Maintain conversation context across multiple turns for chat applications.
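A hedged sketch of what continuous decoding looks like in Python: one `Generator` is created once and reused across turns, so the KV cache (the conversation context) carries over instead of being recomputed. The function name and `max_length` value are illustrative; whether `append_tokens` can follow a completed turn may depend on your library version.

```python
def chat_loop(model_dir: str, turns: list[str]) -> list[str]:
    """Sketch of multi-turn chat via continuous decoding: the Generator and
    its KV cache persist across turns. Names and settings are illustrative."""
    import onnxruntime_genai as og  # deferred; requires onnxruntime-genai

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    generator = og.Generator(model, params)  # created once, not per turn

    replies = []
    for user_text in turns:
        # Append only the new turn; earlier context stays in the KV cache.
        generator.append_tokens(tokenizer.encode(user_text))
        stream = tokenizer.create_stream()
        reply = []
        while not generator.is_done():
            generator.generate_next_token()
            # Decode just the newly sampled token for this turn's reply.
            reply.append(stream.decode(generator.get_next_tokens()[0]))
        replies.append("".join(reply))
    return replies
```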
Constrained Decoding
Generate outputs that conform to specific grammars or JSON schemas for tool calling and structured outputs.
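As a sketch of structured output, the example below constrains generation to a JSON schema. The `set_guidance` call reflects the library's grammar/guidance feature, but its exact name and signature are an assumption here; verify against the API reference for your installed version. The schema and function name are illustrative.

```python
import json


def generate_json(model_dir: str, prompt: str, schema: dict) -> str:
    """Sketch of constrained decoding against a JSON schema.

    The `set_guidance("json_schema", ...)` call is an assumption based on
    the library's guidance feature; check your version's API docs.
    """
    import onnxruntime_genai as og  # deferred; requires onnxruntime-genai

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=512)
    # Constrain sampling so every token keeps the output valid per the schema.
    params.set_guidance("json_schema", json.dumps(schema))

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))
```

The same mechanism could drive tool calling, where the schema describes the tool's argument structure.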
Platform Support
| Platform | Supported |
|---|---|
| Operating Systems | Linux, Windows, macOS, Android |
| Architectures | x86, x64, arm64 |
| Execution Providers | CPU, CUDA, DirectML, TensorRT-RTX, OpenVINO, QNN, WebGPU |
| Languages | Python, C#, C/C++, Java (build from source) |
ONNX Runtime GenAI is actively developed with regular updates. Check the GitHub repository for the latest features and supported models.
Next Steps
Installation
Install ONNX Runtime GenAI for your platform and language
Quickstart
Run your first model in minutes with our quickstart guide