High-Performance Generative AI with ONNX Runtime
Run LLMs and multi-modal models on any device with ease. A complete inference library with optimized KV cache management, sampling strategies, and hardware acceleration.
Quick Start
Get running with your first generative AI model in minutes
Download a model
Download a pre-optimized ONNX model from Hugging Face.
See Download Models for all available options, including Foundry Local.
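For example, here is a minimal download sketch using the huggingface_hub package; the repo id and file pattern are illustrative, so substitute any model listed in Download Models:

```python
# Hedged sketch: fetch a pre-optimized ONNX model from Hugging Face.
# The repo id and file pattern below are examples; any repo from the
# Download Models page works the same way.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*"],
    local_dir="./phi3-mini",
)
print("Model downloaded to:", model_dir)
```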
Run inference
Generate text with just a few lines of code. Check out complete examples for chat, vision, and audio models.
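As a sketch, the core loop looks like this in Python (the model path matches the download step above; the calls shown follow recent onnxruntime-genai releases and may differ slightly in older ones):

```python
# Hedged sketch of the text-generation loop with the Python bindings.
import onnxruntime_genai as og

model = og.Model("./phi3-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(
    "<|user|>\nWrite a haiku about inference.<|end|>\n<|assistant|>\n"
))

# Stream the answer token by token as it is generated.
stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```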
Explore by Topic
Dive deep into core concepts, guides, and API references
Core Concepts
Understand models, generation strategies, and KV cache management
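As a quick taste before diving in: the generation strategy is controlled per request through search options, while the KV cache is managed for you inside the Generator. A minimal sketch (option names follow the Python bindings; the model path and values are placeholders):

```python
# Hedged sketch: switching generation strategies via search options.
import onnxruntime_genai as og

model = og.Model("path/to/model")  # folder with the ONNX model + genai_config.json
params = og.GeneratorParams(model)

# Greedy decoding: always take the most likely next token.
params.set_search_options(do_sample=False, max_length=128)

# Or nucleus sampling: draw from the smallest token set whose cumulative
# probability exceeds top_p, with logits scaled by temperature.
params.set_search_options(do_sample=True, top_p=0.9, temperature=0.7, max_length=128)

# The Generator reuses cached keys/values across steps, so each
# generate_next_token() call only runs attention for the new token.
generator = og.Generator(model, params)
```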
Multi-Modal
Work with vision models like Phi-Vision, Qwen-VL, and Gemma
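A hedged sketch of the vision path: a multimodal processor combines the text prompt and image into model inputs. The model path and image file below are placeholders, and recent releases set inputs on the Generator (older releases set them on GeneratorParams):

```python
# Hedged sketch: image + text inference with a vision model such as Phi-3 Vision.
import onnxruntime_genai as og

model = og.Model("path/to/vision-model")        # placeholder path
processor = model.create_multimodal_processor()

images = og.Images.open("street_scene.jpg")     # placeholder image file
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

stream = processor.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```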
Hardware Acceleration
Optimize with CUDA, DirectML, OpenVINO, QNN, and WebGPU
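In practice the execution provider is selected by the package you install (for example onnxruntime-genai-cuda or onnxruntime-genai-directml), and recent releases also let you override providers at load time. A hedged sketch, with method names to verify against your installed version:

```python
# Hedged sketch: selecting an execution provider at model-load time.
import onnxruntime_genai as og

config = og.Config("path/to/model")   # placeholder path
config.clear_providers()
config.append_provider("cuda")        # e.g. "cuda" for NVIDIA GPUs; see the
                                      # Hardware Acceleration pages for other EPs

model = og.Model(config)
```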
Python API
Complete Python API reference for all classes and methods
C++ API
Zero-overhead C++ wrapper for high-performance applications
Model Builder
Convert and optimize your own models for ONNX Runtime GenAI
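The model builder runs as a Python module. A hedged sketch of a typical invocation (the Hugging Face model id, output folder, and flag values are examples: -m is the source model, -o the output folder, -p the precision, -e the target execution provider):

```
python -m onnxruntime_genai.models.builder \
    -m microsoft/phi-2 \
    -o ./phi2-onnx \
    -p int4 \
    -e cpu
```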
Key Features
Everything you need to deploy generative AI at scale
Multi-Language Support
Use Python, C++, C#, C, or Java bindings with the same performant core
20+ Model Architectures
Llama, Phi, Gemma, Qwen, Mistral, Whisper, and more out of the box
Multi-Modal Ready
Vision and audio models with built-in preprocessing and feature extraction
Advanced Decoding
Constrained decoding, beam search, Multi-LoRA, and continuous decoding
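As a hedged sketch of two of these features in the Python bindings (option and class names should be verified against your release; the adapter file name is a placeholder):

```python
# Hedged sketch: beam search plus Multi-LoRA adapter switching.
import onnxruntime_genai as og

model = og.Model("path/to/model")    # placeholder path
params = og.GeneratorParams(model)

# Beam search: keep the num_beams highest-scoring partial sequences each step.
params.set_search_options(num_beams=4, early_stopping=True, max_length=128)

generator = og.Generator(model, params)

# Multi-LoRA: load a fine-tuned adapter and activate it for this request.
adapters = og.Adapters(model)
adapters.load("travel_assistant.onnx_adapter", "travel")
generator.set_active_adapter(adapters, "travel")
```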
Ready to Build?
Start deploying high-performance generative AI models on any device with ONNX Runtime GenAI