Introduction to ONNX Runtime GenAI
ONNX Runtime GenAI provides an easy, flexible, and performant way to run generative AI models on device. It implements the complete generative AI loop for ONNX models, handling all the complexity of inference so you can focus on building applications.

What is ONNX Runtime GenAI?
ONNX Runtime GenAI is a library that runs Large Language Models (LLMs) and other generative AI models with ONNX Runtime. It provides a high-level API that abstracts away the complexities of:

- Pre- and post-processing
- Inference with ONNX Runtime
- Logits processing
- Search and sampling
- KV cache management
- Grammar specification for tool calling
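To make the abstraction concrete, here is a minimal sketch of what the library handles for you, written against the Python API. It assumes `onnxruntime-genai` is installed and that a model folder (with its `genai_config.json`) is available locally; the function name and model path are illustrative, not part of the library.

```python
def generate_text(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Minimal sketch of the generative AI loop with ONNX Runtime GenAI.

    Assumes `pip install onnxruntime-genai` and a model directory produced
    for this library (the path is an assumption for illustration).
    """
    # Deferred import so the sketch can be defined without the package present.
    import onnxruntime_genai as og

    model = og.Model(model_dir)        # loads the ONNX model and its config
    tokenizer = og.Tokenizer(model)    # pre/post-processing handled for you

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)  # search/sampling options

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))  # KV cache managed internally

    # Each step runs inference, logits processing, and sampling for you.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

With a model on disk, usage would be along the lines of `print(generate_text("./phi-3-mini", "What is ONNX?"))`.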
Why Use ONNX Runtime GenAI?
Cross-Platform
Run models on Windows, Linux, macOS, and Android with support for x86, x64, and arm64 architectures.
Hardware Acceleration
Leverage CPU, CUDA, DirectML, TensorRT, OpenVINO, QNN, and WebGPU for optimal performance.
Multiple Languages
Use Python, C#, C/C++, or Java APIs to integrate into your application.
Production Ready
Powers Microsoft products including Foundry Local, Windows ML, and Visual Studio Code AI Toolkit.
Key Capabilities
Supported Model Architectures
ONNX Runtime GenAI supports a wide range of popular model architectures:

- Language Models: Llama, Phi, Mistral, Gemma, Qwen, DeepSeek, Granite, InternLM2, SmolLM3, and more
- Vision Models: Phi-3 Vision, Qwen2-VL
- Speech Models: Whisper
- Other: ChatGLM, ERNIE, Fara, Nemotron, AMD OLMo
Advanced Features
Multi-LoRA Support
Run multiple Low-Rank Adaptation (LoRA) adapters efficiently over a single base model for fine-tuned inference.
Continuous Decoding
Maintain conversation context across multiple turns for chat applications.
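A hedged sketch of what continuous decoding looks like in Python: one `Generator` is created once and reused across turns, so the KV cache (the conversation context) carries over instead of being recomputed. The function name and `max_length` value are illustrative; whether `append_tokens` can follow a completed turn may depend on your library version.

```python
def chat_loop(model_dir: str, turns: list[str]) -> list[str]:
    """Sketch of multi-turn chat via continuous decoding: the Generator and
    its KV cache persist across turns. Names and settings are illustrative."""
    import onnxruntime_genai as og  # deferred; requires onnxruntime-genai

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    generator = og.Generator(model, params)  # created once, not per turn

    replies = []
    for user_text in turns:
        # Append only the new turn; earlier context stays in the KV cache.
        generator.append_tokens(tokenizer.encode(user_text))
        stream = tokenizer.create_stream()
        reply = []
        while not generator.is_done():
            generator.generate_next_token()
            # Decode just the newly sampled token for this turn's reply.
            reply.append(stream.decode(generator.get_next_tokens()[0]))
        replies.append("".join(reply))
    return replies
```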
Constrained Decoding
Generate outputs that conform to specific grammars or JSON schemas for tool calling and structured outputs.
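As a sketch of structured output, the example below constrains generation to a JSON schema. The `set_guidance` call reflects the library's grammar/guidance feature, but its exact name and signature are an assumption here; verify against the API reference for your installed version. The schema and function name are illustrative.

```python
import json


def generate_json(model_dir: str, prompt: str, schema: dict) -> str:
    """Sketch of constrained decoding against a JSON schema.

    The `set_guidance("json_schema", ...)` call is an assumption based on
    the library's guidance feature; check your version's API docs.
    """
    import onnxruntime_genai as og  # deferred; requires onnxruntime-genai

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=512)
    # Constrain sampling so every token keeps the output valid per the schema.
    params.set_guidance("json_schema", json.dumps(schema))

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))
```

The same mechanism could drive tool calling, where the schema describes the tool's argument structure.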
Platform Support
| Platform | Supported |
|---|---|
| Operating Systems | Linux, Windows, macOS, Android |
| Architectures | x86, x64, arm64 |
| Execution Providers | CPU, CUDA, DirectML, TensorRT-RTX, OpenVINO, QNN, WebGPU |
| Languages | Python, C#, C/C++, Java (build from source) |
ONNX Runtime GenAI is actively developed with regular updates. Check the GitHub repository for the latest features and supported models.
Next Steps
Installation
Install ONNX Runtime GenAI for your platform and language
Quickstart
Run your first model in minutes with our quickstart guide