Introduction to ONNX Runtime GenAI

ONNX Runtime GenAI provides an easy, flexible, and performant way to run generative AI models on device. It implements the complete generative AI loop for ONNX models, handling all the complexity of inference so you can focus on building applications.

What is ONNX Runtime GenAI?

ONNX Runtime GenAI is a library that runs Large Language Models (LLMs) and other generative AI models with ONNX Runtime. It provides a high-level API that abstracts away the complexities of:
  • Pre- and post-processing
  • Inference with ONNX Runtime
  • Logits processing
  • Search and sampling
  • KV cache management
  • Grammar specification for tool calling
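To make the "complete generative AI loop" concrete, here is a minimal, self-contained sketch of what search and sampling involve: score the sequence, pick the next token (greedy or temperature sampling), append it, and stop at end-of-sequence or a length limit. This is a conceptual toy in plain Python, not ONNX Runtime GenAI's implementation; `step_fn` stands in for the model forward pass that the library runs with ONNX Runtime (reusing the KV cache so each step is incremental).

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(step_fn, prompt_tokens, eos_token, max_length, temperature=1.0, seed=0):
    """Toy generative loop: repeatedly score the sequence, choose a token,
    append it, and stop at EOS or the length limit.

    step_fn(tokens) -> logits is a stand-in for the model forward pass.
    """
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        logits = step_fn(tokens)
        if temperature == 0:
            # Greedy search: take the highest-scoring token.
            next_token = max(range(len(logits)), key=logits.__getitem__)
        else:
            # Temperature sampling: flatten or sharpen the distribution, then draw.
            probs = softmax([l / temperature for l in logits])
            next_token = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```

The library implements this loop natively, alongside the pieces the toy omits: tokenization, batched inference, logits processors, and KV cache management.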

Why Use ONNX Runtime GenAI?

Cross-Platform

Run models on Windows, Linux, macOS, and Android with support for x86, x64, and arm64 architectures.

Hardware Acceleration

Leverage CPU, CUDA, DirectML, TensorRT, OpenVINO, QNN, and WebGPU for optimal performance.

Multiple Languages

Use Python, C#, C/C++, or Java APIs to integrate into your application.

Production Ready

Powers Microsoft products including Foundry Local, Windows ML, and Visual Studio Code AI Toolkit.

Key Capabilities

Supported Model Architectures

ONNX Runtime GenAI supports a wide range of popular model architectures:
  • Language Models: Llama, Phi, Mistral, Gemma, Qwen, DeepSeek, Granite, InternLM2, SmolLM3, and more
  • Vision Models: Phi-3 Vision, Qwen2-VL
  • Speech Models: Whisper
  • Other: ChatGLM, ERNIE, Fara, Nemotron, AMD OLMo

Advanced Features

  • Multi-LoRA: Run multiple Low-Rank Adaptation (LoRA) models efficiently for fine-tuned model inference.
  • Multi-turn chat: Maintain conversation context across multiple turns for chat applications.
  • Constrained decoding: Generate outputs that conform to specific grammars or JSON schemas for tool calling and structured outputs.
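Grammar- and schema-constrained generation works by masking the model's logits at each step so only tokens the grammar permits can be chosen. The sketch below is a conceptual toy, not ONNX Runtime GenAI's constrained-decoding implementation; `step_fn` and `allowed_fn` are hypothetical stand-ins for the model forward pass and the grammar's next-token filter.

```python
def constrained_greedy(step_fn, allowed_fn, prompt_tokens, max_new):
    """Toy constrained decoding: at each step, restrict the choice to tokens
    the grammar permits, then greedily pick the highest-scoring one.

    step_fn(tokens) -> logits; allowed_fn(tokens) -> set of permitted token ids.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = step_fn(tokens)
        allowed = allowed_fn(tokens)
        if not allowed:
            # The grammar accepts the sequence as complete.
            break
        next_token = max(allowed, key=lambda t: logits[t])
        tokens.append(next_token)
    return tokens
```

Because forbidden tokens are excluded before selection, the output is guaranteed to match the grammar even when the unconstrained model would prefer an invalid token, which is what makes schema-conformant tool-call arguments possible.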

Platform Support

  • Operating Systems: Linux, Windows, macOS, Android
  • Architectures: x86, x64, arm64
  • Execution Providers: CPU, CUDA, DirectML, TensorRT-RTX, OpenVINO, QNN, WebGPU
  • Languages: Python, C#, C/C++, Java (build from source)

ONNX Runtime GenAI is actively developed with regular updates. Check the GitHub repository for the latest features and supported models.

Next Steps

Installation

Install ONNX Runtime GenAI for your platform and language

Quickstart

Run your first model in minutes with our quickstart guide