
High-Performance Generative AI with ONNX Runtime

Run LLMs and multi-modal models on any device with ease. A complete inference library with optimized KV cache management, sampling strategies, and hardware acceleration.

Quick Start

Get your first generative AI model running in minutes

1. Install the package

Install ONNX Runtime GenAI for your preferred language. For GPU acceleration, install the matching package instead (for example, onnxruntime-genai-cuda).
pip install onnxruntime-genai
2. Download a model

Download a pre-optimized ONNX model from Hugging Face.
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir .
See Download Models for all available options, including Foundry Local.
3. Run inference

Generate text with just a few lines of code.
import onnxruntime_genai as og

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
# A TokenizerStream decodes token-by-token without garbling word pieces
tokenizer_stream = tokenizer.create_stream()

prompt = "What is the capital of France?"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end='', flush=True)
Check out complete examples for chat, vision, and audio models.

Explore by Topic

Dive deep into core concepts, guides, and API references

Core Concepts

Understand models, generation strategies, and KV cache management

Multi-Modal

Work with vision models like Phi-Vision, Qwen-VL, and Gemma

Hardware Acceleration

Optimize with CUDA, DirectML, OpenVINO, QNN, and WebGPU

Python API

Complete Python API reference for all classes and methods

C++ API

Zero-overhead C++ wrapper for high-performance applications

Model Builder

Convert and optimize your own models for ONNX Runtime GenAI

Key Features

Everything you need to deploy generative AI at scale

Multi-Language Support

Use Python, C++, C#, C, or Java bindings with the same performant core

20+ Model Architectures

Llama, Phi, Gemma, Qwen, Mistral, Whisper, and more out of the box

Multi-Modal Ready

Vision and audio models with built-in preprocessing and feature extraction

Advanced Decoding

Constrained decoding, beam search, Multi-LoRA, and continuous decoding

Ready to Build?

Start deploying high-performance generative AI models on any device with ONNX Runtime GenAI
