
Quickstart Guide

This guide walks you through running your first generative AI model with ONNX Runtime GenAI. We’ll use the Phi-3 model, which is optimized for on-device AI scenarios, as an example.
This quickstart uses Python. For C# or C++ examples, see the examples directory.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or later installed
  • ONNX Runtime GenAI installed (see Installation)
  • At least 4GB of free disk space for the model
  • 8GB+ RAM recommended
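
The Python version requirement can be verified quickly from the interpreter (a trivial stdlib check):

```python
# Verify the Python prerequisite listed above.
import sys

assert sys.version_info >= (3, 8), "ONNX Runtime GenAI requires Python 3.8 or later"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected")
```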

Step 1: Download the Model

First, download a pre-optimized ONNX model. We’ll use the Phi-3 Mini model optimized for CPU.

1. Install Hugging Face CLI

pip install "huggingface-hub[cli]"

2. Download Phi-3 Model

Download the CPU-optimized INT4 quantized model:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir .
This downloads the model to ./cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
For GPU acceleration, download a GPU-optimized variant:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cuda/cuda-int4-rtn-block-32/* \
  --local-dir .

Alternative: Download via Foundry Local

You can also use Foundry Local to download models:
# Install Foundry Local from https://github.com/microsoft/Foundry-Local/releases
foundry model list
foundry model download Phi-4-generic-cpu
foundry cache location

Step 2: Install Required Packages

Ensure you have the necessary Python packages:
pip install numpy
pip install --pre onnxruntime-genai

Step 3: Run Your First Model

Create a Python script to run inference with streaming output:
import onnxruntime_genai as og

# Load the model
model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Set generation parameters
search_options = {
    'max_length': 2048,
    'batch_size': 1
}

# Define chat template
chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

# Get user input
text = input("Input: ")
if not text:
    print("Error: input cannot be empty")
    exit()

# Format prompt with chat template
prompt = chat_template.format(input=text)

# Encode the prompt
input_tokens = tokenizer.encode(prompt)

# Create generator
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

# Generate tokens
print("Output: ", end='', flush=True)

try:
    generator.append_tokens(input_tokens)
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(stream.decode(new_token), end='', flush=True)
except KeyboardInterrupt:
    print("  --control+c pressed, aborting generation--")

print()
del generator

Step 4: Run the Script

Execute your script:
python simple_chat.py

Expected Output

You should see output similar to:
Input: What is the capital of France?
Output: The capital of France is Paris. It is located in the north-central 
part of the country and is known for its rich history, culture, and iconic 
landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.

Understanding the Code

Let’s break down the key components:

1. Load Model and Tokenizer

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()
  • Model: Loads the ONNX model from the specified directory
  • Tokenizer: Handles text encoding/decoding using the model’s vocabulary
  • TokenizerStream: Enables streaming token decoding for real-time output

2. Configure Generation Parameters

search_options = {
    'max_length': 2048,
    'batch_size': 1
}
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
  • max_length: Maximum number of tokens to generate
  • batch_size: Number of sequences to generate simultaneously
  • Additional options: top_k, top_p, temperature, num_beams
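
The additional options listed above are set the same way as max_length and batch_size. A minimal sketch (the values here are illustrative, not recommendations; do_sample is the flag that switches from greedy search to sampling):

```python
# Illustrative sampling configuration; values are examples only.
search_options = {
    'max_length': 2048,
    'batch_size': 1,
    'do_sample': True,   # enable sampling instead of greedy decoding
    'temperature': 0.7,  # higher values produce more varied output
    'top_k': 50,         # sample only from the 50 most likely tokens
    'top_p': 0.9,        # nucleus sampling threshold
}

# Applied exactly as in Step 3:
# params = og.GeneratorParams(model)
# params.set_search_options(**search_options)
```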

3. Encode Input and Generate

input_tokens = tokenizer.encode(prompt)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end='', flush=True)
  • Encode text to tokens
  • Create generator with model and parameters
  • Generate tokens one at a time in a loop
  • Decode and print each token for streaming output

Advanced Examples

Continuous Chat with History

For a chat application that maintains conversation history:
import onnxruntime_genai as og

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

# System prompt
system_prompt = '<|system|>\nYou are a helpful AI assistant.<|end|>\n'
system_tokens = tokenizer.encode(system_prompt)

generator = og.Generator(model, params)
generator.append_tokens(system_tokens)
system_prompt_length = len(system_tokens)

while True:
    text = input("Prompt (use quit() to exit): ")
    if text == "quit()":
        break
    
    # Format user message
    user_prompt = f'<|user|>\n{text}<|end|>\n<|assistant|>'
    user_tokens = tokenizer.encode(user_prompt)
    generator.append_tokens(user_tokens)
    
    print("\nOutput: ", end='', flush=True)
    
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(stream.decode(new_token), end='', flush=True)
    
    print("\n")

Performance Tips

Choose the Right Quantization

  • INT4: Best for CPU, smallest model size
  • FP16: Recommended for GPUs
  • FP32: Highest accuracy, larger size

Use Appropriate Hardware

  • CPU: Good for testing and small models
  • CUDA: Best for NVIDIA GPUs
  • DirectML: Windows GPU acceleration
  • TensorRT: Optimized NVIDIA inference

Batch Processing

Process multiple prompts together to improve throughput. Set batch_size in the search options to match the number of prompts:
prompts = ["prompt1", "prompt2", "prompt3"]
input_tokens = tokenizer.encode_batch(prompts)  # one token sequence per prompt

Adjust Generation Parameters

  • Lower max_length for faster responses
  • Adjust temperature for creativity (0.0-1.0)
  • Use top_k and top_p for quality/speed tradeoff

Common Issues and Solutions

Slow generation:
  • Use GPU acceleration if available
  • Download INT4 quantized models for CPU
  • Reduce the max_length parameter
  • Close other applications to free up RAM
Out-of-memory errors:
  • Use smaller batch sizes
  • Download a more heavily quantized model (INT4 instead of FP16)
  • Reduce the max_length parameter
  • Ensure you have enough RAM/VRAM for the model
If the model fails to load, verify the model path is correct:
import os
model_path = 'cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4'
if os.path.isdir(model_path):
    print(f"Contents: {os.listdir(model_path)}")
else:
    print(f"Model directory not found: {model_path}")
The directory should contain:
  • genai_config.json
  • *.onnx files
  • Tokenizer files
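
The list above can also be checked programmatically. A small stdlib-only sketch (check_model_dir is a hypothetical helper, not part of the GenAI API; it checks only for genai_config.json by default since the exact .onnx and tokenizer file names vary by model):

```python
import os

def check_model_dir(model_path, expected=('genai_config.json',)):
    """Return the expected files missing from model_path ([] means all present)."""
    if not os.path.isdir(model_path):
        return list(expected)
    return [f for f in expected
            if not os.path.exists(os.path.join(model_path, f))]

# Example, using the path from this guide (adjust to your download location):
missing = check_model_dir('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
print("Missing:", missing or "none")
```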
Poor or empty output:
  • Verify the chat template matches your model
  • Check that the input prompt is not empty
  • Ensure max_length is sufficient
  • Try adjusting temperature and sampling parameters
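
To sanity-check a chat template before debugging deeper, the prompt string can be assembled and inspected with plain string formatting. A sketch using the same <|system|>/<|user|>/<|assistant|> markers as the scripts in this guide:

```python
# Assemble a Phi-3-style prompt from the markers used in the scripts above.
system = "You are a helpful AI assistant."
user = "What is the capital of France?"
prompt = (
    f"<|system|>\n{system}<|end|>\n"
    f"<|user|>\n{user}<|end|>\n"
    f"<|assistant|>"
)
print(prompt)  # inspect the exact string before passing it to tokenizer.encode
```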

Next Steps

Explore More Models

Browse ONNX models on Hugging Face for different use cases

Advanced Features

Learn about:
  • Multi-LoRA support
  • Constrained decoding for JSON output
  • Vision and audio models
  • Custom model optimization

API Reference

Detailed documentation of all classes and methods in the ONNX Runtime GenAI API

Examples Repository

Complete examples for Python, C#, C++, and more advanced scenarios

Download Models

For a comprehensive guide on downloading and preparing models, see:

Download Models Guide

Learn how to download models via Foundry Local, Hugging Face, or build your own
