
Quickstart Guide

This guide walks you through running your first generative AI model with ONNX Runtime GenAI. We’ll use the Phi-3 model, which is optimized for on-device AI scenarios, as an example.
This quickstart uses Python. For C# or C++ examples, see the examples directory.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or later installed
  • ONNX Runtime GenAI installed (see Installation)
  • At least 4GB of free disk space for the model
  • 8GB+ RAM recommended
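
The Python version requirement can be verified quickly from the interpreter (a trivial stdlib check):

```python
# Verify the Python prerequisite listed above.
import sys

assert sys.version_info >= (3, 8), "ONNX Runtime GenAI requires Python 3.8 or later"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected")
```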

Step 1: Download the Model

First, download a pre-optimized ONNX model. We’ll use the Phi-3 Mini model optimized for CPU.

1. Install Hugging Face CLI

pip install "huggingface-hub[cli]"

2. Download Phi-3 Model

Download the CPU-optimized INT4 quantized model:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir .
This downloads the model to ./cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
For GPU acceleration, download a GPU-optimized variant:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include cuda/cuda-int4-rtn-block-32/* \
  --local-dir .

Alternative: Download via Foundry Local

You can also use Foundry Local to download models:
# Install Foundry Local from https://github.com/microsoft/Foundry-Local/releases
foundry model list
foundry model download Phi-4-generic-cpu
foundry cache location

Step 2: Install Required Packages

Ensure you have the necessary Python packages:
pip install numpy
pip install --pre onnxruntime-genai

Step 3: Run Your First Model

Create a Python script to run inference with streaming output:
import onnxruntime_genai as og

# Load the model
model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Set generation parameters
search_options = {
    'max_length': 2048,
    'batch_size': 1
}

# Define chat template
chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

# Get user input
text = input("Input: ")
if not text:
    print("Error: input cannot be empty")
    exit()

# Format prompt with chat template
prompt = chat_template.format(input=text)

# Encode the prompt
input_tokens = tokenizer.encode(prompt)

# Create generator
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

# Generate tokens
print("Output: ", end='', flush=True)

try:
    generator.append_tokens(input_tokens)
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(stream.decode(new_token), end='', flush=True)
except KeyboardInterrupt:
    print("  --control+c pressed, aborting generation--")

print()
del generator

Step 4: Run the Script

Execute your script:
python simple_chat.py

Expected Output

You should see output similar to:
Input: What is the capital of France?
Output: The capital of France is Paris. It is located in the north-central 
part of the country and is known for its rich history, culture, and iconic 
landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.

Understanding the Code

Let’s break down the key components:

1. Load Model and Tokenizer

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()
  • Model: Loads the ONNX model from the specified directory
  • Tokenizer: Handles text encoding/decoding using the model’s vocabulary
  • TokenizerStream: Enables streaming token decoding for real-time output

2. Configure Generation Parameters

search_options = {
    'max_length': 2048,
    'batch_size': 1
}
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
  • max_length: Maximum number of tokens to generate
  • batch_size: Number of sequences to generate simultaneously
  • Additional options: top_k, top_p, temperature, num_beams
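
The additional options listed above are set the same way as max_length and batch_size. A minimal sketch (the values here are illustrative, not recommendations; do_sample is the flag that switches from greedy search to sampling):

```python
# Illustrative sampling configuration; values are examples only.
search_options = {
    'max_length': 2048,
    'batch_size': 1,
    'do_sample': True,   # enable sampling instead of greedy decoding
    'temperature': 0.7,  # higher values produce more varied output
    'top_k': 50,         # sample only from the 50 most likely tokens
    'top_p': 0.9,        # nucleus sampling threshold
}

# Applied exactly as in Step 3:
# params = og.GeneratorParams(model)
# params.set_search_options(**search_options)
```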

3. Encode Input and Generate

input_tokens = tokenizer.encode(prompt)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end='', flush=True)
  • Encode text to tokens
  • Create generator with model and parameters
  • Generate tokens one at a time in a loop
  • Decode and print each token for streaming output

Advanced Examples

Continuous Chat with History

For a chat application that maintains conversation history:
import onnxruntime_genai as og

model = og.Model('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

# System prompt
system_prompt = '<|system|>\nYou are a helpful AI assistant.<|end|>\n'
system_tokens = tokenizer.encode(system_prompt)

generator = og.Generator(model, params)
generator.append_tokens(system_tokens)
system_prompt_length = len(system_tokens)

while True:
    text = input("Prompt (use quit() to exit): ")
    if text == "quit()":
        break
    
    # Format user message
    user_prompt = f'<|user|>\n{text}<|end|>\n<|assistant|>'
    user_tokens = tokenizer.encode(user_prompt)
    generator.append_tokens(user_tokens)
    
    print("\nOutput: ", end='', flush=True)
    
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(stream.decode(new_token), end='', flush=True)
    
    print("\n")

Performance Tips

Choose the Right Quantization

  • INT4: Best for CPU, smallest model size
  • FP16: Recommended for GPUs
  • FP32: Highest accuracy, larger size

Use Appropriate Hardware

  • CPU: Good for testing and small models
  • CUDA: Best for NVIDIA GPUs
  • DirectML: Windows GPU acceleration
  • TensorRT: Optimized NVIDIA inference

Batch Processing

Process multiple prompts together to improve throughput. Set batch_size in the search options to match the number of prompts:
prompts = ["prompt1", "prompt2", "prompt3"]
input_tokens = tokenizer.encode_batch(prompts)  # one token sequence per prompt

Adjust Generation Parameters

  • Lower max_length for faster responses
  • Adjust temperature for creativity (0.0-1.0)
  • Use top_k and top_p for quality/speed tradeoff

Common Issues and Solutions

Slow generation:
  • Use GPU acceleration if available
  • Download INT4 quantized models for CPU
  • Reduce the max_length parameter
  • Close other applications to free up RAM
Out-of-memory errors:
  • Use smaller batch sizes
  • Download a more heavily quantized model (INT4 instead of FP16)
  • Reduce the max_length parameter
  • Ensure you have enough RAM/VRAM for the model
If the model fails to load, verify the model path is correct:
import os
model_path = 'cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4'
if os.path.isdir(model_path):
    print(f"Contents: {os.listdir(model_path)}")
else:
    print(f"Model directory not found: {model_path}")
The directory should contain:
  • genai_config.json
  • *.onnx files
  • Tokenizer files
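
The list above can also be checked programmatically. A small stdlib-only sketch (check_model_dir is a hypothetical helper, not part of the GenAI API; it checks only for genai_config.json by default since the exact .onnx and tokenizer file names vary by model):

```python
import os

def check_model_dir(model_path, expected=('genai_config.json',)):
    """Return the expected files missing from model_path ([] means all present)."""
    if not os.path.isdir(model_path):
        return list(expected)
    return [f for f in expected
            if not os.path.exists(os.path.join(model_path, f))]

# Example, using the path from this guide (adjust to your download location):
missing = check_model_dir('cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4')
print("Missing:", missing or "none")
```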
Poor or empty output:
  • Verify the chat template matches your model
  • Check that the input prompt is not empty
  • Ensure max_length is sufficient
  • Try adjusting temperature and sampling parameters
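
To sanity-check a chat template before debugging deeper, the prompt string can be assembled and inspected with plain string formatting. A sketch using the same <|system|>/<|user|>/<|assistant|> markers as the scripts in this guide:

```python
# Assemble a Phi-3-style prompt from the markers used in the scripts above.
system = "You are a helpful AI assistant."
user = "What is the capital of France?"
prompt = (
    f"<|system|>\n{system}<|end|>\n"
    f"<|user|>\n{user}<|end|>\n"
    f"<|assistant|>"
)
print(prompt)  # inspect the exact string before passing it to tokenizer.encode
```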

Next Steps

Explore More Models

Browse ONNX models on Hugging Face for different use cases

Advanced Features

Learn about:
  • Multi-LoRA support
  • Constrained decoding for JSON output
  • Vision and audio models
  • Custom model optimization

API Reference

Detailed documentation of all classes and methods in the ONNX Runtime GenAI API

Examples Repository

Complete examples for Python, C#, C++, and more advanced scenarios

Download Models

For a comprehensive guide on downloading and preparing models, see:

Download Models Guide

Learn how to download models via Foundry Local, Hugging Face, or build your own
