Quickstart Guide
This guide walks you through running your first generative AI model with ONNX Runtime GenAI. We’ll use the Phi-3 model as an example; it is optimized for on-device AI scenarios.

This quickstart uses Python. For C# or C++ examples, see the examples directory.
Prerequisites
Before starting, ensure you have:
- Python 3.8 or later installed
- ONNX Runtime GenAI installed (see Installation)
- At least 4GB of free disk space for the model
- 8GB+ RAM recommended
Step 1: Download the Model
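A common route is the Hugging Face CLI. The command below mirrors the layout Microsoft publishes for the CPU INT4 build of Phi-3 Mini; treat the repository name and folder path as examples and check the model card if they have changed:

```shell
# Requires the CLI: pip install "huggingface_hub[cli]"
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --include "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*" \
  --local-dir .
```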
Download a pre-optimized ONNX model; we’ll use the Phi-3 Mini model optimized for CPU.

Alternative: Download via Foundry Local
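A sketch with the Foundry Local CLI; the subcommands and model alias here are assumptions, so run `foundry --help` and `foundry model list` to confirm what your installation supports:

```shell
# List the models Foundry Local knows about, then download one
foundry model list
foundry model download phi-3-mini-4k   # alias is an example; pick one from the list
```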
You can also use Foundry Local to download models.

Step 2: Install Required Packages
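For CPU inference the published wheel is onnxruntime-genai; hardware-specific variants ship as separate packages:

```shell
pip install onnxruntime-genai            # CPU
# pip install onnxruntime-genai-cuda     # NVIDIA CUDA GPUs
# pip install onnxruntime-genai-directml # Windows DirectML GPUs
```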
Ensure you have the necessary Python packages installed.

Step 3: Run Your First Model
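A minimal streaming script might look like the following. The class and method names (og.Model, og.Tokenizer, og.GeneratorParams, og.Generator, append_tokens) follow recent onnxruntime-genai releases; older releases used a slightly different generation loop, and the model path and prompt template below are placeholders for the Phi-3 files you downloaded:

```python
import onnxruntime_genai as og

# Path to the folder containing genai_config.json and the .onnx files
model_path = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3 chat template; other model families use different templates
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

# Stream tokens to the terminal as they are produced
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
```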
Create a Python script (for example, phi3_demo.py) that runs inference with streaming output.

Step 4: Run the Script
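Assuming you saved the script under a name like phi3_demo.py (the filename is your choice):

```shell
python phi3_demo.py
```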
Execute your script from the command line.

Expected Output
You should see the model’s reply streamed to the terminal token by token.

Understanding the Code
Let’s break down the key components.

Load Model and Tokenizer
- Model: Loads the ONNX model from the specified directory
- Tokenizer: Handles text encoding/decoding using the model’s vocabulary
- TokenizerStream: Enables streaming token decoding for real-time output
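In code (method names per recent onnxruntime-genai releases; the path is a placeholder):

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")             # folder with genai_config.json
tokenizer = og.Tokenizer(model)               # text <-> token ids
tokenizer_stream = tokenizer.create_stream()  # incremental decoding for streaming
```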
Configure Generation Parameters
- max_length: Maximum number of tokens to generate
- batch_size: Number of sequences to generate simultaneously
- Additional options: top_k, top_p, temperature, num_beams
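These options are typically passed through set_search_options; the keyword names below are assumed to match recent releases, so consult the API reference if one is rejected:

```python
params = og.GeneratorParams(model)  # model as created in the previous snippet
params.set_search_options(
    max_length=200,   # cap on generated tokens
    batch_size=1,     # sequences generated at once
    temperature=0.7,  # higher = more random output
    top_k=50,
    top_p=0.9,
)
```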
Advanced Examples
Continuous Chat with History
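One way to sketch this, reusing the generation loop from the quickstart (API names per recent onnxruntime-genai releases; the model path and Phi-3 template are placeholders):

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

history = []  # list of (user, assistant) turns
while True:
    user_text = input("You: ")
    if not user_text:
        break

    # Rebuild the full Phi-3-style prompt from the history each turn
    prompt = ""
    for u, a in history:
        prompt += f"<|user|>\n{u}<|end|>\n<|assistant|>\n{a}<|end|>\n"
    prompt += f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=1024)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    reply = ""
    while not generator.is_done():
        generator.generate_next_token()
        piece = stream.decode(generator.get_next_tokens()[0])
        reply += piece
        print(piece, end="", flush=True)
    print()
    history.append((user_text, reply))
```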
A chat application can maintain conversation history by replaying the accumulated turns through the model’s chat template on each request.

Performance Tips
Choose the Right Quantization
- INT4: Best for CPU, smallest model size
- FP16: Recommended for GPUs
- FP32: Highest accuracy, larger size
Use Appropriate Hardware
- CPU: Good for testing and small models
- CUDA: Best for NVIDIA GPUs
- DirectML: Windows GPU acceleration
- TensorRT: Optimized NVIDIA inference
Batch Processing
Process multiple prompts together to improve throughput:
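A batched sketch; encode_batch and get_sequence exist in recent onnxruntime-genai releases, but batched generation details vary across versions, so check the API reference:

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

prompts = [
    "<|user|>\nWhat is ONNX?<|end|>\n<|assistant|>\n",
    "<|user|>\nName one use of quantization.<|end|>\n<|assistant|>\n",
]

params = og.GeneratorParams(model)
params.set_search_options(max_length=200, batch_size=len(prompts))

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode_batch(prompts))

while not generator.is_done():
    generator.generate_next_token()

# Decode each finished sequence
for i in range(len(prompts)):
    print(tokenizer.decode(generator.get_sequence(i)))
```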
Adjust Generation Parameters
- Lower max_length for faster responses
- Adjust temperature for creativity (0.0-1.0)
- Use top_k and top_p for quality/speed tradeoff
Common Issues and Solutions
Slow Generation Speed
- Use GPU acceleration if available
- Download INT4 quantized models for CPU
- Reduce the max_length parameter
- Close other applications to free up RAM
Out of Memory Errors
- Use smaller batch sizes
- Download a more quantized model (INT4 vs FP16)
- Reduce the max_length parameter
- Ensure enough RAM/VRAM for the model
Model Not Found Error
Verify the model path is correct. The directory should contain:
- genai_config.json
- *.onnx files
- Tokenizer files
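A quick, dependency-free way to sanity-check the directory (the path is a placeholder):

```python
import os

def check_model_dir(path):
    """Return a list of problems found in a GenAI model directory."""
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]
    files = os.listdir(path)
    problems = []
    if "genai_config.json" not in files:
        problems.append("missing genai_config.json")
    if not any(f.endswith(".onnx") for f in files):
        problems.append("no .onnx files found")
    return problems

for problem in check_model_dir("path/to/model"):  # placeholder path
    print("Problem:", problem)
```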
Empty or Incorrect Output
- Verify chat template matches your model
- Check that input prompt is not empty
- Ensure max_length is sufficient
- Try adjusting temperature and sampling parameters
Next Steps
Explore More Models
Browse ONNX models on Hugging Face for different use cases
Advanced Features
Learn about:
- Multi-LoRA support
- Constrained decoding for JSON output
- Vision and audio models
- Custom model optimization
API Reference
Detailed documentation of all classes and methods in the ONNX Runtime GenAI API
Examples Repository
Complete examples for Python, C#, C++, and more advanced scenarios
Download Models
For a comprehensive guide on downloading and preparing models, see:

Download Models Guide
Learn how to download models via Foundry Local, Hugging Face, or build your own