Microsoft’s Phi vision models are compact yet powerful multi-modal models that combine visual understanding with language capabilities. ONNX Runtime GenAI supports the Phi-3 Vision, Phi-3.5 Vision, and Phi-4 Multimodal models.

Supported Models

  • Phi-3 Vision: 128k context length vision model for image understanding
  • Phi-3.5 Vision: Enhanced vision capabilities with improved accuracy
  • Phi-4 Multimodal: Latest model, supporting both vision and audio inputs

Model Architecture

Phi vision models are multi-modal models consisting of several internal components:
  • Vision Encoder: Processes images and extracts visual features
  • Image Embedding: Converts visual features into embeddings compatible with the language model
  • Language Model: Core transformer model for text generation
  • Fusion Layers: Combine visual and text embeddings
For ONNX Runtime GenAI, each internal component is exported as a separate ONNX model for optimal performance.
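
For illustration, a build typically produces one ONNX file per component plus the configuration files. The filenames below are examples only; the exact names depend on the builder version:

# Example layout of the output folder (illustrative filenames)
cpu/
  genai_config.json                      # generation and session configuration
  processor_config.json                  # image processor configuration
  phi-3-v-128k-instruct-vision.onnx      # vision encoder
  phi-3-v-128k-instruct-embedding.onnx   # image/text embedding
  phi-3-v-128k-instruct-text.onnx        # language model decoder
  tokenizer.json                         # tokenizer files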

Building Phi Vision Models

Phi-3 Vision (128k Context)

Step 1: Download PyTorch Model

# Create workspace
mkdir -p phi3-vision-128k-instruct/pytorch
cd phi3-vision-128k-instruct/pytorch

# Download from Hugging Face
huggingface-cli download microsoft/Phi-3-vision-128k-instruct --local-dir .

Step 2: Download Modified Files

cd ..
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx \
  --include onnx/* --local-dir .

Step 3: Replace Modeling Files

# Replace config (flash_attention_2 -> eager)
rm pytorch/config.json
mv onnx/config.json pytorch/

# Replace modified modeling file
rm pytorch/modeling_phi3_v.py
mv onnx/modeling_phi3_v.py pytorch/

# Add ONNX export helper
mv onnx/image_embedding_phi3_v_for_onnx.py pytorch/

# Move builder script
mv onnx/builder.py .
rm -rf onnx/

Step 4: Build ONNX Models

python3 builder.py \
  --input ./pytorch \
  --output ./cpu \
  --precision fp32 \
  --execution_provider cpu

Step 5: Add Configuration Files

Download the required JSON configuration files (genai_config.json, processor_config.json, and the tokenizer JSON files) into the build output folder:
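
One way to do this, assuming the configs are published alongside the prebuilt ONNX models, is to pull just the JSON files from the model repository (adjust the repository and target folder to your setup):

# Hypothetical example: fetch the JSON configs into the build output
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx \
  --include "*.json" --local-dir ./cpu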

Using Phi Vision Models

Basic Image Understanding

import onnxruntime_genai as og

# Load model (point at the folder produced by builder.py, e.g. ./cpu)
config = og.Config("./phi3-vision-128k-instruct/cpu")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("image.jpg")

# Create prompt
prompt = "<|user|>\n<|image_1|>\nWhat is shown in this image?<|end|>\n<|assistant|>\n"

# Process inputs
inputs = processor(prompt, images=images)

# Generate response
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()

Multi-Image Processing

import onnxruntime_genai as og

# Reuses model, processor, tokenizer and params from the previous example

# Load multiple images
images = og.Images.open("image1.jpg", "image2.jpg", "image3.jpg")

# Reference images in prompt
prompt = """
<|user|>
<|image_1|>
<|image_2|>
<|image_3|>
Compare these three images and describe their similarities and differences.
<|end|>
<|assistant|>
"""

# Process and generate
inputs = processor(prompt, images=images)
generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()

Chat Template Integration

import json
import onnxruntime_genai as og

# Create messages with image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply chat template (ORT GenAI's tokenizer takes the messages as a JSON string)
if hasattr(tokenizer, 'apply_chat_template'):
    prompt = tokenizer.apply_chat_template(messages=json.dumps(messages), add_generation_prompt=True)
else:
    # Manual template fallback
    prompt = f"<|user|>\n<|image_1|>\n{messages[0]['content'][1]['text']}<|end|>\n<|assistant|>\n"

images = og.Images.open("image.jpg")
inputs = processor(prompt, images=images)

Image Input Handling

Supported Image Formats

Phi vision models support common image formats:
  • JPEG/JPG
  • PNG
  • BMP
  • TIFF
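
Images can also be loaded from memory instead of from a file path; newer ONNX Runtime GenAI builds expose og.Images.open_bytes for this. A minimal sketch, assuming that API is available in your version:

import onnxruntime_genai as og

# Load an image that is already in memory (e.g. fetched over the network)
with open("image.jpg", "rb") as f:
    image_bytes = f.read()

images = og.Images.open_bytes(image_bytes)  # assumes open_bytes is available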

Image Preprocessing

The processor automatically handles:
  1. Resizing: Images are resized to the model’s expected dimensions
  2. Normalization: Pixel values are normalized
  3. Patch Extraction: Images are divided into patches
  4. Embedding: Visual patches are converted to embeddings

Image Resolution

# Phi-3 Vision supports high-resolution images
# The model automatically handles various resolutions
images = og.Images.open("high_res_image.jpg")  # Automatically preprocessed

Advanced Usage

Batch Processing

import onnxruntime_genai as og

# Process multiple image-text pairs in batch
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
prompts = [
    "<|user|>\n<|image_1|>\nDescribe this<|end|>\n<|assistant|>\n",
    "<|user|>\n<|image_1|>\nWhat do you see?<|end|>\n<|assistant|>\n",
    "<|user|>\n<|image_1|>\nAnalyze this image<|end|>\n<|assistant|>\n"
]

for img_path, prompt in zip(image_paths, prompts):
    images = og.Images.open(img_path)
    inputs = processor(prompt, images=images)
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print(f"Processing {img_path}:")
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer.decode(new_token), end="", flush=True)
    print("\n")

Custom Generation Parameters

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,           # Maximum output length
    do_sample=True,            # Enable sampling
    top_p=0.9,                 # Nucleus sampling
    top_k=50,                  # Top-k sampling
    temperature=0.7,           # Sampling temperature
    repetition_penalty=1.1     # Penalize repetition
)
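
For reproducible outputs, disable sampling so decoding is greedy; a minimal variant of the call above:

params.set_search_options(
    max_length=4096,
    do_sample=False   # greedy decoding: always pick the most likely token
)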

Performance Optimization

Choose the right precision for your hardware:
  • FP32: Best accuracy, slower, works on all devices
  • FP16: Good balance, requires GPU with FP16 support
  • INT4: Fastest, smallest memory footprint, slight accuracy loss
# Build with FP16 for CUDA (swap in --precision int4 for the smallest footprint)
python3 builder.py --input ./pytorch --output ./cuda \
  --precision fp16 --execution_provider cuda

Then select the execution provider when loading the model:

config = og.Config("./cuda")
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)

For large images or long sequences, inspect the prompt size and cap generation:

# Monitor token count (the sequence already contains the prompt tokens)
generator = og.Generator(model, params)
generator.set_inputs(inputs)

input_tokens = len(generator.get_sequence(0))
print(f"Input tokens (including image): {input_tokens}")

# Cap the number of newly generated tokens
max_new_tokens = 1024
generated = 0

while not generator.is_done() and generated < max_new_tokens:
    generator.generate_next_token()
    generated += 1
Fine-Tuning Support

You can use your own fine-tuned Phi vision models:

Step 1: Fine-tune with PyTorch

Fine-tune the model using your preferred training framework.

Step 2: Replace Weights

# After downloading the base model files
# Replace the *.safetensors files with your fine-tuned weights
cp /path/to/finetuned/*.safetensors ./phi3-vision-128k-instruct/pytorch/

Step 3: Build ONNX Models

python3 builder.py --input ./pytorch --output ./cuda \
  --precision fp16 --execution_provider cuda

Step 4: Update Configurations

Modify genai_config.json and processor_config.json if your fine-tuning changed model architecture or tokenizer.
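
As a rough orientation, genai_config.json carries model- and search-level fields like these (an illustrative excerpt, not the full schema):

{
  "model": {
    "context_length": 131072,
    "vocab_size": 32064,
    "eos_token_id": 32007
  },
  "search": {
    "max_length": 131072
  }
}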

Troubleshooting

If image loading fails, verify the path before anything else:

import os

# Verify image path exists
image_path = "image.jpg"
if not os.path.exists(image_path):
    raise FileNotFoundError(f"Image not found: {image_path}")

# Load with error handling
try:
    images = og.Images.open(image_path)
except Exception as e:
    print(f"Error loading image: {e}")
If you encounter OOM errors:
  1. Reduce image resolution before processing (see the Pillow sketch below)
  2. Use INT4 quantization instead of FP16
  3. Reduce the max_length parameter
  4. Process images one at a time instead of batching
# Reduce max output length
params.set_search_options(max_length=1024)  # Instead of 4096
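
For item 1, you can downscale oversized images before handing them to the processor. A minimal sketch using Pillow; the 1344-pixel cap is an arbitrary example, not a model requirement:

from PIL import Image
import onnxruntime_genai as og

MAX_SIDE = 1344  # arbitrary cap for this example

img = Image.open("high_res_image.jpg")
if max(img.size) > MAX_SIDE:
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # resizes in place, preserving aspect ratio
    img.save("resized_image.jpg")
    images = og.Images.open("resized_image.jpg")
else:
    images = og.Images.open("high_res_image.jpg")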
If you see flash attention errors:
# Verify config.json has eager attention
cat pytorch/config.json | grep _attn_implementation
# Should show: "_attn_implementation": "eager"
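
Equivalently, from Python, reading the same field out of config.json:

import json

# Confirm the Hugging Face config requests eager attention
with open("pytorch/config.json") as f:
    cfg = json.load(f)
print(cfg.get("_attn_implementation"))  # expected output: eager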

Example Application

Here’s a complete example script for document analysis:
import onnxruntime_genai as og
import argparse

def analyze_document(image_path, question):
    # Load model
    config = og.Config("./phi3-vision-128k-instruct/cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load document image
    images = og.Images.open(image_path)
    
    # Create prompt
    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=2048,
        do_sample=True,
        top_p=0.9,
        temperature=0.7
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    response = ""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        token_text = tokenizer.decode(new_token)
        response += token_text
        print(token_text, end="", flush=True)
    print()
    
    return response

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True, help="Path to document image")
    parser.add_argument("--question", required=True, help="Question about the document")
    args = parser.parse_args()
    
    analyze_document(args.image, args.question)

Next Steps

  • Qwen Vision: Explore Qwen’s advanced vision models
  • Gemma Vision: Learn about Google’s Gemma vision models
  • Whisper Audio: Add audio processing capabilities
  • Model Quantization: Optimize models with quantization
