Qwen2.5-VL is an advanced vision-language model that supports multi-image understanding, dynamic resolution, and sophisticated spatial reasoning through Multimodal Rotary Position Embedding (MRoPE).

Overview

Qwen2.5-VL models are state-of-the-art vision-language models from Alibaba Cloud that excel at:
  • Multi-image understanding: Process and reason across multiple images
  • Dynamic resolution: Handle images of varying sizes and aspect ratios
  • 3D positional encoding: MRoPE for better spatial understanding
  • Long context: Support for extended context lengths
Qwen2.5-VL uses a unique 3D position encoding scheme with temporal, height, and width dimensions for superior spatial reasoning.

Architecture Details

Multimodal Rotary Position Embedding (MRoPE)

Qwen2.5-VL uses MRoPE to encode positional information in three dimensions:
# Position IDs shape: [3, batch_size, sequence_length]
# The leading axis holds the three position components:
#   index 0: temporal, index 1: height, index 2: width
This 3D encoding allows the model to:
  • Better understand spatial relationships in images
  • Handle dynamic image resolutions
  • Process multi-image inputs with proper position awareness

Model Components

The Qwen2.5-VL architecture consists of:
  1. Vision Encoder: Processes images into visual features
    • Patch embedding for image tokenization
    • Vision attention layers
    • Patch merger for feature aggregation
  2. Language Model: Core text generation model
    • Modified attention with MRoPE
    • Grouped Query Attention (GQA)
    • RMS Layer Normalization (always computed in FP32)
  3. Vision Pipeline: Multi-stage processing
    Image → Patch Embed → Vision Attention → Patch Merger → Embeddings
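The token-count effect of this pipeline can be sketched with a small helper. The patch size of 14 and the 2x2 spatial merge below match Qwen2.5-VL's published defaults, but the helper itself is illustrative, not part of the library:

```python
def vision_token_count(width, height, patch_size=14, merge_size=2):
    """Tokens after patch embedding vs. after the patch merger.
    Assumes the image is already a multiple of patch_size * merge_size."""
    grid_h = height // patch_size
    grid_w = width // patch_size
    patches = grid_h * grid_w              # after patch embedding
    merged = patches // (merge_size ** 2)  # after 2x2 spatial merge
    return patches, merged

print(vision_token_count(1008, 756))  # → (3888, 972)
```

The patch merger is what keeps high-resolution images from exploding the language model's sequence length: a 4x reduction here is a 4x reduction in attention cost downstream.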
    

Building Qwen2.5-VL Models

Qwen2.5-VL requires specific versions of dependencies. Follow the installation steps carefully.

Prerequisites

Step 1: Install ONNX Runtime

# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-genai-cuda

# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu

# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-gpu

Step 2: Install PyTorch

# Ensure PyTorch >= 2.7.0
pip install torch==2.7.0

Step 3: Install Additional Dependencies

pip install transformers
pip install pillow
pip install numpy==1.26.4  # Must be < 2.0.0
pip install --pre onnxscript

Model Export

Step 1: Download Base Model

mkdir -p qwen2.5-vl-7b-instruct
cd qwen2.5-vl-7b-instruct

# Download from Hugging Face
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
  --local-dir ./pytorch

Step 2: Export to ONNX

# Use the builder from ONNX Runtime GenAI source
from onnxruntime_genai.models.builder import build_model

build_model(
    model_name="Qwen/Qwen2.5-VL-7B-Instruct",
    input_path="./pytorch",
    output_path="./onnx-cuda",
    precision="fp16",
    execution_provider="cuda",
    cache_dir="./cache"
)

Precision Options

# FP16 for CUDA (recommended for most GPUs)
python3 -m onnxruntime_genai.models.builder \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --output ./onnx-fp16 \
  --precision fp16 \
  --execution_provider cuda

Using Qwen2.5-VL

Basic Usage

import onnxruntime_genai as og

# Load model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("image.jpg")

# Create prompt
prompt = "Describe this image in detail."

# Process inputs
inputs = processor(prompt, images=images)

# Generate response
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()

Multi-Image Processing

Qwen2.5-VL excels at reasoning across multiple images:
import onnxruntime_genai as og

# Load multiple images
images = og.Images.open(
    "product_front.jpg",
    "product_back.jpg",
    "product_side.jpg"
)

# Ask comparative question
prompt = """
Analyze these three product images:
1. What are the key features visible from different angles?
2. Are there any defects or quality issues?
3. How would you rate the product packaging?
"""

inputs = processor(prompt, images=images)

# Generate detailed analysis
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,
    temperature=0.7,
    top_p=0.9
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)

Chat Conversation with Images

import json
import onnxruntime_genai as og

# Initialize model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Chat history
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can see images."
    },
    {
        "role": "user",
        "content": "What objects can you identify in this image?"
    }
]

# Load image
images = og.Images.open("scene.jpg")

# Convert messages to prompt
messages_json = json.dumps(messages)
prompt = messages_json  # Processor handles chat template

# Process and generate
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

response = ""
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    token_text = tokenizer.decode(new_token)
    response += token_text
    print(token_text, end="", flush=True)

# Add assistant response to history
messages.append({"role": "assistant", "content": response})

Image Preprocessing

Automatic Resolution Handling

Qwen2.5-VL automatically handles various image resolutions:
# High resolution image
high_res = og.Images.open("high_resolution_4k.jpg")

# Low resolution image  
low_res = og.Images.open("thumbnail_128x128.jpg")

# Both are automatically preprocessed to optimal resolution
inputs_high = processor("Analyze this high-res image", images=high_res)
inputs_low = processor("Analyze this thumbnail", images=low_res)

Grid Dimensions

The model uses grid-based image processing with temporal, height, and width dimensions:
# Grid dimensions are automatically calculated based on image size
# For a 1024x768 (width x height) image with patch size 14:
# - Temporal: 1 (single image)
# - Height: 768 / 14 ≈ 55 patches
# - Width: 1024 / 14 ≈ 73 patches
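The arithmetic above can be wrapped in a small helper. This is a sketch only: the real processor first resizes the image so the grid is an exact multiple of the patch and merge sizes, so treat the rounding here as an approximation.

```python
def grid_dims(width, height, patch_size=14, num_frames=1):
    """Approximate (temporal, height, width) grid for one image.
    Sketch only: the real processor resizes to exact patch multiples."""
    grid_t = num_frames                  # 1 for a still image
    grid_h = round(height / patch_size)  # patch rows
    grid_w = round(width / patch_size)   # patch columns
    return grid_t, grid_h, grid_w

print(grid_dims(1024, 768))  # → (1, 55, 73)
```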

Advanced Features

Custom Position IDs

For advanced use cases, you can work with 3D position IDs:
import torch

# Example: Create custom 3D position IDs
# Shape: [3, batch_size, sequence_length]
batch_size = 1
sequence_length = 100

# Temporal, Height, Width dimensions
position_ids = torch.zeros((3, batch_size, sequence_length), dtype=torch.int64)

# For text tokens, all dimensions use same sequential IDs
for i in range(sequence_length):
    position_ids[0, 0, i] = i  # Temporal
    position_ids[1, 0, i] = i  # Height
    position_ids[2, 0, i] = i  # Width

# For image patches, dimensions vary based on spatial location
# (automatically handled by the processor)
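To make that spatial variation concrete, here is a plain-Python sketch of the (temporal, height, width) IDs for a small patch grid. The helper is hypothetical; the processor computes the real IDs internally.

```python
def image_patch_position_ids(grid_h, grid_w, t_offset=0):
    """3D position IDs for one image's patches, shape [3, grid_h * grid_w]."""
    temporal, height, width = [], [], []
    for row in range(grid_h):
        for col in range(grid_w):
            temporal.append(t_offset)  # constant for a still image
            height.append(row)         # varies down the image
            width.append(col)          # varies across the image
    return [temporal, height, width]

ids = image_patch_position_ids(2, 3)
print(ids[1])  # → [0, 0, 0, 1, 1, 1]
print(ids[2])  # → [0, 1, 2, 0, 1, 2]
```

Contrast this with the text-token loop above, where all three components carry the same sequential ID.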

Vision Pipeline Components

Access individual vision pipeline components:
# The vision pipeline consists of:
# 1. Patch Embedding: Image -> Patches
# 2. Vision Attention: Process patches with attention
# 3. Patch Merger: Merge patches to reduce sequence length

# These are automatically orchestrated by the processor
images = og.Images.open("image.jpg")
inputs = processor(prompt, images=images)

# The processor handles:
# - Patch extraction from images
# - Window-based attention reordering
# - Spatial merge operations
# - Final embedding generation

Performance Optimization

Choose the right execution provider for your hardware:
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")

# For NVIDIA GPUs
config.clear_providers()
config.append_provider("cuda")

# For DirectML (AMD/Intel)
# config.append_provider("dml")

model = og.Model(config)
Different precisions offer different trade-offs:
Precision   Speed   Memory   Accuracy    Hardware
FP32        Slow    High     Best        All
FP16        Fast    Medium   Good        Modern GPUs
BF16        Fast    Medium   Very Good   A100, H100
RMSNorm and RoPE are always computed in FP32 internally for numerical stability, regardless of model precision.
Process multiple images efficiently:
# Single batch with multiple images
images = og.Images.open("img1.jpg", "img2.jpg", "img3.jpg")
prompt = "Analyze all these images together."
inputs = processor(prompt, images=images)

# Batch size is automatically determined from inputs
generator = og.Generator(model, params)
generator.set_inputs(inputs)
For large images or long contexts:
# Monitor token count
generator.set_inputs(inputs)
total_tokens = generator.token_count()
print(f"Total input tokens: {total_tokens}")

# Adjust max_length based on available memory
available_memory_gb = 16  # Your GPU memory
if total_tokens > 2048:
    max_new_tokens = 1024  # Reduce for large inputs
else:
    max_new_tokens = 2048

params.set_search_options(max_length=total_tokens + max_new_tokens)

Implementation Details

RoPE Computation

Qwen2.5-VL uses a custom MRoPE implementation:
# Pseudo-code for MRoPE logic
# 1. Calculate dynamic RoPE caches from 3D position_ids
cos_cache, sin_cache = make_dynamic_rope_caches(position_ids)
# Shape: [3, batch_size, sequence_length, head_dim]

# 2. Flatten and reorder based on MRoPE sections
flat_cos, flat_sin = make_mrope_flattened_caches(cos_cache, sin_cache)
# Shape: [batch_size * sequence_length, head_dim / 2]

# 3. Apply rotation to Q and K
q_rotated = apply_mrope_rotation(q, flat_cos, flat_sin)
k_rotated = apply_mrope_rotation(k, flat_cos, flat_sin)

# 4. Grouped Query Attention
output = grouped_query_attention(q_rotated, k_rotated, v)
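Step 2 (flattening by MRoPE sections) can be illustrated with a toy example: the rotary frequency slots are split into contiguous chunks, and each chunk is taken from the cache of the dimension that owns it. The section sizes below are made up for illustration; the real model reads `mrope_section` from its config.

```python
def mrope_merge(caches, sections):
    """Pick each contiguous frequency chunk from its owning dimension's
    cache. caches: [temporal, height, width], each of length head_dim // 2."""
    merged, start = [], 0
    for dim, size in enumerate(sections):
        merged.extend(caches[dim][start:start + size])
        start += size
    return merged

caches = [[f"t{i}" for i in range(8)],   # temporal cache (toy labels)
          [f"h{i}" for i in range(8)],   # height cache
          [f"w{i}" for i in range(8)]]   # width cache
print(mrope_merge(caches, [2, 3, 3]))
# → ['t0', 't1', 'h2', 'h3', 'h4', 'w5', 'w6', 'w7']
```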

Layer Normalization

# Qwen2.5-VL uses RMSNorm with forced FP32 computation
# Regardless of model precision (FP16/BF16), normalization uses FP32

# This is configured automatically in the builder:
layernorm_attrs = {
    "cast": {
        "use_fp32": True,      # Compute in FP32
        "root_input": True,    # Cast input to FP32
        "skip_input": True,    # Cast skip connection to FP32
        "output_0": True,      # Cast output back to model dtype
        "output_3": True       # Cast residual output back
    }
}
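The effect of those casts can be sketched numerically. This NumPy function mirrors the use_fp32 pattern (cast up, normalize, cast back); it is a sketch, not the generated ONNX graph.

```python
import numpy as np

def rms_norm_fp32(x_fp16, weight_fp16, eps=1e-6):
    """RMSNorm computed in FP32, cast back to the model dtype."""
    x = x_fp16.astype(np.float32)        # cast input up to FP32
    w = weight_fp16.astype(np.float32)   # cast weights up to FP32
    variance = np.mean(x * x, axis=-1, keepdims=True)
    y = x / np.sqrt(variance + eps) * w  # normalize in FP32
    return y.astype(np.float16)          # cast output back down

x = np.random.randn(4, 8).astype(np.float16)
w = np.ones(8, dtype=np.float16)
out = rms_norm_fp32(x, w)
print(out.dtype)  # float16
```

Doing the sum of squares in FP16 would overflow for activations larger than ~256 in magnitude, which is why the FP32 accumulation is forced regardless of model precision.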

Troubleshooting

If you see NumPy 2.0 compatibility errors:
# Uninstall numpy 2.0+
pip uninstall -y numpy

# Install compatible version
pip install numpy==1.26.4
Qwen2.5-VL requires PyTorch >= 2.7.0:
# Check version
python -c "import torch; print(torch.__version__)"

# Upgrade if needed
pip install --upgrade torch==2.7.0
If you encounter position_ids shape errors:
# Verify position_ids shape is [3, batch_size, sequence_length]
# This is handled automatically by the processor

# If manually creating inputs, ensure correct shape:
position_ids = torch.zeros((3, 1, seq_len), dtype=torch.int64)
For very high resolution images:
# The model automatically handles resolution
# But you can pre-resize large images if needed:

from PIL import Image

img = Image.open("very_large_image.jpg")
max_size = 2048
if max(img.size) > max_size:
    img.thumbnail((max_size, max_size), Image.LANCZOS)
    img.save("resized.jpg")

images = og.Images.open("resized.jpg")

Example: Document Understanding

import onnxruntime_genai as og
import argparse

def analyze_document(image_path, task="summarize"):
    """Analyze a document image with Qwen2.5-VL."""
    
    # Load model
    config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load document
    images = og.Images.open(image_path)
    
    # Create task-specific prompt
    prompts = {
        "summarize": "Summarize the key points from this document.",
        "extract": "Extract all important information including names, dates, and numbers.",
        "translate": "Translate the text in this document to English.",
        "ocr": "Extract all text from this document."
    }
    
    prompt = prompts.get(task, prompts["summarize"])
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate with appropriate parameters
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=4096,
        temperature=0.3,  # Lower temperature for factual tasks
        top_p=0.8,
        repetition_penalty=1.1
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print(f"\nTask: {task}")
    print(f"Document: {image_path}")
    print("\nResult:")
    print("-" * 50)
    
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer.decode(new_token), end="", flush=True)
    
    print("\n" + "-" * 50)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True, help="Document image path")
    parser.add_argument("--task", default="summarize",
                       choices=["summarize", "extract", "translate", "ocr"],
                       help="Analysis task")
    args = parser.parse_args()
    
    analyze_document(args.image, args.task)

Next Steps

Phi Vision Models

Explore Microsoft’s Phi vision models

Gemma Vision Models

Learn about Google’s Gemma vision models

Model Optimization

Optimize inference performance

Custom Models

Build custom vision models
