Qwen2.5-VL is an advanced vision-language model that supports multi-image understanding, dynamic resolution, and sophisticated spatial reasoning through Multimodal Rotary Position Embedding (MRoPE).

Overview

Qwen2.5-VL models are state-of-the-art vision-language models from Alibaba Cloud that excel at:
  • Multi-image understanding: Process and reason across multiple images
  • Dynamic resolution: Handle images of varying sizes and aspect ratios
  • 3D positional encoding: MRoPE for better spatial understanding
  • Long context: Support for extended context lengths
Qwen2.5-VL uses a unique 3D position encoding scheme with temporal, height, and width dimensions for superior spatial reasoning.

Architecture Details

Multimodal Rotary Position Embedding (MRoPE)

Qwen2.5-VL uses MRoPE to encode positional information in three dimensions:
# Position IDs shape: [3, batch_size, sequence_length]
# The leading axis holds the three position components:
#   index 0: temporal, index 1: height, index 2: width
This 3D encoding allows the model to:
  • Better understand spatial relationships in images
  • Handle dynamic image resolutions
  • Process multi-image inputs with proper position awareness

Model Components

The Qwen2.5-VL architecture consists of:
  1. Vision Encoder: Processes images into visual features
    • Patch embedding for image tokenization
    • Vision attention layers
    • Patch merger for feature aggregation
  2. Language Model: Core text generation model
    • Modified attention with MRoPE
    • Grouped Query Attention (GQA)
    • RMS Layer Normalization (always computed in FP32)
  3. Vision Pipeline: Multi-stage processing
    Image → Patch Embed → Vision Attention → Patch Merger → Embeddings
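The token-count effect of this pipeline can be sketched with a small helper. The patch size of 14 and the 2x2 spatial merge below match Qwen2.5-VL's published defaults, but the helper itself is illustrative, not part of the library:

```python
def vision_token_count(width, height, patch_size=14, merge_size=2):
    """Tokens after patch embedding vs. after the patch merger.
    Assumes the image is already a multiple of patch_size * merge_size."""
    grid_h = height // patch_size
    grid_w = width // patch_size
    patches = grid_h * grid_w              # after patch embedding
    merged = patches // (merge_size ** 2)  # after 2x2 spatial merge
    return patches, merged

print(vision_token_count(1008, 756))  # → (3888, 972)
```

The patch merger is what keeps high-resolution images from exploding the language model's sequence length: a 4x reduction here is a 4x reduction in attention cost downstream.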
    

Building Qwen2.5-VL Models

Qwen2.5-VL requires specific versions of dependencies. Follow the installation steps carefully.

Prerequisites

Step 1: Install ONNX Runtime

# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-genai-cuda

# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu

# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-gpu

Step 2: Install PyTorch

# Ensure PyTorch >= 2.7.0
pip install torch==2.7.0

Step 3: Install Additional Dependencies

pip install transformers
pip install pillow
pip install numpy==1.26.4  # Must be < 2.0.0
pip install --pre onnxscript

Model Export

Step 1: Download Base Model

mkdir -p qwen2.5-vl-7b-instruct
cd qwen2.5-vl-7b-instruct

# Download from Hugging Face
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
  --local-dir ./pytorch

Step 2: Export to ONNX

# Use the builder from ONNX Runtime GenAI source
from onnxruntime_genai.models.builder import build_model

build_model(
    model_name="Qwen/Qwen2.5-VL-7B-Instruct",
    input_path="./pytorch",
    output_path="./onnx-cuda",
    precision="fp16",
    execution_provider="cuda",
    cache_dir="./cache"
)

Precision Options

# FP16 for CUDA (recommended for most GPUs)
python3 -m onnxruntime_genai.models.builder \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --output ./onnx-fp16 \
  --precision fp16 \
  --execution_provider cuda

Using Qwen2.5-VL

Basic Usage

import onnxruntime_genai as og

# Load model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("image.jpg")

# Create prompt
prompt = "Describe this image in detail."

# Process inputs
inputs = processor(prompt, images=images)

# Generate response
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()

Multi-Image Processing

Qwen2.5-VL excels at reasoning across multiple images:
import onnxruntime_genai as og

# Load multiple images
images = og.Images.open(
    "product_front.jpg",
    "product_back.jpg",
    "product_side.jpg"
)

# Ask comparative question
prompt = """
Analyze these three product images:
1. What are the key features visible from different angles?
2. Are there any defects or quality issues?
3. How would you rate the product packaging?
"""

inputs = processor(prompt, images=images)

# Generate detailed analysis
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,
    temperature=0.7,
    top_p=0.9
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)

Chat Conversation with Images

import json
import onnxruntime_genai as og

# Initialize model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Chat history
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can see images."
    },
    {
        "role": "user",
        "content": "What objects can you identify in this image?"
    }
]

# Load image
images = og.Images.open("scene.jpg")

# Convert messages to prompt
messages_json = json.dumps(messages)
prompt = messages_json  # Processor handles chat template

# Process and generate
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

response = ""
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    token_text = tokenizer.decode(new_token)
    response += token_text
    print(token_text, end="", flush=True)

# Add assistant response to history
messages.append({"role": "assistant", "content": response})

Image Preprocessing

Automatic Resolution Handling

Qwen2.5-VL automatically handles various image resolutions:
# High resolution image
high_res = og.Images.open("high_resolution_4k.jpg")

# Low resolution image  
low_res = og.Images.open("thumbnail_128x128.jpg")

# Both are automatically preprocessed to optimal resolution
inputs_high = processor("Analyze this high-res image", images=high_res)
inputs_low = processor("Analyze this thumbnail", images=low_res)

Grid Dimensions

The model uses grid-based image processing with temporal, height, and width dimensions:
# Grid dimensions are automatically calculated based on image size
# For a 1024x768 (width x height) image with patch size 14:
# - Temporal: 1 (single image)
# - Height: 768 / 14 ≈ 55 patches
# - Width: 1024 / 14 ≈ 73 patches
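The arithmetic above can be wrapped in a small helper. This is a sketch only: the real processor first resizes the image so the grid is an exact multiple of the patch and merge sizes, so treat the rounding here as an approximation.

```python
def grid_dims(width, height, patch_size=14, num_frames=1):
    """Approximate (temporal, height, width) grid for one image.
    Sketch only: the real processor resizes to exact patch multiples."""
    grid_t = num_frames                  # 1 for a still image
    grid_h = round(height / patch_size)  # patch rows
    grid_w = round(width / patch_size)   # patch columns
    return grid_t, grid_h, grid_w

print(grid_dims(1024, 768))  # → (1, 55, 73)
```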

Advanced Features

Custom Position IDs

For advanced use cases, you can work with 3D position IDs:
import torch

# Example: Create custom 3D position IDs
# Shape: [3, batch_size, sequence_length]
batch_size = 1
sequence_length = 100

# Temporal, Height, Width dimensions
position_ids = torch.zeros((3, batch_size, sequence_length), dtype=torch.int64)

# For text tokens, all dimensions use same sequential IDs
for i in range(sequence_length):
    position_ids[0, 0, i] = i  # Temporal
    position_ids[1, 0, i] = i  # Height
    position_ids[2, 0, i] = i  # Width

# For image patches, dimensions vary based on spatial location
# (automatically handled by the processor)
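To make that spatial variation concrete, here is a plain-Python sketch of the (temporal, height, width) IDs for a small patch grid. The helper is hypothetical; the processor computes the real IDs internally.

```python
def image_patch_position_ids(grid_h, grid_w, t_offset=0):
    """3D position IDs for one image's patches, shape [3, grid_h * grid_w]."""
    temporal, height, width = [], [], []
    for row in range(grid_h):
        for col in range(grid_w):
            temporal.append(t_offset)  # constant for a still image
            height.append(row)         # varies down the image
            width.append(col)          # varies across the image
    return [temporal, height, width]

ids = image_patch_position_ids(2, 3)
print(ids[1])  # → [0, 0, 0, 1, 1, 1]
print(ids[2])  # → [0, 1, 2, 0, 1, 2]
```

Contrast this with the text-token loop above, where all three components carry the same sequential ID.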

Vision Pipeline Components

Access individual vision pipeline components:
# The vision pipeline consists of:
# 1. Patch Embedding: Image -> Patches
# 2. Vision Attention: Process patches with attention
# 3. Patch Merger: Merge patches to reduce sequence length

# These are automatically orchestrated by the processor
images = og.Images.open("image.jpg")
inputs = processor(prompt, images=images)

# The processor handles:
# - Patch extraction from images
# - Window-based attention reordering
# - Spatial merge operations
# - Final embedding generation

Performance Optimization

Choose the right execution provider for your hardware:
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")

# For NVIDIA GPUs
config.clear_providers()
config.append_provider("cuda")

# For DirectML (AMD/Intel)
# config.append_provider("dml")

model = og.Model(config)
Different precisions offer different trade-offs:
Precision   Speed   Memory   Accuracy    Hardware
FP32        Slow    High     Best        All
FP16        Fast    Medium   Good        Modern GPUs
BF16        Fast    Medium   Very Good   A100, H100
RMSNorm and RoPE are always computed in FP32 internally for numerical stability, regardless of model precision.
Process multiple images efficiently:
# Single batch with multiple images
images = og.Images.open("img1.jpg", "img2.jpg", "img3.jpg")
prompt = "Analyze all these images together."
inputs = processor(prompt, images=images)

# Batch size is automatically determined from inputs
generator = og.Generator(model, params)
generator.set_inputs(inputs)
For large images or long contexts:
# Monitor token count
generator.set_inputs(inputs)
total_tokens = generator.token_count()
print(f"Total input tokens: {total_tokens}")

# Adjust max_length based on available memory
available_memory_gb = 16  # Your GPU memory
if total_tokens > 2048:
    max_new_tokens = 1024  # Reduce for large inputs
else:
    max_new_tokens = 2048

params.set_search_options(max_length=total_tokens + max_new_tokens)

Implementation Details

RoPE Computation

Qwen2.5-VL uses a custom MRoPE implementation:
# Pseudo-code for MRoPE logic
# 1. Calculate dynamic RoPE caches from 3D position_ids
cos_cache, sin_cache = make_dynamic_rope_caches(position_ids)
# Shape: [3, batch_size, sequence_length, head_dim]

# 2. Flatten and reorder based on MRoPE sections
flat_cos, flat_sin = make_mrope_flattened_caches(cos_cache, sin_cache)
# Shape: [batch_size * sequence_length, head_dim / 2]

# 3. Apply rotation to Q and K
q_rotated = apply_mrope_rotation(q, flat_cos, flat_sin)
k_rotated = apply_mrope_rotation(k, flat_cos, flat_sin)

# 4. Grouped Query Attention
output = grouped_query_attention(q_rotated, k_rotated, v)
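Step 2 (flattening by MRoPE sections) can be illustrated with a toy example: the rotary frequency slots are split into contiguous chunks, and each chunk is taken from the cache of the dimension that owns it. The section sizes below are made up for illustration; the real model reads `mrope_section` from its config.

```python
def mrope_merge(caches, sections):
    """Pick each contiguous frequency chunk from its owning dimension's
    cache. caches: [temporal, height, width], each of length head_dim // 2."""
    merged, start = [], 0
    for dim, size in enumerate(sections):
        merged.extend(caches[dim][start:start + size])
        start += size
    return merged

caches = [[f"t{i}" for i in range(8)],   # temporal cache (toy labels)
          [f"h{i}" for i in range(8)],   # height cache
          [f"w{i}" for i in range(8)]]   # width cache
print(mrope_merge(caches, [2, 3, 3]))
# → ['t0', 't1', 'h2', 'h3', 'h4', 'w5', 'w6', 'w7']
```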

Layer Normalization

# Qwen2.5-VL uses RMSNorm with forced FP32 computation
# Regardless of model precision (FP16/BF16), normalization uses FP32

# This is configured automatically in the builder:
layernorm_attrs = {
    "cast": {
        "use_fp32": True,      # Compute in FP32
        "root_input": True,    # Cast input to FP32
        "skip_input": True,    # Cast skip connection to FP32
        "output_0": True,      # Cast output back to model dtype
        "output_3": True       # Cast residual output back
    }
}
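The effect of those casts can be sketched numerically. This NumPy function mirrors the use_fp32 pattern (cast up, normalize, cast back); it is a sketch, not the generated ONNX graph.

```python
import numpy as np

def rms_norm_fp32(x_fp16, weight_fp16, eps=1e-6):
    """RMSNorm computed in FP32, cast back to the model dtype."""
    x = x_fp16.astype(np.float32)        # cast input up to FP32
    w = weight_fp16.astype(np.float32)   # cast weights up to FP32
    variance = np.mean(x * x, axis=-1, keepdims=True)
    y = x / np.sqrt(variance + eps) * w  # normalize in FP32
    return y.astype(np.float16)          # cast output back down

x = np.random.randn(4, 8).astype(np.float16)
w = np.ones(8, dtype=np.float16)
out = rms_norm_fp32(x, w)
print(out.dtype)  # float16
```

Doing the sum of squares in FP16 would overflow for activations larger than ~256 in magnitude, which is why the FP32 accumulation is forced regardless of model precision.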

Troubleshooting

If you see NumPy 2.0 compatibility errors:
# Uninstall numpy 2.0+
pip uninstall -y numpy

# Install compatible version
pip install numpy==1.26.4
Qwen2.5-VL requires PyTorch >= 2.7.0:
# Check version
python -c "import torch; print(torch.__version__)"

# Upgrade if needed
pip install --upgrade torch==2.7.0
If you encounter position_ids shape errors:
# Verify position_ids shape is [3, batch_size, sequence_length]
# This is handled automatically by the processor

# If manually creating inputs, ensure correct shape:
position_ids = torch.zeros((3, 1, seq_len), dtype=torch.int64)
For very high resolution images:
# The model automatically handles resolution
# But you can pre-resize large images if needed:

from PIL import Image

img = Image.open("very_large_image.jpg")
max_size = 2048
if max(img.size) > max_size:
    img.thumbnail((max_size, max_size), Image.LANCZOS)
    img.save("resized.jpg")

images = og.Images.open("resized.jpg")

Example: Document Understanding

import onnxruntime_genai as og
import argparse

def analyze_document(image_path, task="summarize"):
    """Analyze a document image with Qwen2.5-VL."""
    
    # Load model
    config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load document
    images = og.Images.open(image_path)
    
    # Create task-specific prompt
    prompts = {
        "summarize": "Summarize the key points from this document.",
        "extract": "Extract all important information including names, dates, and numbers.",
        "translate": "Translate the text in this document to English.",
        "ocr": "Extract all text from this document."
    }
    
    prompt = prompts.get(task, prompts["summarize"])
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate with appropriate parameters
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=4096,
        temperature=0.3,  # Lower temperature for factual tasks
        top_p=0.8,
        repetition_penalty=1.1
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print(f"\nTask: {task}")
    print(f"Document: {image_path}")
    print("\nResult:")
    print("-" * 50)
    
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer.decode(new_token), end="", flush=True)
    
    print("\n" + "-" * 50)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True, help="Document image path")
    parser.add_argument("--task", default="summarize",
                       choices=["summarize", "extract", "translate", "ocr"],
                       help="Analysis task")
    args = parser.parse_args()
    
    analyze_document(args.image, args.task)

Next Steps

Phi Vision Models

Explore Microsoft’s Phi vision models

Gemma Vision Models

Learn about Google’s Gemma vision models

Model Optimization

Optimize inference performance

Custom Models

Build custom vision models
