Use Qwen’s advanced vision-language models with ONNX Runtime GenAI
Qwen2.5-VL is an advanced vision-language model that supports multi-image understanding, dynamic resolution, and sophisticated spatial reasoning through Multimodal Rotary Position Embedding (MRoPE).
Export the model with the ONNX Runtime GenAI model builder:

```bash
# FP16 for CUDA (recommended for most GPUs)
python3 -m onnxruntime_genai.models.builder \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --output ./onnx-fp16 \
  --precision fp16 \
  --execution_provider cuda
```
Qwen2.5-VL excels at reasoning across multiple images:
```python
import onnxruntime_genai as og

# Load the exported model and create the multimodal processor
model = og.Model("./onnx-fp16")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Load multiple images
images = og.Images.open(
    "product_front.jpg",
    "product_back.jpg",
    "product_side.jpg"
)

# Ask a comparative question
prompt = """Analyze these three product images:
1. What are the key features visible from different angles?
2. Are there any defects or quality issues?
3. How would you rate the product packaging?"""

inputs = processor(prompt, images=images)

# Generate a detailed analysis
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,
    temperature=0.7,
    top_p=0.9
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```
Qwen2.5-VL automatically handles various image resolutions:
```python
# High-resolution image
high_res = og.Images.open("high_resolution_4k.jpg")

# Low-resolution image
low_res = og.Images.open("thumbnail_128x128.jpg")

# Both are automatically preprocessed to an optimal resolution
inputs_high = processor("Analyze this high-res image", images=high_res)
inputs_low = processor("Analyze this thumbnail", images=low_res)
```
For advanced use cases, you can work with 3D position IDs:
```python
import torch

# Example: create custom 3D position IDs
# Shape: [3, batch_size, sequence_length]
batch_size = 1
sequence_length = 100

# Temporal, height, and width dimensions
position_ids = torch.zeros((3, batch_size, sequence_length), dtype=torch.int64)

# For text tokens, all three dimensions use the same sequential IDs
for i in range(sequence_length):
    position_ids[0, 0, i] = i  # Temporal
    position_ids[1, 0, i] = i  # Height
    position_ids[2, 0, i] = i  # Width

# For image patches, the dimensions vary with spatial location
# (handled automatically by the processor)
```
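To make the image-patch case concrete, here is an illustrative NumPy sketch (not the processor's actual implementation) of MRoPE-style IDs for a single h×w patch grid: the temporal ID stays constant for a still image, while the height and width IDs follow each patch's row and column.

```python
import numpy as np

def image_patch_position_ids(start, h, w):
    # Illustrative only: the real layout is produced by the processor.
    # Temporal ID is constant for a still image; height/width IDs
    # follow each patch's row and column. Shape: [3, h * w].
    temporal = np.full(h * w, start, dtype=np.int64)
    height = start + np.repeat(np.arange(h, dtype=np.int64), w)
    width = start + np.tile(np.arange(w, dtype=np.int64), h)
    return np.stack([temporal, height, width])

# A 2x3 patch grid starting at position 0:
ids = image_patch_position_ids(0, 2, 3)
# ids[1] (height): [0, 0, 0, 1, 1, 1]
# ids[2] (width):  [0, 1, 2, 0, 1, 2]
```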
Choose the right execution provider for your hardware:
```python
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")

# For NVIDIA GPUs
config.clear_providers()
config.append_provider("cuda")

# For DirectML (AMD/Intel GPUs)
# config.append_provider("dml")

model = og.Model(config)
```
Precision Trade-offs
Different precisions offer different trade-offs:
| Precision | Speed | Memory | Accuracy  | Hardware    |
|-----------|-------|--------|-----------|-------------|
| FP32      | Slow  | High   | Best      | All         |
| FP16      | Fast  | Medium | Good      | Modern GPUs |
| BF16      | Fast  | Medium | Very Good | A100, H100  |
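A quick way to see the memory column in practice is to estimate weight storage alone (KV cache and activations come on top); the 7B parameter count used here is approximate:

```python
def weight_memory_gib(num_params, bytes_per_param):
    # Weights only; excludes KV cache, activations, and runtime overhead.
    return num_params * bytes_per_param / (1024 ** 3)

PARAMS_7B = 7_000_000_000  # approximate parameter count
for name, nbytes in [("FP32", 4), ("FP16", 2), ("BF16", 2)]:
    print(f"{name}: ~{weight_memory_gib(PARAMS_7B, nbytes):.0f} GiB")
# FP32: ~26 GiB; FP16/BF16: ~13 GiB
```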
LayerNorm and RoPE are always computed in FP32 internally for numerical stability, regardless of model precision.
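The effect can be sketched with a plain NumPy RMSNorm (illustrative, not the runtime's actual kernel): the mean-square reduction runs in FP32 even when inputs and outputs are FP16.

```python
import numpy as np

def rms_norm_fp32(x, weight, eps=1e-6):
    # Upcast to FP32 for the mean-square reduction, then cast the
    # result back to the original (e.g. FP16) dtype.
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms * weight.astype(np.float32)).astype(x.dtype)

x = np.full((1, 4), 2.0, dtype=np.float16)
w = np.ones(4, dtype=np.float16)
y = rms_norm_fp32(x, w)  # stays FP16 outside, FP32 inside
```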
Batch Processing
Process multiple images efficiently:
```python
# A single batch with multiple images
images = og.Images.open("img1.jpg", "img2.jpg", "img3.jpg")
prompt = "Analyze all these images together."
inputs = processor(prompt, images=images)

# Batch size is determined automatically from the inputs
generator = og.Generator(model, params)
generator.set_inputs(inputs)
```
Memory Management
For large images or long contexts:
```python
# Monitor the input token count
generator.set_inputs(inputs)
total_tokens = generator.token_count()
print(f"Total input tokens: {total_tokens}")

# Adjust max_length based on available memory
available_memory_gb = 16  # Your GPU memory
if total_tokens > 2048:
    max_new_tokens = 1024  # Reduce for large inputs
else:
    max_new_tokens = 2048

# Applies to the next generator created from these params
params.set_search_options(max_length=total_tokens + max_new_tokens)
```
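To turn a token count into a memory estimate, you can approximate KV-cache size. The layer/head figures below are assumptions roughly matching a Qwen2.5-7B-class decoder with grouped-query attention, not values read from the exported model:

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_value=2):
    # 2x for keys and values; bytes_per_value=2 for FP16/BF16.
    # Assumed architecture numbers; check your model's config.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

tokens = 4096
print(f"KV cache @ {tokens} tokens: "
      f"{kv_cache_bytes(tokens) / (1024 ** 2):.0f} MiB")
# With these assumptions: 224 MiB at 4096 tokens
```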
```python
# Qwen2.5-VL uses RMSNorm with forced FP32 computation.
# Regardless of model precision (FP16/BF16), normalization runs in FP32.
# The builder configures this automatically:
layernorm_attrs = {
    "cast": {
        "use_fp32": True,     # Compute in FP32
        "root_input": True,   # Cast the input to FP32
        "skip_input": True,   # Cast the skip connection to FP32
        "output_0": True,     # Cast the output back to the model dtype
        "output_3": True      # Cast the residual output back
    }
}
```
```python
# Verify that position_ids has shape [3, batch_size, sequence_length].
# The processor handles this automatically; if you create inputs
# manually, make sure the shape is correct:
position_ids = torch.zeros((3, 1, seq_len), dtype=torch.int64)
```
OOM with High Resolution Images
For very high resolution images:
```python
# The model handles resolution automatically, but you can
# pre-resize very large images if needed:
from PIL import Image

img = Image.open("very_large_image.jpg")
max_size = 2048
if max(img.size) > max_size:
    img.thumbnail((max_size, max_size), Image.LANCZOS)
    img.save("resized.jpg")

images = og.Images.open("resized.jpg")
```
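As a rough guide to why resolution matters, Qwen2.5-VL's visual token count scales with image area. Assuming the common 14-pixel patch with 2×2 spatial merging (so roughly one token per 28×28 pixel block, before any rescaling the processor applies), a sketch of the estimate:

```python
import math

def approx_visual_tokens(width, height, block=28):
    # One token per 28x28 pixel block after the 2x2 patch merge.
    # A sketch only: the processor may also rescale the image first.
    return math.ceil(width / block) * math.ceil(height / block)

print(approx_visual_tokens(3840, 2160))  # 4K frame: thousands of tokens
print(approx_visual_tokens(128, 128))    # thumbnail: a couple dozen
```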