Gemma-3 vision models are Google’s family of open-source multi-modal models that combine visual understanding with powerful language capabilities. Available in multiple sizes, Gemma-3 vision models offer flexibility for different deployment scenarios.

Model Sizes

Gemma-3 vision is available in three parameter sizes:

  • 4B: Lightweight model for resource-constrained environments
  • 12B: Balanced model for production deployments
  • 27B: Largest model for maximum performance

Features

  • Multi-image support: Process multiple images simultaneously
  • High-quality vision encoding: Advanced image understanding capabilities
  • Flexible precision: Support for FP32, FP16, and BF16
  • Efficient architecture: Optimized for both quality and performance
  • Open model: Openly released weights under a license that permits commercial use

Prerequisites

Gemma-3 vision requires nightly versions of ONNX Runtime and specific dependency versions.

Install Dependencies

Step 1: Install ONNX Runtime GenAI Nightly

# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-genai-cuda

# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu

# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-gpu

Step 2: Install PyTorch and Dependencies

# Install PyTorch (>= 2.7.0 required)
pip install torch==2.7.0 torchvision

# Install additional dependencies
pip install transformers
pip install pillow
pip install requests
pip install numpy==1.26.4  # Must be < 2.0.0
pip install --pre onnxscript
pip install huggingface_hub[cli]
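The version constraints above can be checked programmatically before building. A minimal sketch, with helper names of our own choosing (pre-release suffixes such as rc1 are not handled):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Numeric comparison of dotted version strings.

    '2.10.0' >= '2.7.0' is True here, which a plain string
    comparison would get wrong.
    """
    def to_tuple(version: str):
        # Strip local suffixes like '+cu121' before comparing.
        return tuple(int(part) for part in version.split("+")[0].split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

def below_maximum(installed: str, exclusive_max: str) -> bool:
    """True if installed < exclusive_max (e.g. NumPy must be < 2.0.0)."""
    return not meets_minimum(installed, exclusive_max)
```

After a correct install, meets_minimum(torch.__version__, "2.7.0") and below_maximum(numpy.__version__, "2.0.0") should both return True.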

Building Gemma-3 Vision Models

Step 1: Download Base Model

Choose your desired model size and download from Hugging Face:
mkdir -p gemma3-vision-it/pytorch
cd gemma3-vision-it/pytorch
huggingface-cli download google/gemma-3-4b-it --local-dir .

Step 2: Download Modified ONNX Files

cd ..
huggingface-cli download onnxruntime/Gemma-3-ONNX \
  --include onnx/* --local-dir .

Step 3: Replace Modeling Files

Replace the original files with ONNX-compatible versions:
# Replace config (adds eager attention)
# Replace {size} with: 4b, 12b, or 27b
rm pytorch/config.json
mv onnx/{size}/config.json pytorch/

# Copy configuration helper
mv onnx/configuration_gemma3.py pytorch/

# Copy modified modeling file
mv onnx/modeling_gemma3.py pytorch/

# Move builder script
mv onnx/builder.py .

# Clean up
rm -rf onnx/

Step 4: Build ONNX Models

Run the builder to export the models. INT4 quantization is applied automatically during the build; the example below exports with FP32 precision for CPU:
python3 builder.py \
  --input ./pytorch \
  --output ./cpu \
  --precision fp32 \
  --execution_provider cpu
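The same builder can presumably target GPU by changing the output directory, precision, and execution provider. The flag values below (fp16, cuda) are assumptions extrapolated from the CPU example, not confirmed options; check the builder's help output for the values your version actually supports:

```shell
# Hypothetical GPU build; verify supported values with: python3 builder.py --help
python3 builder.py \
  --input ./pytorch \
  --output ./cuda \
  --precision fp16 \
  --execution_provider cuda
```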

Step 5: Add Configuration Files

Download the required configuration files for your model size, such as genai_config.json and the processor configuration, into the model's output directory.

Using Gemma-3 Vision Models

Basic Image Understanding

import onnxruntime_genai as og

# Load model
config = og.Config("./gemma3-vision-it/cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("photo.jpg")

# Create prompt
prompt = "Describe what you see in this image."

# Process inputs
inputs = processor(prompt, images=images)

# Set generation parameters
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)

# Generate response
generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()
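Gemma-3 instruction-tuned models are trained on a turn-based chat template that uses <start_of_turn>/<end_of_turn> markers, with <start_of_image> marking where each image is placed. Depending on your build, the multimodal processor may apply this template for you; if it does not, a small helper along these lines (the function name is ours) can wrap the raw prompt:

```python
def gemma_prompt(user_text: str, n_images: int = 1) -> str:
    """Wrap user text in Gemma-3's turn-based chat template, with one
    <start_of_image> placeholder per attached image."""
    image_tokens = "<start_of_image>" * n_images
    return (
        "<start_of_turn>user\n"
        f"{image_tokens}{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

If generations look untemplated, pass gemma_prompt(prompt, n_images=...) to the processor in place of the raw string.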

Multi-Image Analysis

Gemma-3 vision can analyze multiple images simultaneously:
import onnxruntime_genai as og

# Load multiple images
images = og.Images.open(
    "before.jpg",
    "after.jpg"
)

# Ask comparative question
prompt = "Compare these two images and describe what has changed."

# Process and generate
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)

Interactive Chat with Vision

import onnxruntime_genai as og

class GemmaVisionChat:
    def __init__(self, model_path):
        self.config = og.Config(model_path)
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)
        self.conversation_history = []
    
    def chat(self, message, image_path=None):
        # Load image if provided
        images = None
        if image_path:
            images = og.Images.open(image_path)
        
        # Add user message
        self.conversation_history.append({
            "role": "user",
            "content": message
        })
        
        # Process inputs (note: only the latest message is sent to the model;
        # conversation_history is kept for the application but not replayed)
        inputs = self.processor(message, images=images)
        
        # Generate response
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=2048,
            temperature=0.7,
            top_p=0.9
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        response = ""
        print("Assistant: ", end="", flush=True)
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            token_text = self.tokenizer.decode(new_token)
            response += token_text
            print(token_text, end="", flush=True)
        print()
        
        # Add assistant response
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })
        
        return response

# Usage
chat = GemmaVisionChat("./gemma3-vision-it/cuda")

chat.chat("Hello! Can you help me analyze images?", None)
chat.chat("What's in this image?", "image1.jpg")
chat.chat("What about this one?", "image2.jpg")
chat.chat("How do they compare?")

Advanced Usage

Batch Processing

Process multiple image-text pairs efficiently:
import onnxruntime_genai as og
from typing import List, Tuple

def batch_process_images(
    model_path: str,
    image_prompt_pairs: List[Tuple[str, str]]
) -> List[str]:
    """Process multiple image-prompt pairs.
    
    Args:
        model_path: Path to ONNX model
        image_prompt_pairs: List of (image_path, prompt) tuples
    
    Returns:
        List of generated responses
    """
    # Load model
    config = og.Config(model_path)
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    results = []
    
    for image_path, prompt in image_prompt_pairs:
        # Load image
        images = og.Images.open(image_path)
        
        # Process
        inputs = processor(prompt, images=images)
        
        # Generate
        params = og.GeneratorParams(model)
        params.set_search_options(max_length=2048)
        
        generator = og.Generator(model, params)
        generator.set_inputs(inputs)
        
        response = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            response += tokenizer.decode(new_token)
        
        results.append(response)
        print(f"Processed: {image_path}")
    
    return results

# Usage
pairs = [
    ("product1.jpg", "Describe this product."),
    ("product2.jpg", "What are the key features?"),
    ("product3.jpg", "Identify any defects.")
]

results = batch_process_images("./gemma3-vision-it/cuda", pairs)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(result)

Structured Output

Generate structured responses (e.g., JSON):
import onnxruntime_genai as og
import json

def extract_structured_info(image_path: str, schema: dict) -> dict:
    """Extract structured information from an image."""
    
    # Load model
    config = og.Config("./gemma3-vision-it/cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load image
    images = og.Images.open(image_path)
    
    # Create prompt with schema
    schema_str = json.dumps(schema, indent=2)
    prompt = f"""
Analyze this image and extract information according to this JSON schema:
{schema_str}

Provide your response as valid JSON matching this schema.
"""
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate with lower temperature for more deterministic output
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=2048,
        temperature=0.3,  # Lower for structured output
        top_p=0.9
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    response = ""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        response += tokenizer.decode(new_token)
    
    # Parse JSON from response
    try:
        # Extract JSON from response
        start = response.find('{')
        end = response.rfind('}') + 1
        json_str = response[start:end]
        return json.loads(json_str)
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Failed to parse JSON: {e}")
        return {"raw_response": response}

# Usage
schema = {
    "objects": ["list of objects"],
    "scene": "description of scene",
    "colors": ["dominant colors"],
    "text": "any text visible in image"
}

result = extract_structured_info("scene.jpg", schema)
print(json.dumps(result, indent=2))
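Because the model may omit fields even when asked for valid JSON, a quick post-check of the parsed result against the schema's top-level keys makes failures explicit. A minimal sketch (the helper name is ours):

```python
def missing_keys(result: dict, schema: dict) -> list:
    """Return the top-level schema keys absent from the model's parsed response."""
    return [key for key in schema if key not in result]
```

If missing_keys(result, schema) is non-empty, re-prompt the model or fall back to the raw response.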

Performance Optimization

Model Size Selection

Choose the right model size for your use case. The 4B model, for example, is:

Best for:
  • Edge devices
  • Real-time applications
  • Resource-constrained environments
  • Quick prototyping

Performance:
  • Fastest inference
  • Lowest memory usage (~8GB GPU)
  • Good quality for most tasks

Precision Comparison

Precision | Speed | Memory | Quality   | Hardware Required
FP32      | 1x    | 1x     | Best      | Any
FP16      | 2x    | 0.5x   | Very Good | Modern GPUs
BF16      | 2x    | 0.5x   | Excellent | A100/H100
INT4      | 4x    | 0.25x  | Good      | Any
INT4 quantization is applied automatically during the build process and offers the best trade-off between speed, memory, and quality.

Execution Provider Tips

CUDA:

config = og.Config("./gemma3-vision-it/cuda")
config.clear_providers()
config.append_provider("cuda")

# Additional CUDA options can be set via environment variables:
# export ORT_CUDA_GEMM_OPTIONS=1
# export ORT_CUDA_CUDNN_CONV_ALGO_SEARCH=EXHAUSTIVE

model = og.Model(config)

DirectML:

config = og.Config("./gemma3-vision-it/dml")
config.clear_providers()
config.append_provider("dml")

# DirectML automatically selects the best GPU
# For multi-GPU systems, set the device ID:
# export ORT_DIRECTML_DEVICE_ID=0

model = og.Model(config)

CPU:

import os

# Set thread count for optimal CPU performance
num_threads = os.cpu_count()
os.environ['OMP_NUM_THREADS'] = str(num_threads)

config = og.Config("./gemma3-vision-it/cpu")
model = og.Model(config)
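Rather than hard-coding one provider per script, the choice can be centralized in a small helper. A minimal sketch (the function and the short provider names are ours; map them to whatever providers your installation actually has available):

```python
def pick_provider(available, preference=("cuda", "dml", "cpu")):
    """Return the first preferred execution provider that is available."""
    for provider in preference:
        if provider in available:
            return provider
    return "cpu"  # ONNX Runtime always ships a CPU fallback
```

The returned name can then be passed to config.append_provider() as in the snippets above.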

Example Application: Image Captioning Service

import onnxruntime_genai as og
from typing import Optional
import argparse

class ImageCaptioner:
    """Image captioning service using Gemma-3 vision."""
    
    def __init__(self, model_path: str, execution_provider: str = "cuda"):
        self.config = og.Config(model_path)
        
        # Set execution provider
        if execution_provider != "follow_config":
            self.config.clear_providers()
            if execution_provider != "cpu":
                self.config.append_provider(execution_provider)
        
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)
    
    def caption(
        self,
        image_path: str,
        style: str = "detailed",
        max_length: int = 2048
    ) -> str:
        """Generate caption for an image.
        
        Args:
            image_path: Path to image file
            style: Caption style ("detailed", "brief", "technical")
            max_length: Maximum caption length
        
        Returns:
            Generated caption
        """
        # Load image
        images = og.Images.open(image_path)
        
        # Create style-specific prompt
        prompts = {
            "detailed": "Provide a detailed description of this image, including objects, colors, composition, and atmosphere.",
            "brief": "Provide a brief, one-sentence caption for this image.",
            "technical": "Provide a technical analysis of this image, including camera settings, lighting, and composition techniques if visible."
        }
        
        prompt = prompts.get(style, prompts["detailed"])
        
        # Process
        inputs = self.processor(prompt, images=images)
        
        # Generate
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=max_length,
            temperature=0.7,
            top_p=0.9
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        caption = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            caption += self.tokenizer.decode(new_token)
        
        return caption.strip()

def main():
    parser = argparse.ArgumentParser(
        description="Generate image captions using Gemma-3 vision"
    )
    parser.add_argument("-m", "--model", required=True,
                       help="Path to ONNX model")
    parser.add_argument("-i", "--image", required=True,
                       help="Path to image")
    parser.add_argument("-s", "--style", default="detailed",
                       choices=["detailed", "brief", "technical"],
                       help="Caption style")
    parser.add_argument("-e", "--execution-provider", default="cuda",
                       choices=["cpu", "cuda", "dml"],
                       help="Execution provider")
    
    args = parser.parse_args()
    
    # Create captioner
    captioner = ImageCaptioner(args.model, args.execution_provider)
    
    # Generate caption
    print(f"Analyzing: {args.image}")
    print(f"Style: {args.style}")
    print("\nCaption:")
    print("-" * 60)
    caption = captioner.caption(args.image, args.style)
    print(caption)
    print("-" * 60)

if __name__ == "__main__":
    main()

Troubleshooting

If you are unsure which model size to use, check your available memory:
import psutil
import torch

# Check available GPU memory
if torch.cuda.is_available():
    gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Memory: {gpu_mem_gb:.1f} GB")
    
    if gpu_mem_gb < 12:
        print("Recommended: 4B model")
    elif gpu_mem_gb < 24:
        print("Recommended: 12B model")
    else:
        print("Can use: 27B model")
else:
    # CPU - check system RAM
    ram_gb = psutil.virtual_memory().total / 1e9
    print(f"System RAM: {ram_gb:.1f} GB")
    print("Recommended: 4B model for CPU")
Ensure configuration files match your model size:
# Verify config matches model
cat genai_config.json | grep "model_type"
# Should show a Gemma model type

# Check model path in config
cat genai_config.json | grep "filename"
# Verify paths point to your ONNX files

Verify that dependency versions match the prerequisites:
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import numpy; print(f'NumPy: {numpy.__version__}')"
python -c "import onnxruntime_genai; print(f'ORT GenAI: {onnxruntime_genai.__version__}')"

# If versions are incorrect, reinstall:
pip install --upgrade --force-reinstall torch==2.7.0 numpy==1.26.4

Next Steps

Phi Vision Models

Explore Microsoft’s Phi vision models

Qwen Vision Models

Learn about Qwen’s advanced capabilities

Deployment Guide

Deploy models to production

API Reference

Explore the full API documentation
