Gemma-3 vision models are Google’s family of open-source multi-modal models that combine visual understanding with powerful language capabilities. Available in multiple sizes, Gemma-3 vision models offer flexibility for different deployment scenarios.

Model Sizes

Gemma-3 vision is available in three parameter sizes:

  • 4B: Lightweight model for resource-constrained environments
  • 12B: Balanced model for production deployments
  • 27B: Largest model for maximum performance

Features

  • Multi-image support: Process multiple images simultaneously
  • High-quality vision encoding: Advanced image understanding capabilities
  • Flexible precision: Support for FP32, FP16, and BF16
  • Efficient architecture: Optimized for both quality and performance
  • Open model: Openly released weights under a license that permits commercial use

Prerequisites

Gemma-3 vision requires nightly versions of ONNX Runtime and specific dependency versions.

Install Dependencies

Step 1: Install ONNX Runtime GenAI Nightly

# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-genai-cuda

# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu

# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-gpu

Step 2: Install PyTorch and Dependencies

# Install PyTorch (>= 2.7.0 required)
pip install torch==2.7.0 torchvision

# Install additional dependencies
pip install transformers
pip install pillow
pip install requests
pip install numpy==1.26.4  # Must be < 2.0.0
pip install --pre onnxscript
pip install huggingface_hub[cli]
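The version constraints above can be checked programmatically before building. A minimal sketch, with helper names of our own choosing (pre-release suffixes such as rc1 are not handled):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Numeric comparison of dotted version strings.

    '2.10.0' >= '2.7.0' is True here, which a plain string
    comparison would get wrong.
    """
    def to_tuple(version: str):
        # Strip local suffixes like '+cu121' before comparing.
        return tuple(int(part) for part in version.split("+")[0].split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

def below_maximum(installed: str, exclusive_max: str) -> bool:
    """True if installed < exclusive_max (e.g. NumPy must be < 2.0.0)."""
    return not meets_minimum(installed, exclusive_max)
```

After a correct install, meets_minimum(torch.__version__, "2.7.0") and below_maximum(numpy.__version__, "2.0.0") should both return True.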

Building Gemma-3 Vision Models

Step 1: Download Base Model

Choose your desired model size and download from Hugging Face:
mkdir -p gemma3-vision-it/pytorch
cd gemma3-vision-it/pytorch
huggingface-cli download google/gemma-3-4b-it --local-dir .

Step 2: Download Modified ONNX Files

cd ..
huggingface-cli download onnxruntime/Gemma-3-ONNX \
  --include onnx/* --local-dir .

Step 3: Replace Modeling Files

Replace the original files with ONNX-compatible versions:
# Replace config (adds eager attention)
# Replace {size} with: 4b, 12b, or 27b
rm pytorch/config.json
mv onnx/{size}/config.json pytorch/

# Copy configuration helper
mv onnx/configuration_gemma3.py pytorch/

# Copy modified modeling file
mv onnx/modeling_gemma3.py pytorch/

# Move builder script
mv onnx/builder.py .

# Clean up
rm -rf onnx/

Step 4: Build ONNX Models

Run the builder to export the models. INT4 quantization is applied automatically during the build; the example below exports with FP32 precision for CPU:
python3 builder.py \
  --input ./pytorch \
  --output ./cpu \
  --precision fp32 \
  --execution_provider cpu
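The same builder can presumably target GPU by changing the output directory, precision, and execution provider. The flag values below (fp16, cuda) are assumptions extrapolated from the CPU example, not confirmed options; check the builder's help output for the values your version actually supports:

```shell
# Hypothetical GPU build; verify supported values with: python3 builder.py --help
python3 builder.py \
  --input ./pytorch \
  --output ./cuda \
  --precision fp16 \
  --execution_provider cuda
```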

Step 5: Add Configuration Files

Download the required configuration files for your model size, such as genai_config.json and the processor configuration, into the model's output directory.

Using Gemma-3 Vision Models

Basic Image Understanding

import onnxruntime_genai as og

# Load model
config = og.Config("./gemma3-vision-it/cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("photo.jpg")

# Create prompt
prompt = "Describe what you see in this image."

# Process inputs
inputs = processor(prompt, images=images)

# Set generation parameters
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)

# Generate response
generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()
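Gemma-3 instruction-tuned models are trained on a turn-based chat template that uses <start_of_turn>/<end_of_turn> markers, with <start_of_image> marking where each image is placed. Depending on your build, the multimodal processor may apply this template for you; if it does not, a small helper along these lines (the function name is ours) can wrap the raw prompt:

```python
def gemma_prompt(user_text: str, n_images: int = 1) -> str:
    """Wrap user text in Gemma-3's turn-based chat template, with one
    <start_of_image> placeholder per attached image."""
    image_tokens = "<start_of_image>" * n_images
    return (
        "<start_of_turn>user\n"
        f"{image_tokens}{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

If generations look untemplated, pass gemma_prompt(prompt, n_images=...) to the processor in place of the raw string.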

Multi-Image Analysis

Gemma-3 vision can analyze multiple images simultaneously:
import onnxruntime_genai as og

# Load multiple images
images = og.Images.open(
    "before.jpg",
    "after.jpg"
)

# Ask comparative question
prompt = "Compare these two images and describe what has changed."

# Process and generate
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)

Interactive Chat with Vision

import onnxruntime_genai as og

class GemmaVisionChat:
    def __init__(self, model_path):
        self.config = og.Config(model_path)
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)
        self.conversation_history = []
    
    def chat(self, message, image_path=None):
        # Load image if provided
        images = None
        if image_path:
            images = og.Images.open(image_path)
        
        # Add user message
        self.conversation_history.append({
            "role": "user",
            "content": message
        })
        
        # Process inputs (note: only the latest message is sent to the model;
        # conversation_history is kept for the application but not replayed)
        inputs = self.processor(message, images=images)
        
        # Generate response
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=2048,
            temperature=0.7,
            top_p=0.9
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        response = ""
        print("Assistant: ", end="", flush=True)
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            token_text = self.tokenizer.decode(new_token)
            response += token_text
            print(token_text, end="", flush=True)
        print()
        
        # Add assistant response
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })
        
        return response

# Usage
chat = GemmaVisionChat("./gemma3-vision-it/cuda")

chat.chat("Hello! Can you help me analyze images?", None)
chat.chat("What's in this image?", "image1.jpg")
chat.chat("What about this one?", "image2.jpg")
chat.chat("How do they compare?")

Advanced Usage

Batch Processing

Process multiple image-text pairs efficiently:
import onnxruntime_genai as og
from typing import List, Tuple

def batch_process_images(
    model_path: str,
    image_prompt_pairs: List[Tuple[str, str]]
) -> List[str]:
    """Process multiple image-prompt pairs.
    
    Args:
        model_path: Path to ONNX model
        image_prompt_pairs: List of (image_path, prompt) tuples
    
    Returns:
        List of generated responses
    """
    # Load model
    config = og.Config(model_path)
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    results = []
    
    for image_path, prompt in image_prompt_pairs:
        # Load image
        images = og.Images.open(image_path)
        
        # Process
        inputs = processor(prompt, images=images)
        
        # Generate
        params = og.GeneratorParams(model)
        params.set_search_options(max_length=2048)
        
        generator = og.Generator(model, params)
        generator.set_inputs(inputs)
        
        response = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            response += tokenizer.decode(new_token)
        
        results.append(response)
        print(f"Processed: {image_path}")
    
    return results

# Usage
pairs = [
    ("product1.jpg", "Describe this product."),
    ("product2.jpg", "What are the key features?"),
    ("product3.jpg", "Identify any defects.")
]

results = batch_process_images("./gemma3-vision-it/cuda", pairs)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(result)

Structured Output

Generate structured responses (e.g., JSON):
import onnxruntime_genai as og
import json

def extract_structured_info(image_path: str, schema: dict) -> dict:
    """Extract structured information from an image."""
    
    # Load model
    config = og.Config("./gemma3-vision-it/cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load image
    images = og.Images.open(image_path)
    
    # Create prompt with schema
    schema_str = json.dumps(schema, indent=2)
    prompt = f"""
Analyze this image and extract information according to this JSON schema:
{schema_str}

Provide your response as valid JSON matching this schema.
"""
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate with lower temperature for more deterministic output
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=2048,
        temperature=0.3,  # Lower for structured output
        top_p=0.9
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    response = ""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        response += tokenizer.decode(new_token)
    
    # Parse JSON from response
    try:
        # Extract JSON from response
        start = response.find('{')
        end = response.rfind('}') + 1
        json_str = response[start:end]
        return json.loads(json_str)
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Failed to parse JSON: {e}")
        return {"raw_response": response}

# Usage
schema = {
    "objects": ["list of objects"],
    "scene": "description of scene",
    "colors": ["dominant colors"],
    "text": "any text visible in image"
}

result = extract_structured_info("scene.jpg", schema)
print(json.dumps(result, indent=2))
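Because the model may omit fields even when asked for valid JSON, a quick post-check of the parsed result against the schema's top-level keys makes failures explicit. A minimal sketch (the helper name is ours):

```python
def missing_keys(result: dict, schema: dict) -> list:
    """Return the top-level schema keys absent from the model's parsed response."""
    return [key for key in schema if key not in result]
```

If missing_keys(result, schema) is non-empty, re-prompt the model or fall back to the raw response.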

Performance Optimization

Model Size Selection

Choose the right model size for your use case. The 4B model, for example, is:

Best for:
  • Edge devices
  • Real-time applications
  • Resource-constrained environments
  • Quick prototyping

Performance:
  • Fastest inference
  • Lowest memory usage (~8GB GPU)
  • Good quality for most tasks

Precision Comparison

Precision | Speed | Memory | Quality   | Hardware Required
FP32      | 1x    | 1x     | Best      | Any
FP16      | 2x    | 0.5x   | Very Good | Modern GPUs
BF16      | 2x    | 0.5x   | Excellent | A100/H100
INT4      | 4x    | 0.25x  | Good      | Any
INT4 quantization is applied automatically during the build process and offers the best trade-off between speed, memory, and quality.

Execution Provider Tips

CUDA:

config = og.Config("./gemma3-vision-it/cuda")
config.clear_providers()
config.append_provider("cuda")

# Additional CUDA options can be set via environment variables:
# export ORT_CUDA_GEMM_OPTIONS=1
# export ORT_CUDA_CUDNN_CONV_ALGO_SEARCH=EXHAUSTIVE

model = og.Model(config)

DirectML:

config = og.Config("./gemma3-vision-it/dml")
config.clear_providers()
config.append_provider("dml")

# DirectML automatically selects the best GPU
# For multi-GPU systems, set the device ID:
# export ORT_DIRECTML_DEVICE_ID=0

model = og.Model(config)

CPU:

import os

# Set thread count for optimal CPU performance
num_threads = os.cpu_count()
os.environ['OMP_NUM_THREADS'] = str(num_threads)

config = og.Config("./gemma3-vision-it/cpu")
model = og.Model(config)
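Rather than hard-coding one provider per script, the choice can be centralized in a small helper. A minimal sketch (the function and the short provider names are ours; map them to whatever providers your installation actually has available):

```python
def pick_provider(available, preference=("cuda", "dml", "cpu")):
    """Return the first preferred execution provider that is available."""
    for provider in preference:
        if provider in available:
            return provider
    return "cpu"  # ONNX Runtime always ships a CPU fallback
```

The returned name can then be passed to config.append_provider() as in the snippets above.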

Example Application: Image Captioning Service

import onnxruntime_genai as og
from typing import Optional
import argparse

class ImageCaptioner:
    """Image captioning service using Gemma-3 vision."""
    
    def __init__(self, model_path: str, execution_provider: str = "cuda"):
        self.config = og.Config(model_path)
        
        # Set execution provider
        if execution_provider != "follow_config":
            self.config.clear_providers()
            if execution_provider != "cpu":
                self.config.append_provider(execution_provider)
        
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)
    
    def caption(
        self,
        image_path: str,
        style: str = "detailed",
        max_length: int = 2048
    ) -> str:
        """Generate caption for an image.
        
        Args:
            image_path: Path to image file
            style: Caption style ("detailed", "brief", "technical")
            max_length: Maximum caption length
        
        Returns:
            Generated caption
        """
        # Load image
        images = og.Images.open(image_path)
        
        # Create style-specific prompt
        prompts = {
            "detailed": "Provide a detailed description of this image, including objects, colors, composition, and atmosphere.",
            "brief": "Provide a brief, one-sentence caption for this image.",
            "technical": "Provide a technical analysis of this image, including camera settings, lighting, and composition techniques if visible."
        }
        
        prompt = prompts.get(style, prompts["detailed"])
        
        # Process
        inputs = self.processor(prompt, images=images)
        
        # Generate
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=max_length,
            temperature=0.7,
            top_p=0.9
        )
        
        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)
        
        caption = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            caption += self.tokenizer.decode(new_token)
        
        return caption.strip()

def main():
    parser = argparse.ArgumentParser(
        description="Generate image captions using Gemma-3 vision"
    )
    parser.add_argument("-m", "--model", required=True,
                       help="Path to ONNX model")
    parser.add_argument("-i", "--image", required=True,
                       help="Path to image")
    parser.add_argument("-s", "--style", default="detailed",
                       choices=["detailed", "brief", "technical"],
                       help="Caption style")
    parser.add_argument("-e", "--execution-provider", default="cuda",
                       choices=["cpu", "cuda", "dml"],
                       help="Execution provider")
    
    args = parser.parse_args()
    
    # Create captioner
    captioner = ImageCaptioner(args.model, args.execution_provider)
    
    # Generate caption
    print(f"Analyzing: {args.image}")
    print(f"Style: {args.style}")
    print("\nCaption:")
    print("-" * 60)
    caption = captioner.caption(args.image, args.style)
    print(caption)
    print("-" * 60)

if __name__ == "__main__":
    main()

Troubleshooting

If you are unsure which model size to use, check your available memory:
import psutil
import torch

# Check available GPU memory
if torch.cuda.is_available():
    gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Memory: {gpu_mem_gb:.1f} GB")
    
    if gpu_mem_gb < 12:
        print("Recommended: 4B model")
    elif gpu_mem_gb < 24:
        print("Recommended: 12B model")
    else:
        print("Can use: 27B model")
else:
    # CPU - check system RAM
    ram_gb = psutil.virtual_memory().total / 1e9
    print(f"System RAM: {ram_gb:.1f} GB")
    print("Recommended: 4B model for CPU")
Ensure configuration files match your model size:
# Verify config matches model
cat genai_config.json | grep "model_type"
# Should show a Gemma model type

# Check model path in config
cat genai_config.json | grep "filename"
# Verify paths point to your ONNX files

Verify that dependency versions match the prerequisites:
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import numpy; print(f'NumPy: {numpy.__version__}')"
python -c "import onnxruntime_genai; print(f'ORT GenAI: {onnxruntime_genai.__version__}')"

# If versions are incorrect, reinstall:
pip install --upgrade --force-reinstall torch==2.7.0 numpy==1.26.4

Next Steps

Phi Vision Models

Explore Microsoft’s Phi vision models

Qwen Vision Models

Learn about Qwen’s advanced capabilities

Deployment Guide

Deploy models to production

API Reference

Explore the full API documentation
