Gemma-3 vision models are Google’s family of open-source multi-modal models that combine visual understanding with powerful language capabilities. Available in multiple sizes, Gemma-3 vision models offer flexibility for different deployment scenarios.
Model Sizes
Gemma-3 vision is available in three parameter sizes:
4B Parameters Lightweight model for resource-constrained environments
12B Parameters Balanced model for production deployments
27B Parameters Largest model for maximum performance
Features
Multi-image support: Process multiple images simultaneously
High-quality vision encoding: Advanced image understanding capabilities
Flexible precision: Support for FP32, FP16, and BF16
Efficient architecture: Optimized for both quality and performance
Open source: Fully open-source with commercial license
Prerequisites
Gemma-3 vision requires nightly builds of ONNX Runtime and ONNX Runtime GenAI, along with specific versions of several dependencies.
Install Dependencies
Install ONNX Runtime GenAI Nightly
# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime-genai-cuda
# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu
# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime-gpu
# Install nightly ONNX Runtime GenAI for DirectML
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime-genai-directml
# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-directml
# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime-directml
# Install nightly ONNX Runtime GenAI for CPU
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime-genai
# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime
# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
--pre onnxruntime
Install PyTorch and Dependencies
# Install PyTorch (>= 2.7.0 required)
pip install torch==2.7.0 torchvision
# Install additional dependencies
pip install transformers
pip install pillow
pip install requests
pip install "numpy==1.26.4"  # Must be < 2.0.0
pip install --pre onnxscript
pip install "huggingface_hub[cli]"
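After installing, a quick sanity check of the pinned versions can save debugging time later. The helper below is a minimal sketch (not part of any of the libraries above) that compares version strings against the constraints listed here, i.e. torch >= 2.7.0 and numpy < 2.0.0; `packaging.version` is the more robust choice when available.

```python
# Simplified version-constraint check for the pins above. Pre-release tags
# (e.g. "2.7.0rc1") are not handled; packaging.version handles those properly.

def parse_version(v: str) -> tuple:
    """Turn '2.7.0+cu121' into (2, 7, 0), ignoring any local build suffix."""
    base = v.split("+")[0]
    return tuple(int(p) for p in base.split(".")[:3] if p.isdigit())

def check_versions(torch_version: str, numpy_version: str) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if parse_version(torch_version) < (2, 7, 0):
        problems.append(f"torch {torch_version} is older than 2.7.0")
    if parse_version(numpy_version) >= (2, 0, 0):
        problems.append(f"numpy {numpy_version} must be < 2.0.0")
    return problems
```

For example, `check_versions("2.7.0", "1.26.4")` returns an empty list, while an outdated torch or a NumPy 2.x install produces a message per violation.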
Building Gemma-3 Vision Models
Download Base Model
Choose your desired model size and download from Hugging Face. The commands below use the 4B model (google/gemma-3-4b-it); for the 12B or 27B models, substitute google/gemma-3-12b-it or google/gemma-3-27b-it.
mkdir -p gemma3-vision-it/pytorch
cd gemma3-vision-it/pytorch
huggingface-cli download google/gemma-3-4b-it --local-dir .
Download Modified ONNX Files
cd ..
huggingface-cli download onnxruntime/Gemma-3-ONNX \
  --include "onnx/*" --local-dir .
Replace Modeling Files
Replace the original files with ONNX-compatible versions:
# Replace config (adds eager attention)
# Replace {size} with: 4b, 12b, or 27b
rm pytorch/config.json
mv onnx/{size}/config.json pytorch/
# Copy configuration helper
mv onnx/configuration_gemma3.py pytorch/
# Copy modified modeling file
mv onnx/modeling_gemma3.py pytorch/
# Move builder script
mv onnx/builder.py .
# Clean up
rm -rf onnx/
Build ONNX Models
Build INT4 quantized models for optimal performance:
CPU
python3 builder.py \
  --input ./pytorch \
  --output ./cpu \
  --precision fp32 \
  --execution_provider cpu
CUDA (FP16)
python3 builder.py \
  --input ./pytorch \
  --output ./cuda \
  --precision fp16 \
  --execution_provider cuda
CUDA (BF16)
# BF16 requires A100, H100, or newer GPUs
python3 builder.py \
  --input ./pytorch \
  --output ./cuda \
  --precision bf16 \
  --execution_provider cuda
DirectML
python3 builder.py \
  --input ./pytorch \
  --output ./dml \
  --precision fp16 \
  --execution_provider dml
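When several variants are needed, the build invocations can be scripted. The sketch below only assembles the command lines, using the same paths and flags as the examples above; actually running them still requires builder.py and the pytorch/ folder, so execution via subprocess.run is left as a comment.

```python
# Assemble builder.py command lines for several build targets. The target
# list mirrors the examples above; extend it as needed (e.g. bf16 on CUDA).
BUILD_TARGETS = [
    ("cpu", "fp32", "cpu"),
    ("cuda", "fp16", "cuda"),
    ("dml", "fp16", "dml"),
]

def builder_command(output: str, precision: str, provider: str) -> list:
    """Return the argv list for one builder.py invocation."""
    return [
        "python3", "builder.py",
        "--input", "./pytorch",
        "--output", f"./{output}",
        "--precision", precision,
        "--execution_provider", provider,
    ]

# To actually build each variant:
#   import subprocess
#   for target in BUILD_TARGETS:
#       subprocess.run(builder_command(*target), check=True)
```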
Add Configuration Files
Download the required configuration files for your chosen model size.
Using Gemma-3 Vision Models
Basic Image Understanding
import onnxruntime_genai as og

# Load model
config = og.Config("./gemma3-vision-it/cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("photo.jpg")

# Create prompt
prompt = "Describe what you see in this image."

# Process inputs
inputs = processor(prompt, images=images)

# Set generation parameters
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=2048,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)

# Generate response
generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()
Multi-Image Analysis
Gemma-3 vision can analyze multiple images simultaneously:
import onnxruntime_genai as og

# Reuses model, processor, and tokenizer from the previous example

# Load multiple images
images = og.Images.open(
    "before.jpg",
    "after.jpg"
)

# Ask comparative question
prompt = "Compare these two images and describe what has changed."

# Process and generate
inputs = processor(prompt, images=images)
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
Interactive Chat with Vision
import onnxruntime_genai as og

class GemmaVisionChat:
    def __init__(self, model_path):
        self.config = og.Config(model_path)
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)
        self.conversation_history = []

    def chat(self, message, image_path=None):
        # Load image if provided
        images = None
        if image_path:
            images = og.Images.open(image_path)

        # Add user message
        self.conversation_history.append({
            "role": "user",
            "content": message
        })

        # Process inputs (note: only the latest message is sent to the model;
        # the stored history is kept for application-side reference)
        inputs = self.processor(message, images=images)

        # Generate response
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=2048,
            temperature=0.7,
            top_p=0.9
        )

        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)

        response = ""
        print("Assistant: ", end="", flush=True)
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            token_text = self.tokenizer.decode(new_token)
            response += token_text
            print(token_text, end="", flush=True)
        print()

        # Add assistant response
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })
        return response

# Usage
chat = GemmaVisionChat("./gemma3-vision-it/cuda")
chat.chat("Hello! Can you help me analyze images?", None)
chat.chat("What's in this image?", "image1.jpg")
chat.chat("What about this one?", "image2.jpg")
chat.chat("How do they compare?")
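As written, the class above stores `conversation_history` but sends only the latest message to the model, so a follow-up like "How do they compare?" has no earlier context. One way to give the model that context, shown here as an illustrative sketch rather than the class's actual behavior, is to fold the stored turns into a single prompt string (the exact chat template Gemma-3 expects may differ from this plain-text flattening):

```python
def build_prompt(history: list, new_message: str) -> str:
    """Flatten prior turns into one prompt string.

    `history` is a list of {"role", "content"} dicts like the one kept by
    GemmaVisionChat above. Plain "User:/Assistant:" labels are used here
    purely for illustration.
    """
    lines = []
    for turn in history:
        speaker = "User" if turn["role"] == "user" else "Assistant"
        lines.append(f"{speaker}: {turn['content']}")
    lines.append(f"User: {new_message}")
    lines.append("Assistant:")
    return "\n".join(lines)
```

Inside `chat()`, one could then call `self.processor(build_prompt(self.conversation_history, message), images=images)` instead of passing `message` alone; note that long histories will eat into `max_length`.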
Advanced Usage
Batch Processing
Process multiple image-text pairs efficiently:
import onnxruntime_genai as og
from typing import List, Tuple

def batch_process_images(
    model_path: str,
    image_prompt_pairs: List[Tuple[str, str]]
) -> List[str]:
    """Process multiple image-prompt pairs.

    Args:
        model_path: Path to ONNX model
        image_prompt_pairs: List of (image_path, prompt) tuples

    Returns:
        List of generated responses
    """
    # Load model once and reuse it for every pair
    config = og.Config(model_path)
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)

    results = []
    for image_path, prompt in image_prompt_pairs:
        # Load image
        images = og.Images.open(image_path)

        # Process
        inputs = processor(prompt, images=images)

        # Generate
        params = og.GeneratorParams(model)
        params.set_search_options(max_length=2048)
        generator = og.Generator(model, params)
        generator.set_inputs(inputs)

        response = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            response += tokenizer.decode(new_token)

        results.append(response)
        print(f"Processed: {image_path}")

    return results

# Usage
pairs = [
    ("product1.jpg", "Describe this product."),
    ("product2.jpg", "What are the key features?"),
    ("product3.jpg", "Identify any defects.")
]
results = batch_process_images("./gemma3-vision-it/cuda", pairs)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(result)
Structured Output
Generate structured responses (e.g., JSON):
import onnxruntime_genai as og
import json

def extract_structured_info(image_path: str, schema: dict) -> dict:
    """Extract structured information from an image."""
    # Load model
    config = og.Config("./gemma3-vision-it/cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)

    # Load image
    images = og.Images.open(image_path)

    # Create prompt with schema
    schema_str = json.dumps(schema, indent=2)
    prompt = f"""
Analyze this image and extract information according to this JSON schema:
{schema_str}
Provide your response as valid JSON matching this schema.
"""

    # Process
    inputs = processor(prompt, images=images)

    # Generate with lower temperature for more deterministic output
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=2048,
        temperature=0.3,  # Lower for structured output
        top_p=0.9
    )

    generator = og.Generator(model, params)
    generator.set_inputs(inputs)

    response = ""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        response += tokenizer.decode(new_token)

    # Parse JSON from response
    try:
        # Extract the outermost {...} span from the response
        start = response.find('{')
        end = response.rfind('}') + 1
        json_str = response[start:end]
        return json.loads(json_str)
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Failed to parse JSON: {e}")
        return {"raw_response": response}

# Usage
schema = {
    "objects": ["list of objects"],
    "scene": "description of scene",
    "colors": ["dominant colors"],
    "text": "any text visible in image"
}
result = extract_structured_info("scene.jpg", schema)
print(json.dumps(result, indent=2))
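The extractor above returns whatever JSON it can find, which may not actually cover the requested fields. A small follow-up check, sketched here as a hypothetical helper (not part of onnxruntime-genai), can flag missing top-level keys before downstream code relies on them:

```python
def missing_schema_keys(result: dict, schema: dict) -> list:
    """Return top-level schema keys absent from the model's parsed output.

    A shallow check only; nested schemas would need recursion. Intended to
    be run on the dict returned by extract_structured_info above.
    """
    return [key for key in schema if key not in result]
```

For example, if the model returned only `objects` and `scene`, checking against the four-key schema above yields `["colors", "text"]`, and the caller can re-prompt or fall back accordingly.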
Model Size Selection
Choose the right model size for your use case:
4B - Lightweight
Best for:
Edge devices
Real-time applications
Resource-constrained environments
Quick prototyping
Performance:
Fastest inference
Lowest memory usage (~8GB GPU)
Good quality for most tasks
12B - Balanced
Best for:
Production deployments
Cloud inference
Balanced performance/quality
General-purpose vision tasks
Performance:
Moderate inference speed
Moderate memory usage (~16GB GPU)
Better quality than 4B
27B - Maximum Quality
Best for:
High-accuracy requirements
Complex reasoning tasks
Research and development
Offline batch processing
Performance:
Slower inference
High memory usage (~32GB GPU)
Best quality and reasoning
Precision Comparison
| Precision | Speed | Memory | Quality | Hardware Required |
|-----------|-------|--------|-----------|-------------------|
| FP32 | 1x | 1x | Best | Any |
| FP16 | 2x | 0.5x | Very Good | Modern GPUs |
| BF16 | 2x | 0.5x | Excellent | A100/H100 |
| INT4 | 4x | 0.25x | Good | Any |
INT4 quantization is applied automatically during the build process and offers the best trade-off between speed, memory, and quality.
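The memory column follows from simple arithmetic: weight footprint is roughly parameters times bytes per parameter (about 0.5 bytes for INT4). The sketch below makes that estimate concrete; it is a back-of-envelope figure that ignores activations, KV cache, and runtime overhead, so real usage is higher.

```python
# Back-of-envelope weight memory per precision: parameters x bytes/parameter.
# Ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

# e.g. the 4B model: ~16 GB in FP32, ~8 GB in FP16, ~2 GB in INT4
for precision in ("fp32", "fp16", "int4"):
    print(f"4B @ {precision}: ~{weight_memory_gb(4, precision):.1f} GB")
```

This is why INT4 lets the 27B model fit in roughly 13.5 GB of weights where FP32 would need over 100 GB.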
Execution Provider Tips
# CUDA
config = og.Config("./gemma3-vision-it/cuda")
config.clear_providers()
config.append_provider("cuda")
# Additional CUDA options can be set via environment variables:
# export ORT_CUDA_GEMM_OPTIONS=1
# export ORT_CUDA_CUDNN_CONV_ALGO_SEARCH=EXHAUSTIVE
model = og.Model(config)

# DirectML
config = og.Config("./gemma3-vision-it/dml")
config.clear_providers()
config.append_provider("dml")
# DirectML automatically selects best GPU
# For multi-GPU systems, set device ID:
# export ORT_DIRECTML_DEVICE_ID=0
model = og.Model(config)

# CPU
import os

# Set thread count for optimal CPU performance
num_threads = os.cpu_count()
os.environ['OMP_NUM_THREADS'] = str(num_threads)

config = og.Config("./gemma3-vision-it/cpu")
model = og.Model(config)
Example Application: Image Captioning Service
import onnxruntime_genai as og
import argparse

class ImageCaptioner:
    """Image captioning service using Gemma-3 vision."""

    def __init__(self, model_path: str, execution_provider: str = "cuda"):
        self.config = og.Config(model_path)
        # Set execution provider
        if execution_provider != "follow_config":
            self.config.clear_providers()
            if execution_provider != "cpu":
                self.config.append_provider(execution_provider)
        self.model = og.Model(self.config)
        self.processor = self.model.create_multimodal_processor()
        self.tokenizer = og.Tokenizer(self.model)

    def caption(
        self,
        image_path: str,
        style: str = "detailed",
        max_length: int = 2048
    ) -> str:
        """Generate caption for an image.

        Args:
            image_path: Path to image file
            style: Caption style ("detailed", "brief", "technical")
            max_length: Maximum caption length

        Returns:
            Generated caption
        """
        # Load image
        images = og.Images.open(image_path)

        # Create style-specific prompt
        prompts = {
            "detailed": "Provide a detailed description of this image, including objects, colors, composition, and atmosphere.",
            "brief": "Provide a brief, one-sentence caption for this image.",
            "technical": "Provide a technical analysis of this image, including camera settings, lighting, and composition techniques if visible."
        }
        prompt = prompts.get(style, prompts["detailed"])

        # Process
        inputs = self.processor(prompt, images=images)

        # Generate
        params = og.GeneratorParams(self.model)
        params.set_search_options(
            max_length=max_length,
            temperature=0.7,
            top_p=0.9
        )

        generator = og.Generator(self.model, params)
        generator.set_inputs(inputs)

        caption = ""
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            caption += self.tokenizer.decode(new_token)

        return caption.strip()

def main():
    parser = argparse.ArgumentParser(
        description="Generate image captions using Gemma-3 vision"
    )
    parser.add_argument("-m", "--model", required=True,
                        help="Path to ONNX model")
    parser.add_argument("-i", "--image", required=True,
                        help="Path to image")
    parser.add_argument("-s", "--style", default="detailed",
                        choices=["detailed", "brief", "technical"],
                        help="Caption style")
    parser.add_argument("-e", "--execution-provider", default="cuda",
                        choices=["cpu", "cuda", "dml"],
                        help="Execution provider")
    args = parser.parse_args()

    # Create captioner
    captioner = ImageCaptioner(args.model, args.execution_provider)

    # Generate caption
    print(f"Analyzing: {args.image}")
    print(f"Style: {args.style}")
    print("\nCaption:")
    print("-" * 60)
    caption = captioner.caption(args.image, args.style)
    print(caption)
    print("-" * 60)

if __name__ == "__main__":
    main()
Troubleshooting
If unsure which model size to use, check your available memory:

import psutil
import torch

# Check available GPU memory
if torch.cuda.is_available():
    gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Memory: {gpu_mem_gb:.1f} GB")
    if gpu_mem_gb < 12:
        print("Recommended: 4B model")
    elif gpu_mem_gb < 24:
        print("Recommended: 12B model")
    else:
        print("Can use: 27B model")
else:
    # CPU - check system RAM
    ram_gb = psutil.virtual_memory().total / 1e9
    print(f"System RAM: {ram_gb:.1f} GB")
    print("Recommended: 4B model for CPU")
Configuration File Errors
Ensure configuration files match your model size:

# Verify config matches model
grep "model_type" genai_config.json
# Should show: "model_type": "gemma"
# Check model path in config
grep "filename" genai_config.json
# Verify paths point to your ONNX files
Dependency Version Conflicts
# Verify all versions
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import numpy; print(f'NumPy: {numpy.__version__}')"
python -c "import onnxruntime_genai; print(f'ORT GenAI: {onnxruntime_genai.__version__}')"
# If versions are incorrect, reinstall:
pip install --upgrade --force-reinstall torch==2.7.0 numpy==1.26.4
Next Steps
Phi Vision Models Explore Microsoft’s Phi vision models
Qwen Vision Models Learn about Qwen’s advanced capabilities
Deployment Guide Deploy models to production
API Reference Explore the full API documentation