
Overview

Model Optimizer intelligently routes user queries across the Gemini model family (Pro, Flash, etc.) based on query complexity and your cost-quality preference, enabling significant cost savings while maintaining response quality.
Model Optimizer analyzes each query and automatically selects the most appropriate Gemini model, eliminating the need for manual model selection.

How It Works

Model Optimizer evaluates each query in real time and routes it to the optimal model:
  • Simple queries → Gemini Flash (fast, cost-effective)
  • Complex reasoning → Gemini Pro (high quality)
  • Balanced needs → Intelligent routing based on your preference
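The routing logic itself runs server-side and is not exposed by the API, but conceptually it behaves like the sketch below. This is purely illustrative: `estimate_complexity`, the thresholds, and the model labels are made-up stand-ins, not part of the SDK.

```python
# Illustrative only: the real Model Optimizer routing is internal to the
# service; this toy sketch just mirrors the bullet list above.

def estimate_complexity(query: str) -> float:
    """Toy heuristic: longer, analysis-heavy queries score as more complex."""
    score = len(query.split()) / 50
    score += sum(query.count(w) for w in ("analyze", "explain", "implications")) * 0.3
    return min(score, 1.0)

def route_query(query: str, preference: str) -> str:
    """Pick a (hypothetical) model label from complexity and preference."""
    complexity = estimate_complexity(query)
    if preference == "PRIORITIZE_QUALITY":
        threshold = 0.2   # escalate to the stronger model aggressively
    elif preference == "PRIORITIZE_COST":
        threshold = 0.8   # stay on the cheap model unless clearly complex
    else:  # BALANCED
        threshold = 0.5
    return "gemini-pro" if complexity > threshold else "gemini-flash"

print(route_query("What is 2+2?", "BALANCED"))  # gemini-flash
print(route_query(
    "Analyze the implications of quantum computing on cryptography in detail",
    "PRIORITIZE_QUALITY",
))  # gemini-pro
```

The takeaway is only the shape of the decision: the same query can land on different models depending on the preference you configure.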

Getting Started

Installation

pip install --upgrade google-genai

Basic Setup

Step 1: Import Libraries

import os
from google import genai
from google.genai.types import (
    FeatureSelectionPreference,
    GenerateContentConfig,
    ModelSelectionConfig,
)

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
LOCATION = "us-central1"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION
)

Step 2: Configure Model Selection

Choose your routing preference:
# Available preferences:
# - PRIORITIZE_QUALITY: Best possible responses
# - BALANCED: Balance quality and cost  
# - PRIORITIZE_COST: Minimize costs

model_selection_config = ModelSelectionConfig(
    feature_selection_preference=FeatureSelectionPreference.BALANCED
)

Step 3: Use Model Optimizer

MODEL_ID = "model-optimizer-exp-04-09"

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Explain quantum entanglement",
    config=GenerateContentConfig(
        model_selection_config=model_selection_config,
    ),
)

print(response.text)

Routing Strategies

Prioritize Quality

Routes to higher-capability models for best responses. Use for critical applications.

Balanced

Optimizes for both quality and cost. Recommended for most use cases.

Prioritize Cost

Minimizes costs while maintaining acceptable quality. Use for high-volume applications.

Comparing Routing Strategies

test_queries = [
    "What is 2+2?",
    "Explain the implications of quantum computing on cryptography",
    "Translate 'hello' to Spanish",
    "Analyze the economic factors that led to the 2008 financial crisis",
]

strategies = [
    FeatureSelectionPreference.PRIORITIZE_QUALITY,
    FeatureSelectionPreference.BALANCED,
    FeatureSelectionPreference.PRIORITIZE_COST,
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}\n")
    
    for strategy in strategies:
        config = ModelSelectionConfig(
            feature_selection_preference=strategy
        )
        
        response = client.models.generate_content(
            model="model-optimizer-exp-04-09",
            contents=query,
            config=GenerateContentConfig(
                model_selection_config=config,
            ),
        )
        
        print(f"Strategy: {strategy.name}")
        print(f"Response length: {len(response.text)} chars")
        print(f"---\n")

Advanced Features

Streaming Responses

For real-time applications, use streaming with Model Optimizer:
def generate_content_streaming(prompt: str, preference: FeatureSelectionPreference):
    """Generate content with streaming and display progressively."""
    from IPython.display import display, Markdown
    
    output_text = ""
    markdown_display = display(Markdown(output_text), display_id=True)
    
    model_config = ModelSelectionConfig(
        feature_selection_preference=preference
    )
    
    for chunk in client.models.generate_content_stream(
        model="model-optimizer-exp-04-09",
        contents=prompt,
        config=GenerateContentConfig(
            model_selection_config=model_config,
        ),
    ):
        output_text += chunk.text or ""  # chunk.text can be None on some chunks
        markdown_display.update(Markdown(output_text))
    
    return output_text

# Use streaming
result = generate_content_streaming(
    "Write a detailed explanation of machine learning",
    FeatureSelectionPreference.BALANCED
)

Multi-Turn Conversations

You can pass multiple messages in a single request, and Model Optimizer routes based on the full conversation context:
# Multi-turn conversation
conversation = [
    "What is x multiplied by 2?",
    "x = 42",
]

response = client.models.generate_content(
    model="model-optimizer-exp-04-09",
    contents=conversation,
    config=GenerateContentConfig(
        model_selection_config=ModelSelectionConfig(
            feature_selection_preference=FeatureSelectionPreference.BALANCED
        ),
    ),
)

print(response.text)  # Should answer 84 (42 * 2)

Function Calling

Combine Model Optimizer with function calling for agentic workflows:
from google.genai.types import FunctionDeclaration, Tool

# Define function
def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Gets weather in the specified location.
    
    Args:
        location: The location for which to get the weather.
        unit: Temperature unit (celsius or fahrenheit).
    
    Returns:
        Weather information as a dictionary.
    """
    return {
        "location": location,
        "unit": unit,
        "weather": "Sunny",
        "temperature": 22,
    }

# Define function schema
weather_function = FunctionDeclaration(
    name="get_current_weather",
    description="Get the current weather in a given location",
    parameters={
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
            },
        },
        "required": ["location"],
    },
    response={
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string"},
            "weather": {"type": "string"},
            "temperature": {"type": "number"},
        },
    },
)

weather_tool = Tool(function_declarations=[weather_function])

# Use with Model Optimizer
response = client.models.generate_content(
    model="model-optimizer-exp-04-09",
    contents="What is the weather like in Boston?",
    config=GenerateContentConfig(
        tools=[weather_tool],
        model_selection_config=ModelSelectionConfig(
            feature_selection_preference=FeatureSelectionPreference.BALANCED
        ),
    ),
)

print("Function call:", response.function_calls)

# Execute function
function_map = {"get_current_weather": get_current_weather}

for function_call in response.function_calls:
    func = function_map[function_call.name]
    result = func(**function_call.args)
    print(f"Result: {result}")

System Instructions

Combine Model Optimizer with system instructions for consistent behavior:
system_instruction = """
You are a helpful financial advisor assistant.
Always provide clear, actionable advice.
Cite sources when making specific claims.
"""

response = client.models.generate_content(
    model="model-optimizer-exp-04-09",
    contents="Should I invest in index funds or individual stocks?",
    config=GenerateContentConfig(
        system_instruction=system_instruction,
        model_selection_config=ModelSelectionConfig(
            feature_selection_preference=FeatureSelectionPreference.PRIORITIZE_QUALITY
        ),
    ),
)

print(response.text)

Cost Optimization Patterns

Batch Processing

Process multiple queries efficiently:
import asyncio
from typing import List, Dict

async def batch_process_queries(
    queries: List[str],
    preference: FeatureSelectionPreference
) -> List[Dict]:
    """Process multiple queries in parallel."""
    async def process_single(query: str) -> Dict:
        # Use the async client so requests actually run concurrently
        response = await client.aio.models.generate_content(
            model="model-optimizer-exp-04-09",
            contents=query,
            config=GenerateContentConfig(
                model_selection_config=ModelSelectionConfig(
                    feature_selection_preference=preference
                ),
            ),
        )
        return {
            "query": query,
            "response": response.text,
        }
    
    tasks = [process_single(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

# Process batch
queries = [
    "What is Python?",
    "Explain neural networks",
    "Define API",
]

results = asyncio.run(
    batch_process_queries(
        queries,
        FeatureSelectionPreference.PRIORITIZE_COST
    )
)

for result in results:
    print(f"Q: {result['query']}")
    print(f"A: {result['response'][:100]}...\n")

Caching Common Queries

import hashlib

class OptimizedQueryCache:
    def __init__(self):
        self.cache = {}
    
    def _hash_query(self, query: str) -> str:
        return hashlib.md5(query.encode()).hexdigest()
    
    def query(self, prompt: str, preference: FeatureSelectionPreference) -> str:
        """Query with caching for repeated prompts."""
        cache_key = f"{self._hash_query(prompt)}_{preference.name}"
        
        if cache_key in self.cache:
            print("Cache hit!")
            return self.cache[cache_key]
        
        response = client.models.generate_content(
            model="model-optimizer-exp-04-09",
            contents=prompt,
            config=GenerateContentConfig(
                model_selection_config=ModelSelectionConfig(
                    feature_selection_preference=preference
                ),
            ),
        )
        
        self.cache[cache_key] = response.text
        return response.text

# Use cached queries
cache = OptimizedQueryCache()

# First call - hits API
result1 = cache.query("What is AI?", FeatureSelectionPreference.BALANCED)

# Second call - uses cache
result2 = cache.query("What is AI?", FeatureSelectionPreference.BALANCED)

Monitoring and Analytics

Track Model Usage

import time
from collections import defaultdict
from typing import Dict, List

class ModelOptimizerAnalytics:
    def __init__(self):
        self.queries: List[Dict] = []
        self.stats = defaultdict(int)
    
    def query_with_tracking(
        self,
        prompt: str,
        preference: FeatureSelectionPreference
    ) -> Dict:
        """Query and track performance metrics."""
        start_time = time.time()
        
        response = client.models.generate_content(
            model="model-optimizer-exp-04-09",
            contents=prompt,
            config=GenerateContentConfig(
                model_selection_config=ModelSelectionConfig(
                    feature_selection_preference=preference
                ),
            ),
        )
        
        elapsed = time.time() - start_time
        
        query_data = {
            "prompt": prompt,
            "preference": preference.name,
            "response_length": len(response.text),
            "latency_ms": elapsed * 1000,
            "timestamp": time.time(),
        }
        
        self.queries.append(query_data)
        self.stats[preference.name] += 1
        
        return query_data
    
    def get_analytics(self) -> Dict:
        """Get analytics summary."""
        if not self.queries:
            return {}
        
        total_queries = len(self.queries)
        avg_latency = sum(q["latency_ms"] for q in self.queries) / total_queries
        
        return {
            "total_queries": total_queries,
            "avg_latency_ms": avg_latency,
            "queries_by_preference": dict(self.stats),
        }

# Use analytics
analytics = ModelOptimizerAnalytics()

for prompt in ["Hello", "Explain relativity", "What is 1+1?"]:
    analytics.query_with_tracking(
        prompt,
        FeatureSelectionPreference.BALANCED
    )

print(analytics.get_analytics())

Best Practices

Always test different routing strategies with your specific workload to find the optimal balance of quality and cost.

Guidelines

  1. Start with BALANCED - Provides good quality-cost tradeoff for most applications
  2. Use PRIORITIZE_QUALITY for:
    • Critical business decisions
    • Complex reasoning tasks
    • Customer-facing applications where quality is paramount
  3. Use PRIORITIZE_COST for:
    • High-volume, simple queries
    • Internal tools and automation
    • Non-critical batch processing
  4. Monitor and adjust - Track metrics and adjust strategy based on results
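The guidelines above can be condensed into a small helper. The use-case labels here are hypothetical, not part of the SDK; in real code you would pass the corresponding `FeatureSelectionPreference` member to `ModelSelectionConfig`.

```python
# Hypothetical mapping from the guideline categories above to a routing
# preference name. The category strings are illustrative only.

QUALITY_USE_CASES = {"critical_decision", "complex_reasoning", "customer_facing"}
COST_USE_CASES = {"high_volume_simple", "internal_tool", "batch_processing"}

def choose_preference(use_case: str) -> str:
    """Return the preference name suggested by the guidelines."""
    if use_case in QUALITY_USE_CASES:
        return "PRIORITIZE_QUALITY"
    if use_case in COST_USE_CASES:
        return "PRIORITIZE_COST"
    return "BALANCED"  # guideline 1: start with BALANCED by default

print(choose_preference("customer_facing"))  # PRIORITIZE_QUALITY
print(choose_preference("internal_tool"))    # PRIORITIZE_COST
print(choose_preference("something_new"))    # BALANCED
```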

Error Handling

import time

from google.genai import errors

def robust_query(prompt: str, max_retries: int = 3) -> str:
    """Query with retry logic and exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model="model-optimizer-exp-04-09",
                contents=prompt,
                config=GenerateContentConfig(
                    model_selection_config=ModelSelectionConfig(
                        feature_selection_preference=FeatureSelectionPreference.BALANCED
                    ),
                ),
            )
            return response.text

        except errors.APIError as e:
            # Retry on rate limiting (HTTP 429) with exponential backoff;
            # re-raise any other API error immediately.
            if e.code == 429 and attempt < max_retries - 1:
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
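The retry loop waits `2 ** attempt` seconds after each failed attempt except the last, so with `max_retries=3` the backoff schedule is 1s, then 2s:

```python
# Backoff schedule for the retry loop above: one wait after each failed
# attempt except the final one, which re-raises instead of sleeping.
max_retries = 3
waits = [2 ** attempt for attempt in range(max_retries - 1)]
print(waits)  # [1, 2]
```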
