Pricing Overview

API costs vary significantly by model and provider. Understanding costs helps you choose the right models for your use case.

Cost Per Game (Estimates)

Approximate costs for a single game (25-word board, ~15 turns):

Free Models

$0.00 per game

OpenRouter free tier models:
  • Devstral
  • MIMO V2 Flash
  • Nemotron Nano
  • DeepSeek Chimera variants
  • GLM 4.5 Air
  • Llama 3.3 70B
  • OLMo 3.1 32B

Ultra-Affordable

~$0.001-0.003 per game
  • Gemini 2.5 Flash Lite
  • Gemini 2.0 Flash Lite
  • DeepSeek Chat
  • DeepSeek Reasoner

Cost-Effective

~$0.003-0.02 per game
  • Gemini 2.5 Flash
  • GPT-4o Mini
  • Claude Haiku 4.5
  • GPT-5 Nano
  • GPT-5 Mini

Premium

~$0.05-0.30 per game
  • GPT-5
  • Claude Sonnet 4.5
  • Claude Opus 4.1
  • Gemini 2.5 Pro
  • O-series models

Model Pricing Data

Pricing information is stored in config.py under LLMConfig.MODEL_COSTS:

Verified Pricing

Confirmed from official provider pricing pages:
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"gpt-4": {"input": 0.03, "output": 0.06},
"gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
Prices in USD per 1K tokens.
Pricing for newer models (GPT-5, Claude 4.5, Grok 4) is marked as UNVERIFIED and may be estimated. Check provider documentation for accurate pricing.
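To illustrate how these entries are applied (the dict literal below copies the verified values from the table above; `call_cost` is a hypothetical helper, not part of config.py):

```python
# Illustrative helper (not part of config.py); the dict literal copies
# the verified per-1K-token prices shown above.
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one API call at per-1K-token rates."""
    pricing = MODEL_COSTS[model]
    return (input_tokens / 1000) * pricing["input"] + \
           (output_tokens / 1000) * pricing["output"]

# A single hint-giving turn: ~300 input tokens, ~50 output tokens
print(f"${call_cost('gpt-4o-mini', 300, 50):.6f}")  # → $0.000075
```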

Estimating Benchmark Costs

Calculate costs before running large benchmarks:

1. Calculate total games

from model_config import get_benchmark_models

models = get_benchmark_models()
n = len(models)
games_per_combo = 2

# N × (N-1) × (N-2) × (N-3) combinations
combinations = n * (n-1) * (n-2) * (n-3)
total_games = combinations * games_per_combo

print(f"Total games: {total_games}")
Example: 8 models → 8 × 7 × 6 × 5 = 1,680 combinations × 2 games each = 3,360 games

2. Estimate tokens per game

Typical game (~15 turns):
  • Input tokens: ~3,000-5,000
  • Output tokens: ~500-1,000
avg_input_tokens = 4000
avg_output_tokens = 750

3. Calculate cost per game

from config import LLMConfig

costs = LLMConfig.MODEL_COSTS

model_name = "gpt-4o-mini"
pricing = costs[model_name]

input_cost = (avg_input_tokens / 1000) * pricing["input"]
output_cost = (avg_output_tokens / 1000) * pricing["output"]
cost_per_game = input_cost + output_cost

print(f"Cost per game: ${cost_per_game:.4f}")

4. Calculate total cost

# cost_per_game is per model; each game uses 4 models (one per role)
total_cost = total_games * cost_per_game * 4

print(f"Estimated total: ${total_cost:.2f}")
Example: 3,360 games × $0.0008 per model-game × 4 models ≈ $10.75

Cost Optimization Strategies

1. Start with Free Models

# model_config.py
def get_benchmark_models() -> list:
    return [
        # All free via OpenRouter!
        BAMLModel.OPENROUTER_DEVSTRAL,
        BAMLModel.OPENROUTER_MIMO_V2_FLASH,
        BAMLModel.OPENROUTER_NEMOTRON_NANO,
        BAMLModel.OPENROUTER_LLAMA_33_70B,
    ]
Perfect for testing benchmark setup, validating pipelines, and preliminary experiments at zero cost.

2. Mix Free and Paid Models

def get_benchmark_models() -> list:
    return [
        # Free baseline
        BAMLModel.OPENROUTER_DEVSTRAL,
        # Cost-effective
        BAMLModel.DEEPSEEK_CHAT,
        BAMLModel.GEMINI_25_FLASH,
        # Premium comparison
        BAMLModel.GPT5_MINI,
    ]

3. Reduce Games Per Combination

# benchmark.py
GAMES_PER_COMBINATION = 1  # Quick validation
# vs
GAMES_PER_COMBINATION = 5  # Production benchmark
Total cost scales linearly with games per combination.
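That linearity is easy to check with the illustrative figures from the estimation walkthrough (1,680 combinations from 8 models, ~$0.0008 per model-game, 4 models per game):

```python
# Figures from the estimation example above: 1,680 combinations
# (8 models), ~$0.0008 per model-game, 4 models per game.
combinations = 1680
cost_per_model_game = 0.0008

for games_per_combo in (1, 2, 5):
    total = combinations * games_per_combo * cost_per_model_game * 4
    print(f"{games_per_combo} game(s)/combo: ${total:.2f}")
```

Doubling games per combination doubles cost, so dial it up only after the pipeline is validated.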

4. Target Specific Comparisons

Instead of all combinations, test specific matchups:
test_combinations = [
    # Compare GPT-5 Mini vs Claude Sonnet 4.5 head-to-head
    (BAMLModel.GPT5_MINI, BAMLModel.GPT5_MINI,
     BAMLModel.CLAUDE_SONNET_45, BAMLModel.CLAUDE_SONNET_45),
    # Test mixed teams
    (BAMLModel.GPT5_MINI, BAMLModel.DEEPSEEK_CHAT,
     BAMLModel.CLAUDE_HAIKU_45, BAMLModel.GEMINI_25_FLASH),
]

total_games = len(test_combinations) * GAMES_PER_COMBINATION
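Assuming the 8-model pool from the earlier walkthrough, the savings over a full sweep are dramatic:

```python
GAMES_PER_COMBINATION = 2
n = 8  # models in the pool

# Full sweep: every ordered assignment of 4 distinct models
full_sweep_games = n * (n - 1) * (n - 2) * (n - 3) * GAMES_PER_COMBINATION

# Targeted: just the two matchups listed above
targeted_games = 2 * GAMES_PER_COMBINATION

print(f"Full sweep: {full_sweep_games} games")  # 3360
print(f"Targeted:   {targeted_games} games")    # 4
```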

5. Use Mini/Lite Variants

# Instead of expensive models
BAMLModel.GPT5          # ~$0.10/game
BAMLModel.CLAUDE_OPUS_41  # ~$0.30/game

# Use affordable alternatives
BAMLModel.GPT5_MINI     # ~$0.02/game  
BAMLModel.CLAUDE_HAIKU_45  # ~$0.01/game

Provider Spending Limits

Set spending limits in provider dashboards to prevent unexpected charges:

OpenAI

  1. Go to platform.openai.com/settings/organization/billing/limits
  2. Set monthly budget limit
  3. Configure email alerts

Anthropic

  1. Go to console.anthropic.com/settings/limits
  2. Set spending limits
  3. Enable notifications

Google

  1. Go to console.cloud.google.com/billing
  2. Create budget alert
  3. Set spending threshold

OpenRouter

  1. Go to openrouter.ai/settings
  2. Set credit limit
  3. Use free models when possible

Cost Tracking

During Benchmarks

The benchmark tracks model usage:
result = runner.run()

# Access per-model statistics
for model_key, metrics in result.model_performance.items():
    games = metrics['games_played']
    turns = metrics['turns_played']
    print(f"{model_key}: {games} games, {turns} turns")

Post-Benchmark Analysis

Estimate costs from benchmark results:
import json
from config import LLMConfig

# Load results
with open('benchmark_results/benchmark_20260305.json') as f:
    data = json.load(f)

# Calculate costs
costs = LLMConfig.MODEL_COSTS
total_cost = 0

for game in data['games']:
    for role in ['blue_hint_giver_model', 'blue_guesser_model',
                 'red_hint_giver_model', 'red_guesser_model']:
        model = game[role]
        # Rough estimate: ~4,000 input / ~750 output tokens per model
        # per game (an exact figure would need logged token counts)
        if model in costs:
            pricing = costs[model]
            total_cost += (4000 / 1000) * pricing['input'] \
                        + (750 / 1000) * pricing['output']

print(f"Estimated total cost: ${total_cost:.2f}")

Price-Performance Comparison

Best value models (quality per dollar):

OpenRouter Models
  • Llama 3.3 70B: Large, capable, $0.00
  • DeepSeek R1T2 Chimera: Reasoning, $0.00
  • Devstral: Fast Mistral, $0.00
Perfect for unlimited experimentation.

Special Pricing Features

GPT-5.2 Prompt Caching

GPT-5.2 models offer 90% discount on cached inputs:
# Regular pricing
GPT-5.2: $1.75/1M input, $14/1M output

# Cached input (repeated prompts)
GPT-5.2: $0.175/1M input, $14/1M output
Benefits benchmarks with consistent system prompts.
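Assuming a game's ~4,000 input tokens all hit the cache (rates as listed above, unverified), per-game input cost drops tenfold:

```python
INPUT_RATE = 1.75          # USD per 1M input tokens (unverified)
CACHED_INPUT_RATE = 0.175  # 90% discount on cache hits

input_tokens = 4_000  # typical game, dominated by a repeated system prompt

uncached = input_tokens / 1_000_000 * INPUT_RATE
cached = input_tokens / 1_000_000 * CACHED_INPUT_RATE

print(f"uncached ${uncached:.4f} vs cached ${cached:.5f} per game input")
```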

Claude Context Caching

Claude 4.5 models support prompt caching:
# Cache the Codenames rules and board state
# Reduces costs for multi-turn games
Check provider documentation for implementation.

Cost Management Best Practices

# 1. Test with free models
python benchmark.py  # Uses OpenRouter free tier

# 2. Verify pipeline works
python analyze_benchmark_results.py ...

# 3. Then run with paid models

Configure limits at provider level:
  • Daily limit: Prevents runaway costs
  • Monthly budget: Tracks spending
  • Email alerts: Get notified at 50%, 80%, 100%

Check provider dashboards periodically:
  • OpenAI: Usage tab shows real-time costs
  • Anthropic: Console shows current spend
  • Track against estimates
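To make "track against estimates" concrete, a small check like this works (`spend_vs_estimate` is illustrative, not part of the project):

```python
def spend_vs_estimate(actual: float, estimate: float,
                      tolerance: float = 0.25) -> bool:
    """True if actual spend is within `tolerance` (as a fraction) of the estimate."""
    return actual <= estimate * (1 + tolerance)

# Dashboard shows $13.10 against the ~$10.75 estimate: ~22% over, acceptable
print(spend_vs_estimate(13.10, 10.75))  # True
# Nearly double the estimate: time to investigate
print(spend_vs_estimate(20.00, 10.75))  # False
```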

Development workflow:
# Development: Fast, cheap
BAMLModel.GPT5_NANO
BAMLModel.GEMINI_25_FLASH_LITE

# Production: Full quality
BAMLModel.GPT5_MINI
BAMLModel.GEMINI_25_FLASH

Emergency Cost Control

If costs spiral unexpectedly:

1. Stop running benchmarks

# Ctrl+C to interrupt
# Or kill the process
pkill -f benchmark.py

2. Revoke API keys temporarily

Go to provider dashboards and disable keys until you investigate.

3. Review usage

Check provider usage dashboards:
  • Which models used most tokens?
  • Were there repeated errors causing retries?
  • Did games run longer than expected?

4. Adjust configuration

# Reduce limits
config.game.MAX_TURNS = 20  # Shorter games
config.llm.MAX_RETRIES = 1  # Fewer retries

# Use cheaper models
BENCHMARK_MODELS = [BAMLModel.OPENROUTER_DEVSTRAL]

Cost Reporting

Generate cost reports from benchmarks:
import json

from config import LLMConfig

def estimate_benchmark_cost(result_file):
    with open(result_file) as f:
        data = json.load(f)
    
    costs = LLMConfig.MODEL_COSTS
    total = 0
    
    for model_key, metrics in data['model_performance'].items():
        model_name = metrics['model']
        games = metrics['games_played']
        
        # Estimate tokens (simplified)
        est_input = games * 4000
        est_output = games * 750
        
        if model_name in costs:
            pricing = costs[model_name]
            cost = (est_input/1000 * pricing['input'] + 
                   est_output/1000 * pricing['output'])
            total += cost
            
            print(f"{model_key}: ${cost:.2f}")
    
    print(f"\nTotal estimated: ${total:.2f}")

estimate_benchmark_cost('benchmark_results/benchmark_20260305.json')

Next Steps

Model Selection

Compare model capabilities and pricing

Benchmarking

Run cost-effective benchmarks

Configuration

Optimize settings for cost control

Running Games

Test models before committing to benchmarks
