Pricing Overview

API costs vary significantly by model and provider. Understanding costs helps you choose the right models for your use case.

Cost Per Game (Estimates)

Approximate costs for a single game (25-word board, ~15 turns):

Free Models

$0.00 per game

OpenRouter free tier models:
  • Devstral
  • MIMO V2 Flash
  • Nemotron Nano
  • DeepSeek Chimera variants
  • GLM 4.5 Air
  • Llama 3.3 70B
  • OLMo 3.1 32B

Ultra-Affordable

~$0.001-0.003 per game
  • Gemini 2.5 Flash Lite
  • Gemini 2.0 Flash Lite
  • DeepSeek Chat
  • DeepSeek Reasoner

Cost-Effective

~$0.003-0.02 per game
  • Gemini 2.5 Flash
  • GPT-4o Mini
  • Claude Haiku 4.5
  • GPT-5 Nano
  • GPT-5 Mini

Premium

~$0.05-0.30 per game
  • GPT-5
  • Claude Sonnet 4.5
  • Claude Opus 4.1
  • Gemini 2.5 Pro
  • O-series models

Model Pricing Data

Pricing information is stored in config.py under LLMConfig.MODEL_COSTS:

Verified Pricing

Confirmed from official provider pricing pages:
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"gpt-4": {"input": 0.03, "output": 0.06},
"gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
Prices in USD per 1K tokens.
Pricing for newer models (GPT-5, Claude 4.5, Grok 4) is marked as UNVERIFIED and may be estimated. Check provider documentation for accurate pricing.
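To illustrate how these entries are applied (the dict literal below copies the verified values from the table above; `call_cost` is a hypothetical helper, not part of config.py):

```python
# Illustrative helper (not part of config.py); the dict literal copies
# the verified per-1K-token prices shown above.
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one API call at per-1K-token rates."""
    pricing = MODEL_COSTS[model]
    return (input_tokens / 1000) * pricing["input"] + \
           (output_tokens / 1000) * pricing["output"]

# A single hint-giving turn: ~300 input tokens, ~50 output tokens
print(f"${call_cost('gpt-4o-mini', 300, 50):.6f}")  # → $0.000075
```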

Estimating Benchmark Costs

Calculate costs before running large benchmarks:

1. Calculate total games

from model_config import get_benchmark_models

models = get_benchmark_models()
n = len(models)
games_per_combo = 2

# N × (N-1) × (N-2) × (N-3) combinations
combinations = n * (n-1) * (n-2) * (n-3)
total_games = combinations * games_per_combo

print(f"Total games: {total_games}")
Example: 8 models → 8 × 7 × 6 × 5 = 1,680 combinations × 2 games each = 3,360 games

2. Estimate tokens per game

Typical game (~15 turns):
  • Input tokens: ~3,000-5,000
  • Output tokens: ~500-1,000
avg_input_tokens = 4000
avg_output_tokens = 750

3. Calculate cost per game

from config import LLMConfig

costs = LLMConfig.MODEL_COSTS

model_name = "gpt-4o-mini"
pricing = costs[model_name]

input_cost = (avg_input_tokens / 1000) * pricing["input"]
output_cost = (avg_output_tokens / 1000) * pricing["output"]
cost_per_game = input_cost + output_cost

print(f"Cost per game: ${cost_per_game:.4f}")

4. Calculate total cost

# cost_per_game is per model; each game uses 4 models (one per role)
total_cost = total_games * cost_per_game * 4

print(f"Estimated total: ${total_cost:.2f}")
Example: 3,360 games × $0.0008 per model-game × 4 models ≈ $10.75

Cost Optimization Strategies

1. Start with Free Models

# model_config.py
def get_benchmark_models() -> list:
    return [
        # All free via OpenRouter!
        BAMLModel.OPENROUTER_DEVSTRAL,
        BAMLModel.OPENROUTER_MIMO_V2_FLASH,
        BAMLModel.OPENROUTER_NEMOTRON_NANO,
        BAMLModel.OPENROUTER_LLAMA_33_70B,
    ]
Perfect for testing benchmark setup, validating pipelines, and preliminary experiments at zero cost.

2. Mix Free and Paid Models

def get_benchmark_models() -> list:
    return [
        # Free baseline
        BAMLModel.OPENROUTER_DEVSTRAL,
        # Cost-effective
        BAMLModel.DEEPSEEK_CHAT,
        BAMLModel.GEMINI_25_FLASH,
        # Premium comparison
        BAMLModel.GPT5_MINI,
    ]

3. Reduce Games Per Combination

# benchmark.py
GAMES_PER_COMBINATION = 1  # Quick validation
# vs
GAMES_PER_COMBINATION = 5  # Production benchmark
Total cost scales linearly with games per combination.
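That linearity is easy to check with the illustrative figures from the estimation walkthrough (1,680 combinations from 8 models, ~$0.0008 per model-game, 4 models per game):

```python
# Figures from the estimation example above: 1,680 combinations
# (8 models), ~$0.0008 per model-game, 4 models per game.
combinations = 1680
cost_per_model_game = 0.0008

for games_per_combo in (1, 2, 5):
    total = combinations * games_per_combo * cost_per_model_game * 4
    print(f"{games_per_combo} game(s)/combo: ${total:.2f}")
```

Doubling games per combination doubles cost, so dial it up only after the pipeline is validated.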

4. Target Specific Comparisons

Instead of all combinations, test specific matchups:
test_combinations = [
    # Compare GPT-5 Mini vs Claude Sonnet 4.5 head-to-head
    (BAMLModel.GPT5_MINI, BAMLModel.GPT5_MINI,
     BAMLModel.CLAUDE_SONNET_45, BAMLModel.CLAUDE_SONNET_45),
    # Test mixed teams
    (BAMLModel.GPT5_MINI, BAMLModel.DEEPSEEK_CHAT,
     BAMLModel.CLAUDE_HAIKU_45, BAMLModel.GEMINI_25_FLASH),
]

total_games = len(test_combinations) * GAMES_PER_COMBINATION
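Assuming the 8-model pool from the earlier walkthrough, the savings over a full sweep are dramatic:

```python
GAMES_PER_COMBINATION = 2
n = 8  # models in the pool

# Full sweep: every ordered assignment of 4 distinct models
full_sweep_games = n * (n - 1) * (n - 2) * (n - 3) * GAMES_PER_COMBINATION

# Targeted: just the two matchups listed above
targeted_games = 2 * GAMES_PER_COMBINATION

print(f"Full sweep: {full_sweep_games} games")  # 3360
print(f"Targeted:   {targeted_games} games")    # 4
```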

5. Use Mini/Lite Variants

# Instead of expensive models
BAMLModel.GPT5          # ~$0.10/game
BAMLModel.CLAUDE_OPUS_41  # ~$0.30/game

# Use affordable alternatives
BAMLModel.GPT5_MINI     # ~$0.02/game  
BAMLModel.CLAUDE_HAIKU_45  # ~$0.01/game

Provider Spending Limits

Set spending limits in provider dashboards to prevent unexpected charges:

OpenAI

  1. Go to platform.openai.com/settings/organization/billing/limits
  2. Set monthly budget limit
  3. Configure email alerts

Anthropic

  1. Go to console.anthropic.com/settings/limits
  2. Set spending limits
  3. Enable notifications

Google

  1. Go to console.cloud.google.com/billing
  2. Create budget alert
  3. Set spending threshold

OpenRouter

  1. Go to openrouter.ai/settings
  2. Set credit limit
  3. Use free models when possible

Cost Tracking

During Benchmarks

The benchmark tracks model usage:
result = runner.run()

# Access per-model statistics
for model_key, metrics in result.model_performance.items():
    games = metrics['games_played']
    turns = metrics['turns_played']
    print(f"{model_key}: {games} games, {turns} turns")

Post-Benchmark Analysis

Estimate costs from benchmark results:
import json
from config import LLMConfig

# Load results
with open('benchmark_results/benchmark_20260305.json') as f:
    data = json.load(f)

# Calculate costs
costs = LLMConfig.MODEL_COSTS
total_cost = 0

for game in data['games']:
    for role in ['blue_hint_giver_model', 'blue_guesser_model',
                 'red_hint_giver_model', 'red_guesser_model']:
        model = game[role]
        # Rough estimate: ~4,000 input / ~750 output tokens per model
        # per game (an exact figure would need logged token counts)
        if model in costs:
            pricing = costs[model]
            total_cost += (4000 / 1000) * pricing['input'] \
                        + (750 / 1000) * pricing['output']

print(f"Estimated total cost: ${total_cost:.2f}")

Price-Performance Comparison

Best value models (quality per dollar):

OpenRouter Models
  • Llama 3.3 70B: Large, capable, $0.00
  • DeepSeek R1T2 Chimera: Reasoning, $0.00
  • Devstral: Fast Mistral, $0.00
Perfect for unlimited experimentation.

Special Pricing Features

GPT-5.2 Prompt Caching

GPT-5.2 models offer 90% discount on cached inputs:
# Regular pricing
GPT-5.2: $1.75/1M input, $14/1M output

# Cached input (repeated prompts)
GPT-5.2: $0.175/1M input, $14/1M output
Benefits benchmarks with consistent system prompts.
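Assuming a game's ~4,000 input tokens all hit the cache (rates as listed above, unverified), per-game input cost drops tenfold:

```python
INPUT_RATE = 1.75          # USD per 1M input tokens (unverified)
CACHED_INPUT_RATE = 0.175  # 90% discount on cache hits

input_tokens = 4_000  # typical game, dominated by a repeated system prompt

uncached = input_tokens / 1_000_000 * INPUT_RATE
cached = input_tokens / 1_000_000 * CACHED_INPUT_RATE

print(f"uncached ${uncached:.4f} vs cached ${cached:.5f} per game input")
```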

Claude Context Caching

Claude 4.5 models support prompt caching:
# Cache the Codenames rules and board state
# Reduces costs for multi-turn games
Check provider documentation for implementation.

Cost Management Best Practices

# 1. Test with free models
python benchmark.py  # Uses OpenRouter free tier

# 2. Verify pipeline works
python analyze_benchmark_results.py ...

# 3. Then run with paid models

Configure limits at provider level:
  • Daily limit: Prevents runaway costs
  • Monthly budget: Tracks spending
  • Email alerts: Get notified at 50%, 80%, 100%

Check provider dashboards periodically:
  • OpenAI: Usage tab shows real-time costs
  • Anthropic: Console shows current spend
  • Track against estimates
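To make "track against estimates" concrete, a small check like this works (`spend_vs_estimate` is illustrative, not part of the project):

```python
def spend_vs_estimate(actual: float, estimate: float,
                      tolerance: float = 0.25) -> bool:
    """True if actual spend is within `tolerance` (as a fraction) of the estimate."""
    return actual <= estimate * (1 + tolerance)

# Dashboard shows $13.10 against the ~$10.75 estimate: ~22% over, acceptable
print(spend_vs_estimate(13.10, 10.75))  # True
# Nearly double the estimate: time to investigate
print(spend_vs_estimate(20.00, 10.75))  # False
```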

Development workflow:
# Development: Fast, cheap
BAMLModel.GPT5_NANO
BAMLModel.GEMINI_25_FLASH_LITE

# Production: Full quality
BAMLModel.GPT5_MINI
BAMLModel.GEMINI_25_FLASH

Emergency Cost Control

If costs spiral unexpectedly:

1. Stop running benchmarks

# Ctrl+C to interrupt
# Or kill the process
pkill -f benchmark.py

2. Revoke API keys temporarily

Go to provider dashboards and disable keys until you investigate.

3. Review usage

Check provider usage dashboards:
  • Which models used most tokens?
  • Were there repeated errors causing retries?
  • Did games run longer than expected?

4. Adjust configuration

# Reduce limits
config.game.MAX_TURNS = 20  # Shorter games
config.llm.MAX_RETRIES = 1  # Fewer retries

# Use cheaper models
BENCHMARK_MODELS = [BAMLModel.OPENROUTER_DEVSTRAL]

Cost Reporting

Generate cost reports from benchmarks:
import json

from config import LLMConfig

def estimate_benchmark_cost(result_file):
    with open(result_file) as f:
        data = json.load(f)
    
    costs = LLMConfig.MODEL_COSTS
    total = 0
    
    for model_key, metrics in data['model_performance'].items():
        model_name = metrics['model']
        games = metrics['games_played']
        
        # Estimate tokens (simplified)
        est_input = games * 4000
        est_output = games * 750
        
        if model_name in costs:
            pricing = costs[model_name]
            cost = (est_input/1000 * pricing['input'] + 
                   est_output/1000 * pricing['output'])
            total += cost
            
            print(f"{model_key}: ${cost:.2f}")
    
    print(f"\nTotal estimated: ${total:.2f}")

estimate_benchmark_cost('benchmark_results/benchmark_20260305.json')

Next Steps

Model Selection

Compare model capabilities and pricing

Benchmarking

Run cost-effective benchmarks

Configuration

Optimize settings for cost control

Running Games

Test models before committing to benchmarks
