Pricing Overview
API costs vary significantly by model and provider. Understanding costs helps you choose the right models for your use case.

Cost Per Game (Estimates)
Approximate costs for a single game (25-word board, ~15 turns):

Free Models
$0.00 per game
OpenRouter free tier models:
- Devstral
- MIMO V2 Flash
- Nemotron Nano
- DeepSeek Chimera variants
- GLM 4.5 Air
- Llama 3.3 70B
- OLMo 3.1 32B
Ultra-Affordable
~$0.001-0.003 per game
- Gemini 2.5 Flash Lite
- Gemini 2.0 Flash Lite
- DeepSeek Chat
- DeepSeek Reasoner
Cost-Effective
~$0.003-0.02 per game
- Gemini 2.5 Flash
- GPT-4o Mini
- Claude Haiku 4.5
- GPT-5 Nano
- GPT-5 Mini
Premium
~$0.05-0.30 per game
- GPT-5
- Claude Sonnet 4.5
- Claude Opus 4.1
- Gemini 2.5 Pro
- O-series models
Model Pricing Data
Pricing information is stored in config.py under LLMConfig.MODEL_COSTS:
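The exact schema lives in the project's config.py; as a sketch only, the structure might look like the following. The field layout, model ids, and every price here are illustrative assumptions, not the project's actual values.

```python
# Hypothetical sketch of MODEL_COSTS; prices are illustrative USD
# per 1M tokens, not authoritative figures.
class LLMConfig:
    MODEL_COSTS = {
        # model id: (input $/1M tokens, output $/1M tokens)
        "gpt-4o-mini": (0.15, 0.60),
        "deepseek-chat": (0.27, 1.10),
        "gemini-2.5-flash": (0.30, 2.50),
        "claude-sonnet-4.5": (3.00, 15.00),
    }

def cost_usd(model: str, in_tok: int, out_tok: int) -> float:
    """Dollar cost of one call, computed from token counts."""
    in_price, out_price = LLMConfig.MODEL_COSTS[model]
    return (in_tok * in_price + out_tok * out_price) / 1_000_000
```

A lookup table keyed by model id keeps per-call cost computation to one multiply per token type.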
Verified Pricing
Confirmed from official provider pricing pages:
- OpenAI
- Anthropic
- Google
- DeepSeek
Estimating Benchmark Costs
Calculate costs before running large benchmarks.

Estimate tokens per game
Typical game (~15 turns):
- Input tokens: ~3,000-5,000
- Output tokens: ~500-1,000
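Plugging the token ranges above into per-token prices gives a quick per-game estimate. The prices used here are illustrative placeholders, not any provider's official rates.

```python
# Rough per-game cost from the token estimates above.
# Prices are illustrative USD per 1M tokens.
def game_cost(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6

# ~15-turn game at the upper end of the token estimates:
cost = game_cost(5_000, 1_000, in_price_per_m=0.15, out_price_per_m=0.60)
print(f"${cost:.4f} per game")  # roughly a tenth of a cent
```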
Cost Optimization Strategies
1. Start with Free Models
Perfect for testing benchmark setup, validating pipelines, and preliminary experiments at zero cost.
2. Mix Free and Paid Models
3. Reduce Games Per Combination
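To see why games per combination dominates total cost, a back-of-envelope calculation helps. The pairing structure (ordered model pairs) and all numbers below are assumptions for illustration.

```python
# Back-of-envelope: total cost scales linearly with games per
# combination and roughly quadratically with the model count.
def benchmark_cost(n_models, games_per_combo, cost_per_game):
    # Assumes ordered pairs of distinct models (e.g. each model
    # plays each role against every other model).
    combos = n_models * (n_models - 1)
    return combos * games_per_combo * cost_per_game

full = benchmark_cost(8, games_per_combo=10, cost_per_game=0.02)
lean = benchmark_cost(8, games_per_combo=3, cost_per_game=0.02)
print(f"${full:.2f} vs ${lean:.2f}")
```

Cutting from 10 games to 3 per combination cuts the bill by the same factor while often preserving the ranking of models.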
4. Target Specific Comparisons
Instead of all combinations, test specific matchups.

5. Use Mini/Lite Variants
Provider Spending Limits
Set spending limits in provider dashboards to prevent unexpected charges:

OpenAI
- Go to platform.openai.com/settings/organization/billing/limits
- Set monthly budget limit
- Configure email alerts
Anthropic
- Go to console.anthropic.com/settings/limits
- Set spending limits
- Enable notifications
Google
- Go to console.cloud.google.com/billing
- Create budget alert
- Set spending threshold
OpenRouter
- Go to openrouter.ai/settings
- Set credit limit
- Use free models when possible
Cost Tracking
During Benchmarks
The benchmark tracks model usage as games run.

Post-Benchmark Analysis
Estimate costs from benchmark results.

Price-Performance Comparison
Best value models (quality per dollar):
- Best Free
- Best Budget
OpenRouter Models
- Llama 3.3 70B: Large, capable, $0.00
- DeepSeek R1T2 Chimera: Reasoning, $0.00
- Devstral: Fast Mistral, $0.00
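A quality-per-dollar ranking like the one above can be sketched as follows; the model names, win rates, and per-game costs are made-up placeholders.

```python
# Rank models by win rate per dollar. All numbers are placeholders.
models = {
    "free-70b": {"win_rate": 0.55, "cost_per_game": 0.00},
    "mini":     {"win_rate": 0.60, "cost_per_game": 0.01},
    "premium":  {"win_rate": 0.72, "cost_per_game": 0.20},
}

def value(stats):
    # Free models have unbounded quality-per-dollar; avoid division by zero.
    if stats["cost_per_game"] == 0:
        return float("inf")
    return stats["win_rate"] / stats["cost_per_game"]

for name, stats in sorted(models.items(), key=lambda kv: value(kv[1]),
                          reverse=True):
    print(name, value(stats))
```

With numbers like these, mini variants often beat premium models on value even though they lose on absolute quality.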
Special Pricing Features
GPT-5.2 Prompt Caching
GPT-5.2 models offer a 90% discount on cached input tokens.

Claude Context Caching
Claude 4.5 models support prompt caching.

Cost Management Best Practices
Always validate with free models first
Set hard spending limits
Configure limits at provider level:
- Daily limit: Prevents runaway costs
- Monthly budget: Tracks spending
- Email alerts: Get notified at 50%, 80%, 100%
Monitor costs during runs
Check provider dashboards periodically:
- OpenAI: Usage tab shows real-time costs
- Anthropic: Console shows current spend
- Track against estimates
Use mini variants for development
Development workflow:
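One way to encode a mini-first workflow as configuration; the phase names and model ids here are purely illustrative, not the project's actual settings.

```python
# Hypothetical mini-first development workflow: cheapest model that
# suits each phase, premium models only for the final run.
def pick_model(phase: str) -> str:
    tiers = {
        "smoke":   "meta-llama/llama-3.3-70b-instruct:free",  # $0.00
        "iterate": "gpt-4o-mini",                             # cents
        "final":   "claude-sonnet-4.5",                       # premium
    }
    return tiers[phase]
```

Keeping the phase-to-model mapping in one place makes it easy to verify that nothing expensive runs before the final benchmark.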
Emergency Cost Control
If costs spiral unexpectedly:

Review usage
Check provider usage dashboards:
- Which models used most tokens?
- Were there repeated errors causing retries?
- Did games run longer than expected?
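If your benchmark writes per-game results to JSON, a quick tally answers the first question above. The record layout shown is an assumption for illustration, not the project's actual schema.

```python
import json
from collections import Counter

def tokens_by_model(games):
    """Sum input+output tokens per model across game records.

    Assumes each game record has a "token_usage" mapping of
    model id -> {"input": int, "output": int} (hypothetical schema).
    """
    totals = Counter()
    for game in games:
        for model, usage in game.get("token_usage", {}).items():
            totals[model] += usage.get("input", 0) + usage.get("output", 0)
    return totals

# Usage, assuming a results.json in the format above:
# with open("results.json") as f:
#     for model, tokens in tokens_by_model(json.load(f)).most_common():
#         print(model, tokens)
```

`Counter.most_common()` puts the heaviest token consumers first, which is usually where a cost spike comes from.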
Cost Reporting
Generate cost reports from benchmark results.

Next Steps
Model Selection
Compare model capabilities and pricing
Benchmarking
Run cost-effective benchmarks
Configuration
Optimize settings for cost control
Running Games
Test models before committing to benchmarks