Overview
The benchmark suite tests model combinations by running multiple games and collecting performance metrics. Each model plays at most one role per game to ensure fair evaluation.
Quick Start
Run the benchmark
The script will ask for confirmation before starting.
Wait for completion
The benchmark runs multiple games testing different model combinations. Progress is shown in real-time.
Analyze results
python analyze_benchmark_results.py benchmark_results/<result_file>.json
This generates insights, statistics, and visualizations.
Configuring Benchmarks
Select Models to Test
Edit model_config.py to choose which models are benchmarked:
def get_benchmark_models() -> list:
    """Get the list of models for benchmarking."""
    return [
        BAMLModel.OPENROUTER_DEVSTRAL,
        BAMLModel.OPENROUTER_MIMO_V2_FLASH,
        BAMLModel.OPENROUTER_NEMOTRON_NANO,
        BAMLModel.OPENROUTER_DEEPSEEK_R1T_CHIMERA,
        BAMLModel.OPENROUTER_DEEPSEEK_R1T2_CHIMERA,
        BAMLModel.OPENROUTER_GLM_45_AIR,
        BAMLModel.OPENROUTER_LLAMA_33_70B,
        BAMLModel.OPENROUTER_OLMO3_32B,
    ]
The default configuration uses free OpenRouter models for zero-cost experimentation.
Adjust Games Per Combination
Modify GAMES_PER_COMBINATION in benchmark.py:
# Benchmark settings
GAMES_PER_COMBINATION = 2 # Increase for more statistical significance
OUTPUT_DIR = "benchmark_results"
VERBOSE = True
How Combinations Work
The benchmark generates combinations where each model plays at most one role per game:
# Each model can be:
# - Blue team hint giver
# - Blue team guesser
# - Red team hint giver
# - Red team guesser
# Constraints ensure fair evaluation:
for blue_hint in BENCHMARK_MODELS:
    for blue_guess in BENCHMARK_MODELS:
        if blue_guess != blue_hint:  # Blue team uses different models
            for red_hint in BENCHMARK_MODELS:
                if red_hint not in [blue_hint, blue_guess]:  # No overlap with blue
                    for red_guess in BENCHMARK_MODELS:
                        if red_guess != red_hint and \
                           red_guess not in [blue_hint, blue_guess]:  # All unique
                            # Valid combination!
                            test_combinations.append(...)
With N models filling 4 distinct roles, this produces N × (N-1) × (N-2) × (N-3) combinations. For example, 8 models yield 8 × 7 × 6 × 5 = 1,680 combinations.
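The count above is just the number of ordered 4-tuples of distinct models, so it can be sanity-checked with the standard library (a standalone check, not part of benchmark.py):

```python
from itertools import permutations

# Ordered assignments of 4 distinct models to the 4 roles:
# (blue_hint, blue_guess, red_hint, red_guess)
models = range(8)  # stand-in for 8 BAMLModel entries
combinations = list(permutations(models, 4))

print(len(combinations))  # 8 * 7 * 6 * 5 = 1680
```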
Metrics Tracked
The benchmark collects comprehensive statistics:
Per-Model Metrics
Win Rate: Percentage of games won when playing for a specific team
Hint Success Rate: How often hints lead to at least one correct guess
Guess Accuracy: Percentage of guesses that were correct
First Guess Accuracy: Success rate of the first guess each turn
Per-Combination Metrics
Games played: Total games for this combination
Win rates: Blue wins, red wins, draws
Average turns: Turns needed to complete games
Error rate: Frequency of errors or failures
Advanced Statistics
# Per model, per role, per team
metrics = {
    'games_played': 0,      # Total games in this role
    'wins': 0,              # Games won
    'turns_played': 0,      # Total turns taken
    'hints_given': 0,       # Total hints (hint givers only)
    'successful_hints': 0,  # Hints with ≥1 correct guess
    'guesses_made': 0,      # Total guesses attempted
    'correct_guesses': 0,   # Correct guesses
    'wrong_guesses': 0,     # Incorrect guesses
    'bomb_hits': 0,         # Bomb hits (instant loss)
    'invalid_offboard': 0,  # Guesses not on board
    'invalid_revealed': 0,  # Already revealed words
}
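The headline rates in the earlier sections follow directly from these raw counters. A small sketch (the helper name is illustrative, not from the codebase):

```python
def derive_rates(m: dict) -> dict:
    """Compute headline rates from the raw counters, guarding against division by zero."""
    def ratio(num, den):
        return num / den if den else 0.0
    return {
        'win_rate': ratio(m['wins'], m['games_played']),
        'hint_success_rate': ratio(m['successful_hints'], m['hints_given']),
        'guess_accuracy': ratio(m['correct_guesses'], m['guesses_made']),
    }

# Example counters for one model/role/team slot
metrics = {'games_played': 10, 'wins': 6,
           'hints_given': 40, 'successful_hints': 30,
           'guesses_made': 80, 'correct_guesses': 52}
print(derive_rates(metrics))
# {'win_rate': 0.6, 'hint_success_rate': 0.75, 'guess_accuracy': 0.65}
```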
Benchmark Results
File Structure
benchmark_results/
├── benchmark_20260305_143022.json # Raw benchmark data
└── analysis_plots/ # Generated visualizations
├── model_win_rates.png
├── hint_success_rates.png
├── guess_accuracy.png
└── turn_efficiency.png
The JSON file contains:
{
  "benchmark_id": "benchmark_20260305_143022",
  "timestamp": "2026-03-05T14:30:22.123456",
  "total_combinations": 1680,
  "total_games": 3360,
  "model_performance": {
    "OpenRouterDevstral_hint_giver_blue": {
      "model": "OpenRouterDevstral",
      "role": "hint_giver",
      "team": "blue",
      "games_played": 420,
      "wins": 245,
      "hints_given": 1834,
      "successful_hints": 1567
    }
  },
  "team_combinations": { ... },
  "games": [ ... ]
}
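Because the file is plain JSON, it can be consumed with the standard json module. For example, computing each entry's hint success rate (the inline string below mirrors the schema excerpt above):

```python
import json

raw = '''{
  "model_performance": {
    "OpenRouterDevstral_hint_giver_blue": {
      "model": "OpenRouterDevstral",
      "games_played": 420, "wins": 245,
      "hints_given": 1834, "successful_hints": 1567
    }
  }
}'''

data = json.loads(raw)
for key, perf in data["model_performance"].items():
    rate = perf["successful_hints"] / perf["hints_given"]
    print(f"{key}: {rate:.1%} hint success")
```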
Analysis Pipeline
The analysis script processes results through a modular pipeline:
Load data
Read the benchmark JSON file
Calculate metrics
Compute win rates, accuracy, efficiency
Generate insights
Identify patterns and top performers
Create visualizations
Generate charts and plots
Write report
Save markdown report with findings
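The five stages above can be sketched as plain functions chained in order (function names and internals are illustrative; the real script may differ, and the visualization stage is omitted here for brevity):

```python
import json

def load_data(path):
    """Stage 1: read the benchmark JSON file."""
    with open(path) as f:
        return json.load(f)

def calculate_metrics(data):
    """Stage 2: compute win rates per model/role/team key."""
    rates = {}
    for key, m in data["model_performance"].items():
        if m["games_played"]:
            rates[key] = m["wins"] / m["games_played"]
    return rates

def generate_insights(rates):
    """Stage 3: rank entries to surface top performers."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

def write_report(insights):
    """Stage 5: render findings as markdown."""
    lines = [f"- {key}: {rate:.1%} win rate" for key, rate in insights]
    return "# Benchmark Findings\n" + "\n".join(lines)
```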
Running Analysis
python analyze_benchmark_results.py benchmark_results/benchmark_20260305_143022.json
Generates the full analysis with visualizations.

python analyze_benchmark_results.py benchmark_results/benchmark_20260305_143022.json --no-plots
Generates the insights report without creating plot files.
Programmatic Benchmarking
Run benchmarks programmatically for custom workflows:
from benchmark import BenchmarkRunner
from agents.llm import BAMLModel

# Create runner
runner = BenchmarkRunner(
    games_per_combination=3,
    verbose=True
)

# Run benchmark
result = runner.run()

# Save results
main_file = result.save("my_benchmark_results")
print(f"Results saved to: {main_file}")

# Access data
for model_key, metrics in result.model_performance.items():
    if metrics['games_played'] > 0:
        win_rate = metrics['wins'] / metrics['games_played']
        print(f"{model_key}: {win_rate:.1%} win rate")
Best Practices
Begin with 1-2 games per combination to verify setup: GAMES_PER_COMBINATION = 1 # Quick validation run
Then increase for production benchmarks: GAMES_PER_COMBINATION = 5 # Better statistical significance
Test your benchmark setup with free OpenRouter models before running expensive models:
BENCHMARK_MODELS = [
    BAMLModel.OPENROUTER_DEVSTRAL,
    BAMLModel.OPENROUTER_MIMO_V2_FLASH,
]
Track costs during benchmarks:
Set spending limits in provider dashboards
Use cost estimation before running
Monitor usage during execution
See Cost Management for details.
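A rough pre-run estimate multiplies the combination count by games per combination and an assumed per-game token budget (the token and price figures below are placeholder assumptions, not measured values):

```python
from math import perm

n_models = 8
games_per_combination = 2
combinations = perm(n_models, 4)  # 1680 ordered role assignments
total_games = combinations * games_per_combination

# Placeholder assumptions: tune to your models and provider pricing
tokens_per_game = 20_000
cost_per_million_tokens = 0.0  # free OpenRouter tier

est_cost = total_games * tokens_per_game / 1_000_000 * cost_per_million_tokens
print(total_games, f"${est_cost:.2f}")  # 3360 $0.00
```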
Check for temperature restrictions
Some models require specific temperature settings:

# The benchmark validates models automatically
runner._validate_models()  # Warns about restrictions

# o-series models require temperature=1.0
BAMLModel.O1, BAMLModel.O3, BAMLModel.O4_MINI
Interpreting Results
Key metrics to evaluate:
Win rate > 60%: Strong performer
Hint success > 70%: Effective spymaster
Guess accuracy > 60%: Reliable operative
First guess accuracy > 50%: Good hint interpretation
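These thresholds can be applied mechanically to a model's derived rates. A small sketch (the threshold values come from the list above; the function name is illustrative):

```python
THRESHOLDS = {
    'win_rate': 0.60,              # strong performer
    'hint_success': 0.70,          # effective spymaster
    'guess_accuracy': 0.60,        # reliable operative
    'first_guess_accuracy': 0.50,  # good hint interpretation
}

def flag_strengths(rates: dict) -> list:
    """Return the metric names where a model clears its threshold."""
    return [name for name, bar in THRESHOLDS.items()
            if rates.get(name, 0.0) > bar]

print(flag_strengths({'win_rate': 0.73, 'hint_success': 0.68,
                      'guess_accuracy': 0.65}))
# ['win_rate', 'guess_accuracy']
```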
Team Synergies
Some model combinations work better together:
Top combination: GPT-5 Mini + GPT-5 Mini vs Claude Sonnet 4.5 + Claude Sonnet 4.5
Blue Win Rate: 73.2% (41/56 games)
Avg Turns: 12.4
Look for patterns in successful pairings.
Troubleshooting
Rate limits : If you hit rate limits, the benchmark will retry with exponential backoff. Consider:
Reducing GAMES_PER_COMBINATION
Using fewer models
Spreading runs across multiple days
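The retry behavior described above follows the standard exponential-backoff pattern. A generic sketch of the idea (not the benchmark's actual implementation):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on failure, doubling the wait each attempt with jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # Wait base * 2^attempt, scaled by up to 2x random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```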
Long runtime : Large benchmarks take time. Example:
8 models × 2 games = ~6-8 hours
Use VERBOSE=True to monitor progress
Results are saved incrementally
Next Steps
Model Selection Choose the best models for your use case
Cost Management Estimate and optimize benchmark costs