Overview
The benchmark suite tests model combinations by running multiple games and collecting performance metrics. Each model plays at most one role per game to ensure fair evaluation.
Quick Start
Run the benchmark
The script will ask for confirmation before starting.
Wait for completion
The benchmark runs multiple games testing different model combinations. Progress is shown in real-time.
Analyze results
python analyze_benchmark_results.py benchmark_results/<result_file>.json
This generates insights, statistics, and visualizations.
Configuring Benchmarks
Select Models to Test
Edit model_config.py to choose which models are benchmarked:
def get_benchmark_models() -> list:
    """Get the list of models for benchmarking."""
    return [
        BAMLModel.OPENROUTER_DEVSTRAL,
        BAMLModel.OPENROUTER_MIMO_V2_FLASH,
        BAMLModel.OPENROUTER_NEMOTRON_NANO,
        BAMLModel.OPENROUTER_DEEPSEEK_R1T_CHIMERA,
        BAMLModel.OPENROUTER_DEEPSEEK_R1T2_CHIMERA,
        BAMLModel.OPENROUTER_GLM_45_AIR,
        BAMLModel.OPENROUTER_LLAMA_33_70B,
        BAMLModel.OPENROUTER_OLMO3_32B,
    ]
The default configuration uses free OpenRouter models for zero-cost experimentation.
Adjust Games Per Combination
Modify GAMES_PER_COMBINATION in benchmark.py:
# Benchmark settings
GAMES_PER_COMBINATION = 2 # Increase for more statistical significance
OUTPUT_DIR = "benchmark_results"
VERBOSE = True
How Combinations Work
The benchmark generates combinations where each model plays at most one role per game:
# Each model can be:
# - Blue team hint giver
# - Blue team guesser
# - Red team hint giver
# - Red team guesser
# Constraints ensure fair evaluation:
for blue_hint in BENCHMARK_MODELS:
    for blue_guess in BENCHMARK_MODELS:
        if blue_guess != blue_hint:  # Blue team uses different models
            for red_hint in BENCHMARK_MODELS:
                if red_hint not in [blue_hint, blue_guess]:  # No overlap with blue
                    for red_guess in BENCHMARK_MODELS:
                        if red_guess != red_hint and \
                           red_guess not in [blue_hint, blue_guess]:  # All unique
                            # Valid combination!
                            test_combinations.append(...)
With N models filling 4 distinct roles, this produces N × (N-1) × (N-2) × (N-3) combinations. For example, 8 models yield 8 × 7 × 6 × 5 = 1,680 combinations.
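The count above is just the number of ordered 4-tuples of distinct models, so it can be sanity-checked with the standard library (a standalone check, not part of benchmark.py):

```python
from itertools import permutations

# Ordered assignments of 4 distinct models to the 4 roles:
# (blue_hint, blue_guess, red_hint, red_guess)
models = range(8)  # stand-in for 8 BAMLModel entries
combinations = list(permutations(models, 4))

print(len(combinations))  # 8 * 7 * 6 * 5 = 1680
```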
Metrics Tracked
The benchmark collects comprehensive statistics:
Per-Model Metrics
Win Rate: Percentage of games won when playing for a specific team
Hint Success Rate: How often hints lead to at least one correct guess
Guess Accuracy: Percentage of guesses that were correct
First Guess Accuracy: Success rate of the first guess each turn
Per-Combination Metrics
Games played: Total games for this combination
Win rates: Blue wins, red wins, draws
Average turns: Turns needed to complete games
Error rate: Frequency of errors or failures
Advanced Statistics
# Per model, per role, per team
metrics = {
    'games_played': 0,      # Total games in this role
    'wins': 0,              # Games won
    'turns_played': 0,      # Total turns taken
    'hints_given': 0,       # Total hints (hint givers only)
    'successful_hints': 0,  # Hints with ≥1 correct guess
    'guesses_made': 0,      # Total guesses attempted
    'correct_guesses': 0,   # Correct guesses
    'wrong_guesses': 0,     # Incorrect guesses
    'bomb_hits': 0,         # Bomb hits (instant loss)
    'invalid_offboard': 0,  # Guesses not on board
    'invalid_revealed': 0,  # Already revealed words
}
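The headline rates in the earlier sections follow directly from these raw counters. A small sketch (the helper name is illustrative, not from the codebase):

```python
def derive_rates(m: dict) -> dict:
    """Compute headline rates from the raw counters, guarding against division by zero."""
    def ratio(num, den):
        return num / den if den else 0.0
    return {
        'win_rate': ratio(m['wins'], m['games_played']),
        'hint_success_rate': ratio(m['successful_hints'], m['hints_given']),
        'guess_accuracy': ratio(m['correct_guesses'], m['guesses_made']),
    }

# Example counters for one model/role/team slot
metrics = {'games_played': 10, 'wins': 6,
           'hints_given': 40, 'successful_hints': 30,
           'guesses_made': 80, 'correct_guesses': 52}
print(derive_rates(metrics))
# {'win_rate': 0.6, 'hint_success_rate': 0.75, 'guess_accuracy': 0.65}
```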
Benchmark Results
File Structure
benchmark_results/
├── benchmark_20260305_143022.json # Raw benchmark data
└── analysis_plots/ # Generated visualizations
├── model_win_rates.png
├── hint_success_rates.png
├── guess_accuracy.png
└── turn_efficiency.png
The JSON file contains:
{
  "benchmark_id": "benchmark_20260305_143022",
  "timestamp": "2026-03-05T14:30:22.123456",
  "total_combinations": 1680,
  "total_games": 3360,
  "model_performance": {
    "OpenRouterDevstral_hint_giver_blue": {
      "model": "OpenRouterDevstral",
      "role": "hint_giver",
      "team": "blue",
      "games_played": 420,
      "wins": 245,
      "hints_given": 1834,
      "successful_hints": 1567
    }
  },
  "team_combinations": { ... },
  "games": [ ... ]
}
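Because the file is plain JSON, it can be consumed with the standard json module. For example, computing each entry's hint success rate (the inline string below mirrors the schema excerpt above):

```python
import json

raw = '''{
  "model_performance": {
    "OpenRouterDevstral_hint_giver_blue": {
      "model": "OpenRouterDevstral",
      "games_played": 420, "wins": 245,
      "hints_given": 1834, "successful_hints": 1567
    }
  }
}'''

data = json.loads(raw)
for key, perf in data["model_performance"].items():
    rate = perf["successful_hints"] / perf["hints_given"]
    print(f"{key}: {rate:.1%} hint success")
```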
Analysis Pipeline
The analysis script processes results through a modular pipeline:
Load data
Read the benchmark JSON file
Calculate metrics
Compute win rates, accuracy, efficiency
Generate insights
Identify patterns and top performers
Create visualizations
Generate charts and plots
Write report
Save markdown report with findings
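The five stages above can be sketched as plain functions chained in order (function names and internals are illustrative; the real script may differ, and the visualization stage is omitted here for brevity):

```python
import json

def load_data(path):
    """Stage 1: read the benchmark JSON file."""
    with open(path) as f:
        return json.load(f)

def calculate_metrics(data):
    """Stage 2: compute win rates per model/role/team key."""
    rates = {}
    for key, m in data["model_performance"].items():
        if m["games_played"]:
            rates[key] = m["wins"] / m["games_played"]
    return rates

def generate_insights(rates):
    """Stage 3: rank entries to surface top performers."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

def write_report(insights):
    """Stage 5: render findings as markdown."""
    lines = [f"- {key}: {rate:.1%} win rate" for key, rate in insights]
    return "# Benchmark Findings\n" + "\n".join(lines)
```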
Running Analysis
python analyze_benchmark_results.py benchmark_results/benchmark_20260305_143022.json
Generates the full analysis with visualizations.

python analyze_benchmark_results.py benchmark_results/benchmark_20260305_143022.json --no-plots
Generates the insights report without creating plot files.
Programmatic Benchmarking
Run benchmarks programmatically for custom workflows:
from benchmark import BenchmarkRunner
from agents.llm import BAMLModel

# Create runner
runner = BenchmarkRunner(
    games_per_combination=3,
    verbose=True
)

# Run benchmark
result = runner.run()

# Save results
main_file = result.save("my_benchmark_results")
print(f"Results saved to: {main_file}")

# Access data
for model_key, metrics in result.model_performance.items():
    if metrics['games_played'] > 0:
        win_rate = metrics['wins'] / metrics['games_played']
        print(f"{model_key}: {win_rate:.1%} win rate")
Best Practices
Begin with 1-2 games per combination to verify setup: GAMES_PER_COMBINATION = 1 # Quick validation run
Then increase for production benchmarks: GAMES_PER_COMBINATION = 5 # Better statistical significance
Test your benchmark setup with free OpenRouter models before running expensive models:
BENCHMARK_MODELS = [
    BAMLModel.OPENROUTER_DEVSTRAL,
    BAMLModel.OPENROUTER_MIMO_V2_FLASH,
]
Track costs during benchmarks:
Set spending limits in provider dashboards
Use cost estimation before running
Monitor usage during execution
See Cost Management for details.
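A rough pre-run estimate multiplies the combination count by games per combination and an assumed per-game token budget (the token and price figures below are placeholder assumptions, not measured values):

```python
from math import perm

n_models = 8
games_per_combination = 2
combinations = perm(n_models, 4)  # 1680 ordered role assignments
total_games = combinations * games_per_combination

# Placeholder assumptions: tune to your models and provider pricing
tokens_per_game = 20_000
cost_per_million_tokens = 0.0  # free OpenRouter tier

est_cost = total_games * tokens_per_game / 1_000_000 * cost_per_million_tokens
print(total_games, f"${est_cost:.2f}")  # 3360 $0.00
```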
Check for temperature restrictions
Some models require specific temperature settings:

# The benchmark validates models automatically
runner._validate_models()  # Warns about restrictions

# o-series models require temperature=1.0
BAMLModel.O1, BAMLModel.O3, BAMLModel.O4_MINI
Interpreting Results
Key metrics to evaluate:
Win rate > 60%: Strong performer
Hint success > 70%: Effective spymaster
Guess accuracy > 60%: Reliable operative
First guess accuracy > 50%: Good hint interpretation
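These thresholds can be applied mechanically to a model's derived rates. A small sketch (the threshold values come from the list above; the function name is illustrative):

```python
THRESHOLDS = {
    'win_rate': 0.60,              # strong performer
    'hint_success': 0.70,          # effective spymaster
    'guess_accuracy': 0.60,        # reliable operative
    'first_guess_accuracy': 0.50,  # good hint interpretation
}

def flag_strengths(rates: dict) -> list:
    """Return the metric names where a model clears its threshold."""
    return [name for name, bar in THRESHOLDS.items()
            if rates.get(name, 0.0) > bar]

print(flag_strengths({'win_rate': 0.73, 'hint_success': 0.68,
                      'guess_accuracy': 0.65}))
# ['win_rate', 'guess_accuracy']
```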
Team Synergies
Some model combinations work better together:
Top combination: GPT-5 Mini + GPT-5 Mini vs Claude Sonnet 4.5 + Claude Sonnet 4.5
Blue Win Rate: 73.2% (41/56 games)
Avg Turns: 12.4
Look for patterns in successful pairings.
Troubleshooting
Rate limits : If you hit rate limits, the benchmark will retry with exponential backoff. Consider:
Reducing GAMES_PER_COMBINATION
Using fewer models
Spreading runs across multiple days
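The retry behavior described above follows the standard exponential-backoff pattern. A generic sketch of the idea (not the benchmark's actual implementation):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on failure, doubling the wait each attempt with jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # Wait base * 2^attempt, scaled by up to 2x random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```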
Long runtime : Large benchmarks take time. Example:
8 models × 2 games = ~6-8 hours
Use VERBOSE=True to monitor progress
Results are saved incrementally
Next Steps
Model Selection Choose the best models for your use case
Cost Management Estimate and optimize benchmark costs