Overview

The analysis pipeline provides a unified interface for running all benchmark metrics and analysis functions. It processes game results and generates comprehensive statistics including win rates, Elo ratings, efficiency metrics, error patterns, and more.

Classes

AnalysisResult

Complete analysis results containing all computed metrics and statistics.
All attributes are required.

benchmark_id (str): Identifier for the benchmark being analyzed
combo_df (pd.DataFrame): Team combination statistics with win rates and game counts
role_perf_df (pd.DataFrame): Model performance aggregated by role (hint giver/guesser)
model_perf_df (pd.DataFrame): Individual model performance metrics
elo_df (pd.DataFrame): Elo ratings for each model by role
ci_df (pd.DataFrame): Wilson confidence intervals for win rates
fma (dict): First mover advantage statistics
momentum_df (pd.DataFrame): Game momentum and competitiveness metrics
momentum_summary (dict): Aggregated momentum statistics
top_hint (pd.DataFrame): Best performing hint givers
top_guess (pd.DataFrame): Best performing guessers
dominant (pd.DataFrame): Dominant team combinations
synergies (pd.DataFrame): Model synergy analysis (best pairings)
hint_efficiency (pd.DataFrame): Hint giver efficiency metrics
guesser_perf (pd.DataFrame): Detailed guesser performance metrics
role_versatility (pd.DataFrame): Model versatility across roles
error_summary (pd.DataFrame): Error counts by model and type
error_patterns (dict): Detailed error pattern analysis
hint_patterns (dict): Hint word patterns and statistics
game_efficiency (pd.DataFrame): Game efficiency by team combination
efficiency_by_model (pd.DataFrame): Efficiency aggregated by model
matchup_matrix_hg (tuple[pd.DataFrame, pd.DataFrame]): Head-to-head matchup matrix for hint givers (win rates, game counts)
matchup_matrix_g (tuple[pd.DataFrame, pd.DataFrame]): Head-to-head matchup matrix for guessers (win rates, game counts)

Functions

run_pipeline()

Run the complete analysis pipeline on benchmark results.
def run_pipeline(results_path: Path) -> AnalysisResult
results_path (Path, required): Path to benchmark results directory containing game snapshots
Returns: AnalysisResult containing all computed metrics and statistics.

Usage Example

```python
from pathlib import Path
from analysis.pipeline import run_pipeline
from analysis.viz import create_visualizations

# Run full analysis pipeline
results_path = Path("benchmark_results/my_benchmark")
result = run_pipeline(results_path)

# Access specific metrics
print(f"Benchmark ID: {result.benchmark_id}")
print("\nTop 5 Hint Givers:")
print(result.top_hint.head())

print("\nTop 5 Guessers:")
print(result.top_guess.head())

print("\nElo Ratings:")
print(result.elo_df.head(10))

print("\nFirst Mover Advantage:")
print(f"  Blue win rate: {result.fma['overall_blue_win_rate']:.1%}")
print(f"  Blue advantage: {result.fma['blue_advantage']:.1%}")

# Generate visualizations
output_dir = Path("analysis_output")
create_visualizations(result, output_dir)
```

Pipeline Stages

The pipeline executes the following analysis stages:

1. Data Loading

Loads and parses benchmark results from disk.

2. Win Rate Analysis

  • Team combination statistics
  • Role-based performance
  • First mover advantage
  • Best hint givers and guessers
  • Dominant combinations
  • Model synergies
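
The win-rate statistics above reduce to a pandas group-by over the game log. As a sketch, the team-combination piece could look like this; the game-log column names (`blue_hint_giver`, `blue_guesser`, `winner`) are illustrative assumptions, not the pipeline's actual schema:

```python
import pandas as pd

# Minimal game log; column names here are assumptions for illustration.
games = pd.DataFrame({
    "blue_hint_giver": ["gpt-4o-mini", "gpt-4o-mini", "claude-haiku-3.5"],
    "blue_guesser": ["claude-haiku-3.5", "claude-haiku-3.5", "gpt-4o-mini"],
    "winner": ["blue", "red", "blue"],
})

# Aggregate per team combination: games played, wins, win rate.
combo_df = (
    games.assign(blue_win=games["winner"].eq("blue"))
    .groupby(["blue_hint_giver", "blue_guesser"], as_index=False)
    .agg(games_played=("blue_win", "size"), wins=("blue_win", "sum"))
)
combo_df["win_rate"] = combo_df["wins"] / combo_df["games_played"]
```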

3. Elo Rating Computation

Calculates Elo ratings for each model in hint giver and guesser roles.
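
A minimal sketch of the standard Elo update applied per game; the K-factor of 32 is an assumed constant, not necessarily the pipeline's actual setting:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one standard Elo update after a game (k=32 is an assumption)."""
    # Winner's expected score under the logistic Elo model.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta
```

Running this over every game, once for the hint-giver pairing and once for the guesser pairing, yields per-role ratings like those in `elo_df`.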

4. Confidence Intervals

Computes Wilson score confidence intervals for win rates.
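
The Wilson score interval has a closed form; a self-contained sketch (95% level via z = 1.96):

```python
import math

def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial win rate (95% by default)."""
    if games == 0:
        return (0.0, 0.0)
    p = wins / games
    denom = 1.0 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2))
    return (center - margin, center + margin)
```

Unlike the naive normal approximation, the interval stays inside [0, 1] and behaves sensibly at small game counts, which is why it suits benchmark win rates.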

5. Momentum Analysis

  • Game momentum tracking
  • Lead changes
  • Comeback statistics
  • Competitiveness metrics
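
As a sketch of the lead-change piece, a hypothetical helper counting sign flips in a per-turn score differential (blue minus red); the convention that ties do not reset the lead is an assumption:

```python
def count_lead_changes(score_diffs: list[int]) -> int:
    """Count lead changes in a per-turn score differential (blue minus red).

    Hypothetical helper: a tie (diff == 0) keeps the previous leader.
    """
    changes = 0
    prev = 0
    for d in score_diffs:
        if d != 0:
            if prev != 0 and (d > 0) != (prev > 0):
                changes += 1
            prev = d
    return changes
```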

6. Role-Specific Metrics

  • Hint efficiency (correct guesses per hint)
  • Guesser performance (accuracy, bomb rate)
  • Role versatility scores
  • Head-to-head matchup matrices
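
Hint efficiency (correct guesses per hint) is again a group-by over per-turn records. A sketch, with an assumed per-hint log format:

```python
import pandas as pd

# One row per hint given; column names are illustrative assumptions.
turns = pd.DataFrame({
    "hint_giver": ["gpt-4o-mini", "gpt-4o-mini", "claude-haiku-3.5"],
    "correct_guesses": [2, 1, 3],
})

# Per hint giver: number of hints and total correct guesses they produced.
hint_efficiency = (
    turns.groupby("hint_giver", as_index=False)
    .agg(hints_given=("correct_guesses", "size"),
         correct_guesses=("correct_guesses", "sum"))
)
hint_efficiency["guesses_per_hint"] = (
    hint_efficiency["correct_guesses"] / hint_efficiency["hints_given"]
)
```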

7. Error Analysis

  • Error counts by model and type
  • Bomb hit contexts
  • Invalid guess patterns
  • Wrong guess color distribution

8. Hint Pattern Analysis

  • Hint word frequency
  • Hint count distribution
  • Success rates by hint count
  • Hint creativity ratio
  • Average efficiency

9. Efficiency Metrics

  • Average turns per game
  • Wins per turn
  • Efficiency by model
  • Turn-to-win ratios
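
A sketch of how the per-team efficiency numbers could be aggregated; the input columns and the wins-per-turn definition (wins divided by total turns) are assumptions:

```python
import pandas as pd

# One row per game; column names are illustrative assumptions.
games = pd.DataFrame({
    "team": ["gpt-4o-mini + claude-haiku-3.5"] * 3,
    "turns": [8, 10, 6],
    "won": [True, False, True],
})

game_efficiency = (
    games.groupby("team", as_index=False)
    .agg(games_played=("turns", "size"),
         total_turns=("turns", "sum"),
         wins=("won", "sum"))
)
game_efficiency["avg_turns"] = (
    game_efficiency["total_turns"] / game_efficiency["games_played"]
)
game_efficiency["wins_per_turn"] = (
    game_efficiency["wins"] / game_efficiency["total_turns"]
)
```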

Performance Considerations

The pipeline processes all games in memory. For very large benchmarks (10,000+ games), consider:
  • Processing in batches
  • Using chunked analysis
  • Increasing available memory
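
One way to batch is to split the snapshot files before analysis and merge the per-batch statistics afterwards. A sketch; the flat `*.json` snapshot layout is an assumption about the results directory:

```python
from itertools import islice
from pathlib import Path

def iter_snapshot_batches(results_path: Path, batch_size: int = 1000):
    """Yield snapshot files in fixed-size batches (assumes a flat *.json layout)."""
    files = iter(sorted(results_path.glob("*.json")))
    # islice drains the iterator batch_size files at a time; the final
    # batch may be smaller.
    while batch := list(islice(files, batch_size)):
        yield batch
```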

Output Format

All DataFrames use consistent column naming:
  • model: Model identifier
  • role: "hint_giver" or "guesser"
  • team: "blue" or "red"
  • games_played: Number of games
  • wins: Number of wins
  • win_rate: Win percentage (0-1)

Example Output Structure

```python
# Top hint givers
result.top_hint
#               model             role  avg_win_rate  games_played  total_wins
# 0       gpt-4o-mini  Blue Hint Giver         0.652            23          15
# 1  claude-haiku-3.5  Blue Hint Giver         0.621            29          18

# Elo ratings
result.elo_df
#               model  elo_hint_giver  elo_guesser  elo_combined  elo_best_role
# 0       gpt-4o-mini            1547         1523          1535     hint_giver
# 1  claude-haiku-3.5            1538         1512          1525     hint_giver

# Error summary
result.error_summary
#               model  bomb_hits  invalid_offboard  invalid_revealed  total_errors
# 0       gpt-4o-mini          2                 1                 0             3
```

Visualization Integration

The AnalysisResult is designed to work seamlessly with the visualization module:
```python
from analysis.viz import create_visualizations

# Generate all visualizations
create_visualizations(result, output_dir)
```
See Visualization for details on available charts.
