Overview
The Codenames AI Benchmark includes a sophisticated analysis pipeline for measuring agent performance across multiple dimensions. All metrics are computed from game result JSON files.
Analysis Pipeline
The analysis system is located in the analysis/ directory:
Core Data Models
The analysis system uses structured data models defined in analysis/models.py:
analysis/models.py
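As a rough illustration, the result models might look like the following sketch. The field names here (hint_word, correct_guesses, winner, and so on) are assumptions about the schema, not the actual definitions in analysis/models.py:

```python
# Hypothetical sketch of the structures analysis/models.py could define;
# all field names are illustrative assumptions, not the real schema.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    hint_word: str
    hint_count: int
    guesses: list          # words guessed this turn
    correct_guesses: int   # guesses that hit the hinting team's words

@dataclass
class GameResult:
    game_id: str
    winner: str                    # "red" or "blue"
    loss_reason: str | None = None # e.g. "bomb_hit"
    turns: list = field(default_factory=list)

game = GameResult(game_id="g1", winner="red")
game.turns.append(TurnRecord("ocean", 2, ["wave", "shark"], 2))
print(game.winner, len(game.turns))
```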
Win Rate Metrics
Track overall success rates by team, role, and combination:
analysis/metrics/win_rates.py
Example Usage
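A minimal sketch of a win-rate computation over game result dicts; the "winner" key is an assumption about the result JSON schema:

```python
# Compute per-team win rates from a list of game result dicts.
# The "winner" key is an assumed schema field, not confirmed by the source.
from collections import Counter

def win_rates(games):
    wins, totals = Counter(), Counter()
    for g in games:
        for team in ("red", "blue"):
            totals[team] += 1
        wins[g["winner"]] += 1
    return {team: wins[team] / totals[team] for team in totals}

games = [{"winner": "red"}, {"winner": "red"},
         {"winner": "blue"}, {"winner": "red"}]
print(win_rates(games))  # {'red': 0.75, 'blue': 0.25}
```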
Efficiency Metrics
Measure how effectively agents win games:
analysis/metrics/efficiency.py
- High efficiency = Wins quickly with fewer turns
- Low efficiency = Takes many turns to win (or loses)
- Useful for comparing aggressive vs. conservative strategies
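One simple way to express this is average turns per win, sketched below; the "winner" and "num_turns" keys are assumed field names, not the actual schema:

```python
# Sketch of a turns-to-win efficiency metric: fewer turns = higher efficiency.
# "winner" and "num_turns" are illustrative keys, not confirmed field names.
def avg_turns_to_win(games, team):
    turn_counts = [g["num_turns"] for g in games if g["winner"] == team]
    return sum(turn_counts) / len(turn_counts) if turn_counts else None

games = [
    {"winner": "red", "num_turns": 5},
    {"winner": "red", "num_turns": 7},
    {"winner": "blue", "num_turns": 9},
]
print(avg_turns_to_win(games, "red"))  # 6.0
```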
Hint Quality Metrics
Analyze hint patterns and effectiveness:
analysis/metrics/hints.py
Key Hint Metrics
Success Rate
Percentage of hints that lead to at least one correct guess
Perfect Hint Rate
Percentage of hints where all guesses are correct
Creativity Ratio
Unique hints / total hints - measures hint diversity
Efficiency
Average correct guesses per hint count - measures precision
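The four metrics above can be sketched from per-hint records as follows; the record keys ("word", "count", "correct") are assumptions about the data, not the actual fields in analysis/metrics/hints.py:

```python
# Sketch of the four hint-quality metrics defined above.
# Record keys ("word", "count", "correct") are assumed field names.
def hint_metrics(hints):
    n = len(hints)
    return {
        "success_rate": sum(h["correct"] > 0 for h in hints) / n,
        "perfect_rate": sum(h["correct"] == h["count"] for h in hints) / n,
        "creativity_ratio": len({h["word"] for h in hints}) / n,
        "efficiency": sum(h["correct"] for h in hints)
                      / sum(h["count"] for h in hints),
    }

hints = [
    {"word": "ocean", "count": 2, "correct": 2},  # perfect hint
    {"word": "metal", "count": 3, "correct": 1},  # partial success
    {"word": "ocean", "count": 1, "correct": 0},  # failure, repeated word
]
m = hint_metrics(hints)
print(round(m["success_rate"], 2), m["efficiency"])  # 0.67 0.5
```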
Elo Rating System
Compute skill ratings using the Elo algorithm:
analysis/metrics/elo.py
- All models start at initial_rating (default 1500)
- After each game, winners gain points, losers lose points
- Points transferred depend on rating difference (upsets = bigger swings)
- Ratings stabilize after ~30-50 games per model
- 1500: Average/baseline
- 1600+: Strong performer
- 1400-: Weak performer
- 100 point difference ≈ 64% expected win rate
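The update rule described above can be sketched as follows (standard Elo with a K-factor of 32 and the default initial rating of 1500; the parameter names are assumptions, not the actual API of analysis/metrics/elo.py):

```python
# Sketch of the standard Elo update: expected score from the rating gap,
# then transfer points proportional to (actual - expected).
def expected_score(r_a, r_b):
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32, initial=1500):
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    e_w = expected_score(r_w, r_l)      # winner's expected score
    ratings[winner] = r_w + k * (1 - e_w)
    ratings[loser] = r_l - k * (1 - e_w)

ratings = {}
update(ratings, "model-a", "model-b")   # equal ratings: 16 points move
print(round(ratings["model-a"]), round(ratings["model-b"]))  # 1516 1484

# A 100-point gap gives the favorite ~64% expected win rate:
print(round(expected_score(1600, 1500), 2))  # 0.64
```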
Error Analysis
Identify failure patterns and problem areas:
analysis/metrics/errors.py
Error Categories
- Bomb Hits: Guessed the bomb word (instant loss)
- Invalid Offboard: Guessed word not on board
- Invalid Revealed: Guessed already-revealed word
- Wrong Guesses: Valid but incorrect guesses (opponent/neutral)
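A guess can be sorted into these categories with a check like the sketch below; the board representation (word-to-owner mapping) and revealed set are assumed structures, not the actual ones in analysis/metrics/errors.py:

```python
# Sketch classifying a single guess into the error categories above.
# board maps word -> owner ("red", "blue", "neutral", "bomb"); the
# representation is an illustrative assumption.
def classify_guess(word, board, revealed, team):
    if word not in board:
        return "invalid_offboard"   # word not on the board
    if word in revealed:
        return "invalid_revealed"   # word already revealed
    owner = board[word]
    if owner == "bomb":
        return "bomb_hit"           # instant loss
    if owner != team:
        return "wrong_guess"        # opponent or neutral word
    return "correct"

board = {"wave": "red", "shark": "blue", "desk": "neutral", "virus": "bomb"}
print(classify_guess("virus", board, set(), "red"))  # bomb_hit
print(classify_guess("piano", board, set(), "red"))  # invalid_offboard
print(classify_guess("desk", board, set(), "red"))   # wrong_guess
```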
Running Full Analysis
Generate comprehensive reports:
run_analysis.py
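In outline, a driver like run_analysis.py loads every result JSON from a directory and feeds the games to each metric module. A minimal sketch of the loading step, with illustrative file layout (the directory structure and helper name are assumptions):

```python
# Sketch of the loading step a full-analysis driver might perform:
# read every *.json game result in a directory into a list of dicts.
import json
import pathlib
import tempfile

def load_results(results_dir):
    paths = sorted(pathlib.Path(results_dir).glob("*.json"))
    return [json.loads(p.read_text()) for p in paths]

# Demonstrate with a throwaway directory containing one fake result.
with tempfile.TemporaryDirectory() as d:
    sample = {"winner": "red", "num_turns": 6}
    (pathlib.Path(d) / "game_001.json").write_text(json.dumps(sample))
    games = load_results(d)
print(len(games), games[0]["winner"])  # 1 red
```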
Custom Metrics
Create your own analysis functions:
custom_metrics.py
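A custom metric is just a function over the loaded game results. As an example of the shape such a function might take, here is a hypothetical "clean win rate" (wins earned by revealing all words rather than by an opponent bomb hit); the "win_reason" key is an assumed schema field:

```python
# Hypothetical custom metric: fraction of a team's wins earned by
# revealing all its words, not by the opponent hitting the bomb.
# "win_reason" is an assumed field, not a confirmed part of the schema.
def clean_win_rate(games, team):
    team_wins = [g for g in games if g["winner"] == team]
    if not team_wins:
        return 0.0
    clean = sum(g.get("win_reason") == "all_words_revealed" for g in team_wins)
    return clean / len(team_wins)

games = [
    {"winner": "red", "win_reason": "all_words_revealed"},
    {"winner": "red", "win_reason": "opponent_bomb"},
    {"winner": "blue", "win_reason": "all_words_revealed"},
]
print(clean_win_rate(games, "red"))  # 0.5
```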
Visualization Examples
Create charts from metrics:
visualize.py
Metric Interpretation Guide
Win Rate vs Elo Rating
Win Rate: Simple percentage of games won. Easy to understand, but doesn’t account for opponent strength.
Elo Rating: Adjusts for opponent quality. A model with a 60% win rate against strong opponents may have a higher Elo than a model with a 70% win rate against weak opponents.
Use Elo for rankings and win rate for absolute performance measurement.
Efficiency Metrics
- High efficiency + high win rate: Dominant, wins quickly
- High efficiency + low win rate: Wins quickly when winning, but loses often (aggressive strategy)
- Low efficiency + high win rate: Grinds out wins in long games (conservative strategy)
- Low efficiency + low win rate: Struggling, loses in long games
Hint Quality Metrics
- Perfect hint rate close to success rate: Hints are either perfect or total failures (risky strategy)
- High success rate, low efficiency: Gives conservative hints that work but don’t maximize coverage
- High creativity ratio: Uses diverse vocabulary, less repetitive
- Low creativity ratio: Falls back on safe, proven hints
Performance Benchmarking
Compare models systematically.
Next Steps
Custom Agents
Build agents informed by metric insights
Prompt Engineering
Optimize prompts based on performance data