
Quick Start

Get up and running with the Codenames AI Benchmark in minutes. This guide will have you watching AI models compete in a strategic word game.
Time to complete: ~5 minutes
What you'll do: Install dependencies, set up one API key, and run a demo game

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • At least one API key from: OpenAI, Anthropic, Google, xAI, DeepSeek, or OpenRouter
Recommended for testing: Get a free OpenRouter API key to access multiple models without cost.
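Before installing anything, you can confirm your interpreter meets the Python requirement with a quick stdlib-only check (this snippet is illustrative, not part of the repository):

```python
# Confirm the interpreter satisfies the Python 3.8+ requirement.
import sys

if sys.version_info < (3, 8):
    raise SystemExit(f"Python 3.8+ required, found {sys.version.split()[0]}")
print("Python version OK:", sys.version.split()[0])
```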

Installation & Setup

Step 1: Clone the repository

git clone https://github.com/your-org/code-names-benchmark.git
cd code-names-benchmark
Step 2: Install dependencies

pip install -r requirements.txt
This installs:
  • baml-py==0.211.2 - Structured LLM outputs
  • openai>=1.0.0 - OpenAI, xAI, and DeepSeek client
  • anthropic>=0.18.0 - Claude models
  • google-generativeai>=0.3.0 - Gemini models
  • python-dotenv>=1.0.0 - Environment variable loading
  • Analysis tools: pandas, numpy, matplotlib, seaborn, scipy
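After installing, a quick stdlib-only check can confirm the key packages resolved correctly (the names below are import names, which can differ from the pip package names, e.g. `baml-py` imports as `baml_py`):

```python
# Verify that core dependencies are importable after pip install.
import importlib.util

REQUIRED = ["baml_py", "openai", "anthropic", "dotenv", "pandas", "numpy"]

missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]
if missing:
    print("Missing:", ", ".join(missing), "- re-run: pip install -r requirements.txt")
else:
    print("All core dependencies found.")
```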
Step 3: Configure API keys

Copy the example environment file and add your API key:
cp .env.example .env
Edit .env and add at least one API key:
# Choose at least one provider:

# OpenRouter (free models available - recommended for testing)
OPENROUTER_API_KEY=your_openrouter_key_here

# OpenAI (GPT-5, GPT-4.1, o-series, GPT-4o)
OPENAI_API_KEY=your_openai_key_here

# Anthropic (Claude Sonnet 4.5, Haiku 4.5, Opus 4.1)
ANTHROPIC_API_KEY=your_anthropic_key_here

# Google (Gemini 2.5, Gemini 2.0)
GOOGLE_API_KEY=your_google_key_here

# xAI (Grok 4, Grok 3)
XAI_API_KEY=your_xai_key_here

# DeepSeek (DeepSeek V3.2)
DEEPSEEK_API_KEY=your_deepseek_key_here
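The .env format is plain KEY=VALUE lines. The project loads it with python-dotenv, but an illustrative stdlib-only reader (mine, not the project's) makes the expected format explicit — note there must be no spaces around the = sign:

```python
# Illustrative stdlib-only .env reader (the project itself uses python-dotenv).
import os

def load_env_file(path=".env"):
    loaded = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key] = value
            os.environ.setdefault(key, value)  # don't clobber existing vars
    return loaded
```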

Supported providers:
  • OpenRouter - many free models, best for testing
  • OpenAI - GPT-5, GPT-4.1, reasoning models
  • Anthropic - Claude Sonnet 4.5, Haiku, Opus
  • Google - Gemini 2.5 Pro, Flash variants
  • xAI - Grok 4, Grok 3 models
  • DeepSeek - DeepSeek V3.2 Chat, Reasoner
Step 4: Run the demo game

The demo is pre-configured to use free OpenRouter models:
python3 demo_simple_game.py
You’ll see verbose output showing:
  • Game setup and board state (25 words)
  • Each hint given by spymasters
  • Each guess made by field operatives
  • Turn-by-turn results
  • Final game outcome and statistics
Games typically take 1-2 minutes to complete as AI models think through their moves.

Understanding the Output

Game Setup Phase

======================================================================
  CODENAMES: MULTI-MODEL DEMO
======================================================================

--- Configured Models ---
  Blue Team Hint Giver: Devstral
  Blue Team Guesser: MIMO V2 Flash
  Red Team Hint Giver: OLMo 3.1 32B
  Red Team Guesser: Nemotron Nano 12B

--- Checking API Keys ---
  [OK] Devstral - API key found
  [OK] MIMO V2 Flash - API key found
  [OK] OLMo 3.1 32B - API key found
  [OK] Nemotron Nano 12B - API key found

--- Board Layout (Spymaster View) ---
   1. [BLUE] APPLE
   2. [RED] TIGER
   3. [NEUTRAL] CLOUD
   ...

Turn-by-Turn Gameplay

--- Turn 1 (BLUE TEAM) ---
Hint: 'FRUIT' for 2 word(s)
Guesses:
  1. [CORRECT] APPLE (blue)
  2. [CORRECT] BANANA (blue)
  3. [WRONG] LEMON (neutral) - Turn ends

--- Turn 2 (RED TEAM) ---
Hint: 'ANIMAL' for 3 word(s)
Guesses:
  1. [CORRECT] TIGER (red)
  2. [CORRECT] LION (red)
  3. [WRONG] EAGLE (blue) - Turn ends
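The turn logic in that transcript follows standard Codenames rules: a team may guess up to one more word than the hint count, and any non-team word ends the turn immediately. A hypothetical sketch (the helper name and signature are mine, not a project API):

```python
def play_guesses(candidate_guesses, team_color, hint_count):
    """Simulate one turn: returns (correct_count, ended_by_wrong_guess)."""
    correct = 0
    for word, color in candidate_guesses[: hint_count + 1]:  # at most count+1 guesses
        if color == team_color:
            correct += 1          # correct guess: team keeps guessing
        else:
            return correct, True  # neutral or opponent word ends the turn
    return correct, False         # ran out of guesses without a miss
```

This matches Turn 1 above: a 'FRUIT' for 2 hint permits three guesses, and the third (LEMON, neutral) ends the turn.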

Final Results

======================================================================
  FINAL RESULTS
======================================================================

Game Outcome: WINNER

Winner: BLUE TEAM!

Game Statistics:
  • Total Turns: 12
  • Blue Team Words Remaining: 0
  • Red Team Words Remaining: 2

Customize Your Game

Edit the Players class in demo_simple_game.py to try different models:
demo_simple_game.py
class Players:
    # Change these to any available model!
    BLUE_HINT_GIVER = BAMLModel.GPT5_MINI
    BLUE_GUESSER = BAMLModel.GPT5_MINI
    RED_HINT_GIVER = BAMLModel.CLAUDE_SONNET_45
    RED_GUESSER = BAMLModel.CLAUDE_SONNET_45

Available Free Models (OpenRouter)

# No API costs - perfect for testing!
BAMLModel.OPENROUTER_DEVSTRAL              # Devstral (Mistral)
BAMLModel.OPENROUTER_MIMO_V2_FLASH         # MIMO V2 Flash
BAMLModel.OPENROUTER_NEMOTRON_NANO         # Nemotron Nano 12B
BAMLModel.OPENROUTER_DEEPSEEK_R1T_CHIMERA  # DeepSeek R1T Chimera
BAMLModel.OPENROUTER_GLM_45_AIR            # GLM 4.5 Air
BAMLModel.OPENROUTER_LLAMA_33_70B          # Llama 3.3 70B
BAMLModel.OPENROUTER_OLMO3_32B             # OLMo 3.1 32B

Frontier Models

# OpenAI GPT-5 Series
BAMLModel.GPT5              # GPT-5 (most capable)
BAMLModel.GPT5_MINI         # GPT-5 Mini (balanced)
BAMLModel.GPT5_NANO         # GPT-5 Nano (fast)

# Anthropic Claude 4.5
BAMLModel.CLAUDE_SONNET_45  # Claude Sonnet 4.5 (excellent performance)
BAMLModel.CLAUDE_HAIKU_45   # Claude Haiku 4.5 (fast, affordable)

# Google Gemini 2.5
BAMLModel.GEMINI_25_PRO     # Gemini 2.5 Pro (most capable)
BAMLModel.GEMINI_25_FLASH   # Gemini 2.5 Flash (fast)

# DeepSeek V3.2
BAMLModel.DEEPSEEK_CHAT     # DeepSeek V3.2 (very cost-effective)
BAMLModel.DEEPSEEK_REASONER # DeepSeek V3.2 with reasoning

Run Programmatically

Create your own game script:
custom_game.py
from utils import generate_word_list
from game import Board, Team
from agents.llm import BAMLHintGiver, BAMLGuesser, BAMLModel
from orchestrator import GameRunner

# Generate board
words = generate_word_list(25)
board = Board(words)

# Create agents - mix and match any models!
runner = GameRunner(
    board=board,
    blue_hint_giver=BAMLHintGiver(Team.BLUE, model=BAMLModel.GPT5_MINI),
    blue_guesser=BAMLGuesser(Team.BLUE, model=BAMLModel.GPT5_MINI),
    red_hint_giver=BAMLHintGiver(Team.RED, model=BAMLModel.CLAUDE_SONNET_45),
    red_guesser=BAMLGuesser(Team.RED, model=BAMLModel.CLAUDE_SONNET_45),
    verbose=True
)

# Run game
result = runner.run()
print(f"Winner: {result.winner}, Turns: {result.total_turns}")
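To compare models across more than one game, collect results in a loop and tally winners; the aggregation is plain Python (the helper below is a generic sketch, not a project API):

```python
from collections import Counter

def tally_winners(results):
    """Count wins per team from result objects (anything with .winner) or plain strings."""
    return Counter(getattr(r, "winner", r) for r in results)

# e.g. after running several games:
# standings = tally_winners([runner.run() for _ in range(10)])
```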

Troubleshooting

If you see an API key error:
  1. Make sure you copied .env.example to .env
  2. Verify your API key is valid and not expired
  3. Check that the key name matches exactly (e.g., OPENROUTER_API_KEY)
  4. Ensure there are no extra spaces around the = sign
Missing module errors? Run pip install -r requirements.txt to install all dependencies. If you're using a virtual environment, make sure it's activated before installing.
Hitting rate limits? OpenRouter free models have usage limits. Solutions:
  • Wait a few minutes and try again
  • Switch to a different provider
  • Upgrade to a paid tier for higher limits
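If free-tier limits bite often, a generic retry-with-backoff wrapper around the flaky call can smooth things over (this is a standalone sketch, not part of the project):

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
```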
Games feel slow? This is normal: LLM API calls take 2-10 seconds each, and a complete game makes 20-50+ API calls. Tips:
  • Use verbose=True to see progress
  • Try faster models (Haiku, Flash Lite, Mini variants)
  • Be patient - strategic thinking takes time!
If you see errors about BAML client code:
baml-cli generate
This regenerates the client from baml_src/ definitions.

Next Steps

  • Detailed Installation - complete setup guide with verification steps
  • Run Benchmarks - evaluate models across multiple games
  • Configure Games - customize board size, rules, and parameters
  • Edit Prompts - modify AI agent behavior with BAML
