Agents in this benchmark implement two distinct roles from the Codenames game: the HintGiver (spymaster) who sees all card colors, and the Guesser (field operative) who only sees words. This separation tests different AI capabilities and enables benchmarking of cross-model coordination.
Agent implementations are in the agents/ directory. The base interfaces are in agents/base.py, and LLM agents use BAML in agents/llm/baml_agents.py.
```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class HintResponse:
    word: str   # Single word hint (no spaces)
    count: int  # How many words it relates to (1-9)

    def validate(self) -> Tuple[bool, str]:
        # Validates hint format and constraints
        pass
```
Hint Rules:
Must be a single word (no spaces)
Cannot be any word currently on the board
Count indicates how many of your team’s words relate to the hint
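The rules above can be sketched as a standalone validator. This is a minimal illustration, not the repo's `HintResponse.validate` implementation; the function name and messages are assumptions:

```python
# Sketch of hint validation per the rules above (illustrative, not repo code).
def validate_hint(word: str, count: int, board_words: list[str]) -> tuple[bool, str]:
    if " " in word.strip() or not word.strip():
        return False, "Hint must be a single word (no spaces)"
    if word.lower() in (w.lower() for w in board_words):
        return False, "Hint cannot be any word currently on the board"
    if not 1 <= count <= 9:
        return False, "Count must be between 1 and 9"
    return True, "OK"
```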
The Guesser only sees words (not colors) and must guess based on hints.
agents/base.py
```python
from abc import ABC, abstractmethod
from typing import List

from game import Team, CardColor

class Guesser(ABC):
    def __init__(self, team: Team):
        self.team = team

    @abstractmethod
    def make_guesses(
        self,
        hint_word: str,            # The hint word
        hint_count: int,           # How many words it relates to
        board_words: List[str],    # All words on board
        revealed_words: List[str]  # Already revealed words
    ) -> List[str]:                # Words to guess (ordered)
        pass

    def process_result(self, guessed_word: str, was_correct: bool, color: CardColor):
        # Optional: receive feedback after each guess
        pass

    def reset(self):
        # Optional: clear state between games
        pass
```
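A custom guesser only needs to implement `make_guesses`. The sketch below shows a trivial (and deliberately naive) subclass; the `Team` enum and base class are re-stubbed here so the example runs standalone, and `NaiveGuesser` is a hypothetical name, not part of the repo:

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import List

# Minimal stand-ins so this sketch runs outside the repo.
class Team(Enum):
    RED = "red"
    BLUE = "blue"

class Guesser(ABC):
    def __init__(self, team: Team):
        self.team = team

    @abstractmethod
    def make_guesses(self, hint_word: str, hint_count: int,
                     board_words: List[str], revealed_words: List[str]) -> List[str]:
        ...

# Hypothetical subclass: ignores the hint and guesses unrevealed
# words in board order, up to hint_count. Illustration only.
class NaiveGuesser(Guesser):
    def make_guesses(self, hint_word, hint_count, board_words, revealed_words):
        unrevealed = [w for w in board_words if w not in revealed_words]
        return unrevealed[:hint_count]
```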
Universal LLM agents that work with any model provider using BAML for type-safe structured outputs.
agents/llm/baml_agents.py
```python
from agents.llm import BAMLHintGiver, BAMLGuesser, BAMLModel
from game import Team

# Create agents with specific models
hint_giver = BAMLHintGiver(Team.BLUE, model=BAMLModel.GPT4O_MINI)
guesser = BAMLGuesser(Team.RED, model=BAMLModel.CLAUDE_SONNET_45)
```
```python
# Claude 4.5 Series (Latest)
BAMLModel.CLAUDE_SONNET_45  # 1M context
BAMLModel.CLAUDE_HAIKU_45   # Fast & affordable

# Claude 4.x Series
BAMLModel.CLAUDE_OPUS_41    # Most capable
BAMLModel.CLAUDE_SONNET_4
BAMLModel.CLAUDE_OPUS_4

# Claude 3.x Series
BAMLModel.CLAUDE_SONNET_37
BAMLModel.CLAUDE_HAIKU_35
BAMLModel.CLAUDE_HAIKU_3
```
BAML agents use prompts defined in baml_src/main.baml. You can customize them:
baml_src/main.baml
```baml
function GiveHint(
    team: string,
    my_words: string[],
    opponent_words: string[],
    neutral_words: string[],
    bomb_words: string[],
    revealed_words: string[]
) -> HintResponse {
  client GPT4oMini
  prompt #"
    You are playing Codenames as the {{ team | upper }} team's spymaster.

    YOUR GOAL: Give a one-word hint and a number to help your teammate.

    YOUR TEAM'S WORDS: {{ my_words | join(', ') }}
    OPPONENT'S WORDS (avoid): {{ opponent_words | join(', ') }}
    BOMB WORD(S) (NEVER hint at these): {{ bomb_words | join(', ') }}

    // Add your custom strategy instructions here
    STRATEGY:
    - Look for semantic clusters
    - Balance safety vs. aggressiveness
    - Avoid risky hints near the bomb

    {{ ctx.output_format }}
  "#
}
```
After editing, regenerate the client:
baml generate
See the BAML Integration page for more details on prompt engineering and testing.
This architecture enables testing different aspects of AI capability:
Information Access
The HintGiver has complete information (all colors) and must compress knowledge into a single word. The Guesser has incomplete information and must interpret hints correctly. This tests how well models handle information asymmetry.
Cognitive Skills
HintGiver needs:
Semantic clustering (finding common themes)
Risk assessment (avoiding opponent words and bombs)
Strategic planning (maximizing points vs. safety)
Guesser needs:
Semantic reasoning (word associations)
Confidence calibration (knowing when to stop)
Context integration (using previous hints)
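Confidence calibration in particular lends itself to a simple sketch: rank candidate guesses by an association score and stop once confidence drops below a threshold. The scoring function and threshold here are illustrative assumptions, not the benchmark's actual mechanism:

```python
# Sketch of confidence-calibrated guessing: given candidates ranked
# best-first with a confidence score, stop at the first weak guess.
def truncate_by_confidence(ranked: list[tuple[str, float]],
                           threshold: float = 0.5) -> list[str]:
    guesses = []
    for word, confidence in ranked:
        if confidence < threshold:
            break  # knowing when to stop beats guessing everything
        guesses.append(word)
    return guesses
```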
Cross-Model Coordination
You can pair different models as HintGiver and Guesser to test:
Which models work well together
Communication effectiveness across model families
Cost optimization (cheap guesser + expensive hint giver)
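A pairing sweep over the two roles can be sketched as a simple cross product. The model names below are illustrative placeholders, not an endorsement of any particular pairing:

```python
from itertools import product

# Illustrative sweep: every HintGiver model paired with every Guesser model,
# e.g. to build a cross-model coordination matrix.
hint_models = ["expensive-model-a", "cheap-model-b"]
guess_models = ["expensive-model-a", "cheap-model-b"]

pairings = list(product(hint_models, guess_models))
# Each (hint_model, guess_model) tuple defines one benchmark configuration.
```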