This benchmark uses BAML instead of direct API calls or other LLM frameworks for several key reasons:
Guaranteed Structure
LLMs sometimes return malformed JSON or extra text. BAML automatically validates and retries to ensure you always get valid HintResponse or GuessResponse objects.
Provider Agnostic
Write agent logic once, run on 50+ models across OpenAI, Anthropic, Google, DeepSeek, Grok, and more. Switch providers by changing a single line.
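As an illustration of that one-line switch, a `clients.baml` might define two interchangeable clients like this (the client and model names here are illustrative assumptions, not the benchmark's actual config):

```baml
// Default client used by the agent functions
client<llm> GPT4oMini {
  provider openai
  options {
    model "gpt-4o-mini"
    api_key env.OPENAI_API_KEY
  }
}

// Swap providers by pointing a function's `client` line at this instead
client<llm> ClaudeSonnet {
  provider anthropic
  options {
    model "claude-3-5-sonnet-20241022"
    api_key env.ANTHROPIC_API_KEY
  }
}
```

The agent logic and prompts stay untouched; only the `client` reference changes.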
Prompt Engineering
Edit prompts in .baml files with syntax highlighting and templates. Test changes in the interactive playground before running expensive benchmarks.
Type Safety
BAML generates Python types from schemas. Your IDE autocompletes fields, and type checkers catch errors at development time.
```baml
class HintResponse {
  word string @description("One-word hint (no spaces, not on the board)")
  count int @description("Number of words this hint relates to (1-9)")
  reasoning string @description("Brief explanation of strategy and word associations")
}
```
This generates a Python class that BAML populates from LLM output:
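Roughly, the generated type is equivalent to the following sketch, shown as a plain dataclass for illustration (the code BAML actually generates lives in `baml_client` and uses Pydantic models):

```python
from dataclasses import dataclass

# Approximate shape of the class BAML generates from the schema above
@dataclass
class HintResponse:
    word: str       # One-word hint (no spaces, not on the board)
    count: int      # Number of words this hint relates to (1-9)
    reasoning: str  # Brief explanation of strategy and word associations
```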
```baml
class GuessResponse {
  guesses string[] @description("List of words to guess, ordered by confidence")
  reasoning string @description("Brief explanation of why these words relate")
}
```
Arrays in BAML map to Python lists:
```python
# Usage in Python
response: GuessResponse = b.MakeGuesses(...)
for guess in response.guesses:  # Type: List[str]
    print(guess)
```
The @description annotations are included in the prompt sent to the LLM, helping it understand what to generate.
```baml
function GiveHint(
  team: string @description("Team color: 'blue' or 'red'"),
  my_words: string[] @description("Your team's unrevealed words"),
  opponent_words: string[] @description("Opponent's unrevealed words to avoid"),
  neutral_words: string[] @description("Neutral unrevealed words to avoid"),
  bomb_words: string[] @description("The bomb word(s) - NEVER hint at these!"),
  revealed_words: string[] @description("Already revealed words (for context)")
) -> HintResponse {
  client GPT4oMini // Default client
  prompt #"
    You are playing Codenames as the {{ team | upper }} team's spymaster.

    YOUR GOAL: Give a one-word hint and a number to help your teammate.

    YOUR TEAM'S WORDS: {{ my_words | join(', ') }}
    OPPONENT'S WORDS (avoid these): {{ opponent_words | join(', ') }}
    NEUTRAL WORDS (avoid these): {{ neutral_words | join(', ') }}
    BOMB WORD(S) (NEVER hint at these): {{ bomb_words | join(', ') }}
    {% if revealed_words | length > 0 %}
    ALREADY REVEALED: {{ revealed_words | join(', ') }}
    {% endif %}

    RULES:
    1. Give a ONE-WORD hint (no spaces, no words from the board)
    2. Give a NUMBER indicating how many of your words relate to this hint
    3. Your hint should connect multiple of your words if possible
    4. Avoid hints that could lead to opponent words, neutral words, or BOMB(S)

    {{ ctx.output_format }}
  "#
}
```
```baml
function MakeGuesses(
  team: string @description("Team color: 'blue' or 'red'"),
  hint_word: string @description("The hint word given by your spymaster"),
  hint_count: int @description("Number of words the hint relates to"),
  board_words: string[] @description("All words currently on the board"),
  revealed_words: string[] @description("Already revealed words (don't guess these)")
) -> GuessResponse {
  client GPT4oMini
  prompt #"
    You are playing Codenames as the {{ team | upper }} team's field operative.

    YOUR HINT: "{{ hint_word }}" ({{ hint_count }})

    WORDS ON THE BOARD (unrevealed):
    {% for word in board_words if word not in revealed_words %}{{ word }}{% if not loop.last %}, {% endif %}{% endfor %}

    YOUR TASK:
    1. Identify which unrevealed words relate to "{{ hint_word }}"
    2. Return up to {{ hint_count }} words (you can guess fewer if unsure)
    3. Order them by confidence (most confident first)

    IMPORTANT:
    - Only guess words from the unrevealed list above
    - If you guess wrong, your turn ends immediately
    - Be thoughtful - quality over quantity

    {{ ctx.output_format }}
  "#
}
```
```python
# HintResponse is automatically validated
assert isinstance(response.word, str)
assert isinstance(response.count, int)
assert response.count >= 1

# Use the response
hint_response = HintResponse(
    word=response.word,
    count=response.count
)
```
If the LLM returns invalid data that can’t be coerced to the schema, BAML will retry automatically (configured in clients.baml). If all retries fail, it raises an exception.
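A retry policy in clients.baml could look like the following sketch (the policy name and backoff values are illustrative, not the benchmark's actual settings):

```baml
// Retry up to 2 more times with exponential backoff between attempts
retry_policy Exponential {
  max_retries 2
  strategy {
    type exponential_backoff
    delay_ms 300
    multiplier 1.5
  }
}
```

A client opts in by adding a `retry_policy Exponential` line to its definition.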
Make the consequences of hitting the bomb crystal clear:
```
BOMB WORD(S) (NEVER hint at these or YOUR TEAM LOSES INSTANTLY):
{{ bomb_words | join(', ') }}
```
Encourage Clustering
Guide the model toward multi-word hints:
```
STRATEGY:
- Try to connect 2-3 of your words with a single hint
- Look for semantic relationships: categories, synonyms, associations
- Higher counts win games faster, but be careful not to include risky words
```
Add Examples
Few-shot prompting can improve performance:
```
EXAMPLES:
- If your words are ["dog", "cat", "mouse"], hint: "animal" (3)
- If your words are ["king", "queen", "crown"], hint: "royalty" (3)
```
```
IMPORTANT:
- Your turn ends IMMEDIATELY if you guess wrong
- It's better to guess fewer words confidently than to risk a wrong guess
- If you're unsure about a word, leave it out
```
Request Confidence Ordering
Make sure guesses are ordered by confidence:
```
Return guesses in order of confidence:
1. Most confident guess first
2. Second most confident
3. Least confident (if applicable)
```
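Ordering matters because guesses are consumed sequentially and the turn ends on the first miss. A hypothetical consumption loop (not the benchmark's actual engine code) shows why a confident-first ordering salvages the safe guesses:

```python
def apply_guesses(guesses, my_words, revealed):
    """Reveal guesses in order; stop the turn on the first non-team word."""
    correct = []
    for word in guesses:  # most confident first
        revealed.add(word)
        if word not in my_words:
            break  # wrong guess: turn ends immediately
        correct.append(word)
    return correct

# The two confident hits land before the risky miss ends the turn
apply_guesses(["dog", "cat", "piano"], my_words={"dog", "cat"}, revealed=set())
# → ["dog", "cat"]
```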
```python
from baml_py import BamlValidationError

try:
    response = b.GiveHint(...)
except BamlValidationError as e:
    print(f"LLM returned invalid data: {e}")
    # BAML already retried automatically
    # This error means all retries failed
except Exception as e:
    print(f"Unexpected error: {e}")
```