
Overview

The Codenames AI Benchmark uses BAML (Basically a Made-Up Language) for prompt management. BAML provides structured prompting with type safety, template rendering, and automatic output parsing.

BAML Architecture

Prompts are defined in baml_src/main.baml and compiled into type-safe Python functions:
baml_src/
├── main.baml          # Prompt definitions
└── clients.baml       # LLM client configurations

Schema Definitions

BAML uses strongly-typed schemas for inputs and outputs:
baml_src/main.baml
class HintResponse {
  word string @description("One-word hint (no spaces, not on the board)")
  count int @description("Number of words this hint relates to (1-9)")
  reasoning string @description("Brief explanation of strategy and word associations")
}

class GuessResponse {
  guesses string[] @description("List of words to guess, ordered by confidence (most confident first)")
  reasoning string @description("Brief explanation of why these words relate to the hint")
}
The @description annotations help the LLM understand what each field represents, improving output quality.
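Once generated, these schemas become typed Python classes that the parser validates for you. As a rough illustration (not the actual generated code), the `HintResponse` constraints described above could be checked like this:

```python
from dataclasses import dataclass

@dataclass
class HintResponse:
    word: str        # one-word hint, not on the board
    count: int       # 1-9
    reasoning: str

def validate_hint(hint: HintResponse, board_words: list[str]) -> bool:
    """Check the constraints stated in the @description annotations."""
    return (
        " " not in hint.word
        and hint.word.lower() not in {w.lower() for w in board_words}
        and 1 <= hint.count <= 9
    )
```

A caller can use this as a sanity check on parsed LLM output before applying a hint to the game state.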

Hint Giver Prompt

The GiveHint function generates hints for spymasters:
baml_src/main.baml
function GiveHint(
  team: string @description("Team color: 'blue' or 'red'"),
  my_words: string[] @description("Your team's unrevealed words that need to be guessed"),
  opponent_words: string[] @description("Opponent's unrevealed words to avoid"),
  neutral_words: string[] @description("Neutral unrevealed words to avoid"),
  bomb_words: string[] @description("The bomb word(s) - NEVER hint at these or you lose!"),
  revealed_words: string[] @description("Already revealed words (for context)")
) -> HintResponse {
  client GPT4oMini  // Default client - can be overridden at runtime

  prompt #"
    You are playing Codenames as the {{ team | upper }} team's spymaster.

    YOUR GOAL: Give a one-word hint and a number to help your teammate guess your team's words.

    YOUR TEAM'S WORDS (need to be guessed):
    {{ my_words | join(', ') }}

    OPPONENT'S WORDS (avoid these):
    {{ opponent_words | join(', ') }}

    NEUTRAL WORDS (avoid these):
    {{ neutral_words | join(', ') }}

    BOMB WORD(S) (NEVER hint at these):
    {{ bomb_words | join(', ') }}

    {% if revealed_words | length > 0 %}
    ALREADY REVEALED:
    {{ revealed_words | join(', ') }}
    {% endif %}

    RULES:
    1. Give a ONE-WORD hint (no spaces, no words from the board)
    2. Give a NUMBER indicating how many of your words relate to this hint
    3. Your hint should connect multiple of your words if possible
    4. Avoid hints that could lead to opponent words, neutral words, or the BOMB(S)
    5. Be strategic - think about semantic associations, categories, and relationships

    STRATEGY TIPS:
    - Look for semantic clusters (e.g., "animal" for dog, cat, mouse)
    - Consider word relationships (e.g., "royalty" for king, queen, crown)
    - Balance safety vs. aggressiveness based on game state
    - Avoid risky hints that could lead to the bomb(s)

    {{ ctx.output_format }}
  "#
}
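The "semantic clusters" tip above can be made concrete with a toy scorer. The `CATEGORIES` map below is a hypothetical stand-in for the semantic knowledge an LLM applies; it is not part of the benchmark:

```python
# Toy illustration of the "semantic clusters" strategy tip.
# CATEGORIES is a hypothetical stand-in for an LLM's word associations.
CATEGORIES = {
    "animal": {"dog", "cat", "mouse", "lion"},
    "royalty": {"king", "queen", "crown"},
}

def best_hint(my_words, bomb_words):
    """Pick the category covering the most team words,
    skipping any category that touches a bomb word (rule 4)."""
    best = ("", 0)
    for hint, members in CATEGORIES.items():
        if members & set(bomb_words):
            continue  # never risk a hint associated with the bomb
        covered = len(members & set(my_words))
        if covered > best[1]:
            best = (hint, covered)
    return best
```

For example, with `my_words = ["dog", "cat", "king"]` and `bomb_words = ["lion"]`, the "animal" cluster is rejected outright even though it covers more team words.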

Guesser Prompt

The MakeGuesses function handles field operative guessing:
baml_src/main.baml
function MakeGuesses(
  team: string @description("Team color: 'blue' or 'red'"),
  hint_word: string @description("The hint word given by your spymaster"),
  hint_count: int @description("Number of words the hint relates to"),
  board_words: string[] @description("All words currently on the board"),
  revealed_words: string[] @description("Already revealed words (don't guess these)")
) -> GuessResponse {
  client GPT4oMini  // Default client - can be overridden at runtime

  prompt #"
    You are playing Codenames as the {{ team | upper }} team's field operative.

    YOUR HINT: "{{ hint_word }}" ({{ hint_count }})
    This means your spymaster wants you to guess {{ hint_count }} word(s) related to "{{ hint_word }}".

    WORDS ON THE BOARD (unrevealed):
    {% for word in board_words if word not in revealed_words %}{{ word }}{% if not loop.last %}, {% endif %}{% endfor %}

    {% if revealed_words | length > 0 %}
    ALREADY REVEALED (don't guess these):
    {{ revealed_words | join(', ') }}
    {% endif %}

    YOUR TASK:
    1. Identify which unrevealed words relate to the hint "{{ hint_word }}"
    2. Return up to {{ hint_count }} words (you can guess fewer if unsure)
    3. Order them by confidence (most confident first)
    4. You can optionally guess {{ hint_count + 1 }} words if you want to use a previous hint

    IMPORTANT:
    - Only guess words from the unrevealed list above
    - If you guess wrong, your turn ends immediately
    - If you hit the bomb, your team loses the game
    - Be thoughtful - quality over quantity
    - It's better to guess fewer words confidently than to guess risky words

    STRATEGY:
    - Think about semantic relationships and word associations
    - Consider multiple meanings of the hint word
    - Rank words by how strongly they relate to the hint
    - If uncertain about a word, leave it out

    {{ ctx.output_format }}
  "#
}
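The rules in the prompt (only unrevealed words, at most `hint_count + 1` guesses) can also be enforced on the caller side before guesses are applied to the board. A minimal sketch, assuming guesses arrive as a list of strings:

```python
def filter_guesses(guesses, board_words, revealed_words, hint_count):
    """Keep only legal guesses: unrevealed board words, deduplicated,
    capped at hint_count + 1 (the optional bonus guess)."""
    unrevealed = {w.lower() for w in board_words} - {w.lower() for w in revealed_words}
    seen, legal = set(), []
    for g in guesses:
        w = g.lower()
        if w in unrevealed and w not in seen:
            seen.add(w)
            legal.append(g)
    return legal[: hint_count + 1]
```

Deduplicating and capping here means a malformed LLM response degrades gracefully instead of producing illegal moves.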

Prompt Engineering Best Practices

Clear Structure

Organize prompts with clear sections: goal, inputs, rules, and strategy

Explicit Examples

Show concrete examples of good hints and guesses in strategy sections

Safety First

Emphasize bomb avoidance and risk management prominently

Output Format

Use {{ ctx.output_format }} for automatic schema documentation

Template Syntax

BAML uses Jinja2-style templating:

Variable Interpolation

{{ team }}                    # Simple variable
{{ team | upper }}            # With filter
{{ my_words | join(', ') }}   # Array joining

Conditional Logic

{% if revealed_words | length > 0 %}
ALREADY REVEALED:
{{ revealed_words | join(', ') }}
{% endif %}

Loops

{% for word in board_words %}
- {{ word }}
{% endfor %}

# Inline loop with filtering
{% for word in board_words if word not in revealed_words %}{{ word }}{% endfor %}
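The inline filtered loop above renders to a comma-separated list of unrevealed words. Its plain-Python equivalent makes the behavior easy to verify:

```python
board_words = ["dog", "cat", "tree", "rock"]
revealed_words = ["tree"]

# Equivalent of the inline Jinja loop with filtering and comma separators:
rendered = ", ".join(w for w in board_words if w not in revealed_words)
print(rendered)  # dog, cat, rock
```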

Customizing Prompts

To modify prompts, edit baml_src/main.baml and regenerate the client:
Terminal
# After editing main.baml
baml-cli generate

# Or if using npm script
npm run baml:generate
Always regenerate the BAML client after editing prompt files, or your changes won’t take effect.

Advanced Techniques

Chain-of-Thought Prompting

Add reasoning steps to improve hint quality:
prompt #"
  STRATEGY PROCESS:
  1. First, identify semantic clusters in your team's words
  2. Consider which hint maximizes coverage while minimizing risk
  3. Check if your hint could accidentally relate to opponent/neutral/bomb words
  4. Finalize your hint word and count
  
  Now provide your hint:
  {{ ctx.output_format }}
"#

Few-Shot Examples

Include examples of good gameplay:
prompt #"
  EXAMPLE HINTS:
  - Team words: [dog, cat, lion] → Hint: "ANIMAL" (3)
  - Team words: [king, queen, crown] → Hint: "ROYALTY" (3)
  - Team words: [ocean, wave, beach] → Hint: "WATER" (3)
  
  Now generate your hint for these words:
  {{ my_words | join(', ') }}
"#

Role-Specific Instructions

Tailor prompts for different model strengths:
function GiveHintConservative(
  # ... parameters
) -> HintResponse {
  client GPT4oMini
  
  prompt #"
    STRATEGY: CONSERVATIVE PLAY
    - Prefer hints that connect 2-3 words safely
    - Avoid any hints with ambiguity
    - Prioritize bomb avoidance over aggressive plays
    # ...
  "#
}

function GiveHintAggressive(
  # ... parameters  
) -> HintResponse {
  client GPT4o
  
  prompt #"
    STRATEGY: AGGRESSIVE PLAY
    - Try to connect 3-4+ words when possible
    - Accept calculated risks for higher payoff
    - Use creative associations
    # ...
  "#
}
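A runner could then dispatch between the two variants based on game state. The policy below is a hypothetical sketch (the function names match the variants above, but the dispatch logic is illustrative only):

```python
def choose_strategy(my_remaining: int, opponent_remaining: int) -> str:
    """Pick a prompt variant from the score margin (hypothetical policy):
    play aggressively when behind, conservatively when ahead or tied."""
    if my_remaining > opponent_remaining:
        return "GiveHintAggressive"   # behind: accept calculated risks
    return "GiveHintConservative"     # ahead or tied: protect the lead
```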

Testing Prompts

Create a test script to evaluate prompt changes:
test_prompts.py
from agents.llm.baml_agents import BAMLHintGiver, BAMLModel
from game import Team

# Test with different models
models = [
    BAMLModel.GPT4O_MINI,
    BAMLModel.CLAUDE_SONNET_45,
    BAMLModel.GEMINI_25_FLASH
]

# GiveHint does not take board_words, so the scenario lists only its arguments
test_scenario = {
    "my_words": ["dog", "cat", "mouse"],
    "opponent_words": ["tree", "rock"],
    "neutral_words": ["table", "chair", "book"],
    "bomb_words": ["bomb"],
    "revealed_words": [],
}

for model in models:
    hint_giver = BAMLHintGiver(Team.BLUE, model)
    response = hint_giver.give_hint(**test_scenario)
    print(f"{model.value}: {response.word} ({response.count})")

Prompt Optimization Metrics

Measure prompt effectiveness:
  • Hint Success Rate: % of hints leading to correct guesses
  • Average Hint Count: Words attempted per hint
  • Hint Efficiency: Correct guesses / hint count
  • Bomb Avoidance: % of games without bomb hits
  • First Guess Accuracy: % correct on first guess
See Analysis Metrics for detailed tracking.
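These metrics can be computed from per-turn records. The record shape below is an assumption for illustration, not the benchmark's actual log format:

```python
def hint_metrics(turns):
    """Compute hint metrics from per-turn records. Each record is
    assumed (hypothetically) to look like:
      {"hint_count": 3, "correct_guesses": 2, "first_guess_correct": True}
    """
    if not turns:
        return {}
    total_correct = sum(t["correct_guesses"] for t in turns)
    total_count = sum(t["hint_count"] for t in turns)
    return {
        # % of hints that led to at least one correct guess
        "hint_success_rate": sum(t["correct_guesses"] > 0 for t in turns) / len(turns),
        "avg_hint_count": total_count / len(turns),
        # correct guesses per word attempted
        "hint_efficiency": total_correct / total_count if total_count else 0.0,
        "first_guess_accuracy": sum(t["first_guess_correct"] for t in turns) / len(turns),
    }
```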

Model-Specific Considerations

Different models respond to the same prompt differently. When adapting prompts for a specific model, consider whether it:
  • Excels at creative associations
  • Needs explicit safety reminders
  • Follows structured formats reliably
  • Benefits from temperature tuning (0.3-0.7)

Common Pitfalls

Avoid These Mistakes:
  • Overly verbose prompts that dilute key instructions
  • Forgetting to emphasize bomb avoidance
  • Not specifying output format requirements
  • Using ambiguous language in rules
  • Failing to regenerate after prompt changes

Next Steps

Custom Agents

Integrate your optimized prompts into custom agents

Analysis Metrics

Measure prompt performance with detailed metrics
