The TextArenaEnv integration wraps TextArena text-based game environments for multi-turn interaction with language models. TextArena provides competitive and collaborative text-based games designed for LLM evaluation.

Features

  • Text-based games - Wordle, 20 Questions, Poker, and more
  • Multi-turn interaction - Games require multiple model responses
  • Efficient memory sharing - Optimized for parallel rollouts
  • Custom feedback - Transform game observations for better prompting
  • XML formatting - Built-in parser for structured responses

Installation

Install with TextArena support:
uv add 'verifiers[ta]'
This installs:
  • textarena - TextArena game library
  • nltk - Natural language processing (for word games)

Quick Start

Step 1: Create environment

Create a basic Wordle environment:
import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

def load_environment():
    return TextArenaEnv(
        game="Wordle-v0",
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )
Step 2: Evaluate

Run an evaluation:
prime eval run my-wordle-env -m openai/gpt-4.1-mini -n 20

Available Games

TextArena provides several game types:

Word Games

  • Wordle-v0 - Classic Wordle game
  • WordChain-v0 - Word association chains
  • Scrabble-v0 - Scrabble with simplified rules

Logic Games

  • TwentyQuestions-v0 - Guess the object
  • Mastermind-v0 - Code-breaking game

Strategy Games

  • Chess-v0 - Text-based chess
  • Go-v0 - Text-based Go
  • Poker-v0 - Texas Hold’em

See the TextArena repository for the full list.

Configuration

Basic Configuration

env = TextArenaEnv(
    game="Wordle-v0",
    num_train_examples=1000,
    num_eval_examples=200,
    seed=0,
)

Custom Parser

By default, TextArenaEnv uses a vf.XMLParser with <think> and <guess> fields. To override it, pass your own parser:
custom_parser = vf.XMLParser(
    fields=["reasoning", "action"],
    answer_field="action"
)

env = TextArenaEnv(
    game="Wordle-v0",
    parser=custom_parser,
    num_train_examples=1000,
)

Custom System Prompt

env = TextArenaEnv(
    game="Wordle-v0",
    system_prompt="You are an expert Wordle player. Make strategic guesses based on the feedback.",
    num_train_examples=1000,
)

Custom Feedback Function

TextArena games return the full game state each turn, but you may want to render only the most recent change. Use feedback_fn to transform observations:
def format_feedback(observation: str) -> str:
    """Extract only the latest feedback from full game state."""
    lines = observation.split("\n")
    # Find the most recent guess feedback
    for line in reversed(lines):
        if "Feedback:" in line:
            return line
    return observation

env = TextArenaEnv(
    game="Wordle-v0",
    feedback_fn=format_feedback,
    num_train_examples=1000,
)
Verifiers doesn’t allow overwriting past messages—only appending. TextArena games often return full game state rather than turn-level diffs, so feedback_fn is useful for rendering clean, incremental feedback.
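Because a feedback_fn is just a plain function of the observation string, you can sanity-check it in isolation before wiring it into the environment. The observation text below is illustrative only, not TextArena's exact output format:

```python
def format_feedback(observation: str) -> str:
    """Extract only the latest feedback line from the full game state."""
    for line in reversed(observation.split("\n")):
        if "Feedback:" in line:
            return line
    return observation

# Illustrative observation resembling an accumulated Wordle game state
sample = "\n".join([
    "Turn 1 Guess: CRANE",
    "Feedback: C(x) R(x) A(-) N(x) E(-)",
    "Turn 2 Guess: SLATE",
    "Feedback: S(+) L(x) A(-) T(x) E(+)",
])

print(format_feedback(sample))  # only the most recent Feedback line
```

If no feedback line is found, the function falls back to the raw observation, so the model still sees something usable on the first turn.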

Custom Rubric

By default, the game’s built-in reward is used. Override with a custom rubric:
async def win_bonus(state: vf.State) -> float:
    """Extra reward for winning quickly."""
    if state.get("reward", 0) > 0.9:  # Won the game
        turns = len(state.get("trajectory", []))
        return 1.0 / turns  # More reward for fewer turns
    return 0.0

rubric = vf.Rubric(funcs=[win_bonus])

env = TextArenaEnv(
    game="Wordle-v0",
    rubric=rubric,
    num_train_examples=1000,
)
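Reward functions like win_bonus can be exercised directly on a hand-built state dict before registering them in a rubric. The state keys below ("reward", "trajectory") follow the example above; the actual contents of the state depend on the environment, so treat this as a sketch:

```python
import asyncio

async def win_bonus(state: dict) -> float:
    """Extra reward for winning quickly."""
    if state.get("reward", 0) > 0.9:       # treated as a win
        turns = len(state.get("trajectory", []))
        return 1.0 / turns                  # fewer turns -> larger bonus
    return 0.0

won_fast = {"reward": 1.0, "trajectory": [None] * 3}  # won in 3 turns
lost = {"reward": 0.0, "trajectory": [None] * 6}

print(asyncio.run(win_bonus(won_fast)))  # 1/3
print(asyncio.run(win_bonus(lost)))      # 0.0
```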

Full Example

import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

def render_wordle_feedback(observation: str) -> str:
    """Format Wordle feedback for better readability."""
    lines = observation.split("\n")
    feedback_lines = []
    
    for line in lines:
        if "Guess" in line or "Feedback" in line:
            feedback_lines.append(line)
    
    if not feedback_lines:
        return observation
    
    # Return only the most recent guess and feedback
    return "\n".join(feedback_lines[-2:])

def load_environment(
    game: str = "Wordle-v0",
    num_train_examples: int = 1000,
    num_eval_examples: int = 100,
    seed: int = 0,
) -> vf.Environment:
    """Load a TextArena environment.
    
    Args:
        game: TextArena game ID
        num_train_examples: Number of training examples
        num_eval_examples: Number of eval examples  
        seed: Random seed for word selection
    """
    parser = vf.XMLParser(
        fields=["think", "guess"],
        answer_field="guess"
    )
    
    return TextArenaEnv(
        game=game,
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        parser=parser,
        system_prompt="You are playing Wordle. Think through your strategy, then make a guess.",
        feedback_fn=render_wordle_feedback,
        seed=seed,
    )

Expected Format

Models should respond with XML-formatted guesses:
<think>
Based on the feedback:
- 'A' is in the word but wrong position
- 'E' is not in the word
- 'S' is in the word and correct position

I'll try "STAIN" next.
</think>

<guess>
STAIN
</guess>

Performance Optimization

TextArenaEnv includes memory optimization for parallel rollouts:
shared_memo = TextArenaEnv.build_shared_memo(ta_env)
This shares immutable data (like English dictionary word lists) across environment copies, saving ~38MB and ~120ms per rollout. This is handled automatically.

Game-Specific Notes

Wordle

  • Words are randomly selected from the TextArena word list
  • Default max turns: 6
  • Reward is based on number of guesses (fewer is better)
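To make the "fewer is better" shaping concrete, here is one plausible way such a reward could be computed. This is an illustration of the idea only, not TextArena's actual formula:

```python
def wordle_shaped_reward(won: bool, guesses: int, max_turns: int = 6) -> float:
    """Illustrative shaping only: full credit for a first-guess win,
    linearly less credit for later wins, zero for a loss."""
    if not won:
        return 0.0
    return (max_turns - guesses + 1) / max_turns

print(wordle_shaped_reward(True, 1))   # 1.0
print(wordle_shaped_reward(True, 6))   # ~0.167
print(wordle_shaped_reward(False, 6))  # 0.0
```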

TwentyQuestions

  • Model asks yes/no questions to guess the object
  • Limited to 20 questions
  • Reward for correct guess within question limit

Chess

  • Moves in algebraic notation (e.g., “e2e4”)
  • Game state includes board representation
  • Reward based on game outcome

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Game reward (task-specific) |
| num_turns | Number of turns taken |
| format_reward | XML format compliance (if parser used) |

Best Practices

When wrapping new TextArena games, investigate the source code to understand the observation format. Many games return full state rather than turn-level diffs.
  • Use feedback_fn - Transform full-state observations to incremental feedback
  • Test locally first - Try a few games manually to understand difficulty
  • Validate parsing - Ensure your parser extracts the right fields
  • Custom prompts - Game-specific instructions improve performance
  • Seed consistency - Use same seed for reproducible experiments

Troubleshooting

NLTK Download Errors

TextArena uses NLTK for word games, and the environment downloads the required corpora automatically. If downloads still fail (for example, behind a proxy or in an offline environment), fetch them manually:
import nltk
nltk.download('words', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

Invalid Moves

If the model makes invalid moves (e.g., non-existent words in Wordle):
  • Improve the system prompt with game rules
  • Add examples of valid moves in few-shot prompts
  • Use a more capable model

Memory Issues

For large-scale parallel rollouts:
  • The environment automatically shares immutable data
  • If still seeing issues, reduce num_train_examples
  • Consider running evaluation in batches
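Batching can be as simple as slicing the example indices into ranges and evaluating each range separately. This is a generic sketch, independent of the verifiers API:

```python
def batches(n_examples: int, batch_size: int):
    """Yield (start, end) index ranges covering n_examples."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

print(list(batches(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```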

Examples

See the gem-wordle example in the Verifiers repository for a complete implementation.
