The TextArenaEnv integration wraps TextArena text-based game environments for multi-turn interaction with language models. TextArena provides competitive and collaborative text-based games designed for LLM evaluation.

Features

  • Text-based games - Wordle, 20 Questions, Poker, and more
  • Multi-turn interaction - Games require multiple model responses
  • Efficient memory sharing - Optimized for parallel rollouts
  • Custom feedback - Transform game observations for better prompting
  • XML formatting - Built-in parser for structured responses

Installation

Install with TextArena support:
uv add 'verifiers[ta]'
This installs:
  • textarena - TextArena game library
  • nltk - Natural language processing (for word games)

Quick Start

Step 1: Create environment

Create a basic Wordle environment:
import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

def load_environment():
    return TextArenaEnv(
        game="Wordle-v0",
        num_train_examples=1000,
        num_eval_examples=100,
        seed=0,
    )
Step 2: Evaluate

Run an evaluation:
prime eval run my-wordle-env -m openai/gpt-4.1-mini -n 20

Available Games

TextArena provides several game types:

Word Games

  • Wordle-v0 - Classic Wordle game
  • WordChain-v0 - Word association chains
  • Scrabble-v0 - Scrabble with simplified rules

Logic Games

  • TwentyQuestions-v0 - Guess the object
  • Mastermind-v0 - Code-breaking game

Strategy Games

  • Chess-v0 - Text-based chess
  • Go-v0 - Text-based Go
  • Poker-v0 - Texas Hold’em

See the TextArena repository for the full list.

Configuration

Basic Configuration

env = TextArenaEnv(
    game="Wordle-v0",
    num_train_examples=1000,
    num_eval_examples=200,
    seed=0,
)

Custom Parser

By default, TextArenaEnv uses a vf.XMLParser with <think> and <guess> fields. To override it, pass your own parser:
custom_parser = vf.XMLParser(
    fields=["reasoning", "action"],
    answer_field="action"
)

env = TextArenaEnv(
    game="Wordle-v0",
    parser=custom_parser,
    num_train_examples=1000,
)

Custom System Prompt

env = TextArenaEnv(
    game="Wordle-v0",
    system_prompt="You are an expert Wordle player. Make strategic guesses based on the feedback.",
    num_train_examples=1000,
)

Custom Feedback Function

TextArena games return the full game state each turn, but you may want to render only the most recent change. Use feedback_fn to transform observations:
def format_feedback(observation: str) -> str:
    """Extract only the latest feedback from full game state."""
    lines = observation.split("\n")
    # Find the most recent guess feedback
    for line in reversed(lines):
        if "Feedback:" in line:
            return line
    return observation

env = TextArenaEnv(
    game="Wordle-v0",
    feedback_fn=format_feedback,
    num_train_examples=1000,
)
Verifiers doesn’t allow overwriting past messages—only appending. TextArena games often return full game state rather than turn-level diffs, so feedback_fn is useful for rendering clean, incremental feedback.
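Because a feedback_fn is just a plain function of the observation string, you can sanity-check it in isolation before wiring it into the environment. The observation text below is illustrative only, not TextArena's exact output format:

```python
def format_feedback(observation: str) -> str:
    """Extract only the latest feedback line from the full game state."""
    for line in reversed(observation.split("\n")):
        if "Feedback:" in line:
            return line
    return observation

# Illustrative observation resembling an accumulated Wordle game state
sample = "\n".join([
    "Turn 1 Guess: CRANE",
    "Feedback: C(x) R(x) A(-) N(x) E(-)",
    "Turn 2 Guess: SLATE",
    "Feedback: S(+) L(x) A(-) T(x) E(+)",
])

print(format_feedback(sample))  # only the most recent Feedback line
```

If no feedback line is found, the function falls back to the raw observation, so the model still sees something usable on the first turn.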

Custom Rubric

By default, the game’s built-in reward is used. Override with a custom rubric:
async def win_bonus(state: vf.State) -> float:
    """Extra reward for winning quickly."""
    if state.get("reward", 0) > 0.9:  # Won the game
        turns = len(state.get("trajectory", []))
        return 1.0 / turns  # More reward for fewer turns
    return 0.0

rubric = vf.Rubric(funcs=[win_bonus])

env = TextArenaEnv(
    game="Wordle-v0",
    rubric=rubric,
    num_train_examples=1000,
)
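Reward functions like win_bonus can be exercised directly on a hand-built state dict before registering them in a rubric. The state keys below ("reward", "trajectory") follow the example above; the actual contents of the state depend on the environment, so treat this as a sketch:

```python
import asyncio

async def win_bonus(state: dict) -> float:
    """Extra reward for winning quickly."""
    if state.get("reward", 0) > 0.9:       # treated as a win
        turns = len(state.get("trajectory", []))
        return 1.0 / turns                  # fewer turns -> larger bonus
    return 0.0

won_fast = {"reward": 1.0, "trajectory": [None] * 3}  # won in 3 turns
lost = {"reward": 0.0, "trajectory": [None] * 6}

print(asyncio.run(win_bonus(won_fast)))  # 1/3
print(asyncio.run(win_bonus(lost)))      # 0.0
```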

Full Example

import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

def render_wordle_feedback(observation: str) -> str:
    """Format Wordle feedback for better readability."""
    lines = observation.split("\n")
    feedback_lines = []
    
    for line in lines:
        if "Guess" in line or "Feedback" in line:
            feedback_lines.append(line)
    
    if not feedback_lines:
        return observation
    
    # Return only the most recent guess and feedback
    return "\n".join(feedback_lines[-2:])

def load_environment(
    game: str = "Wordle-v0",
    num_train_examples: int = 1000,
    num_eval_examples: int = 100,
    seed: int = 0,
) -> vf.Environment:
    """Load a TextArena environment.
    
    Args:
        game: TextArena game ID
        num_train_examples: Number of training examples
        num_eval_examples: Number of eval examples  
        seed: Random seed for word selection
    """
    parser = vf.XMLParser(
        fields=["think", "guess"],
        answer_field="guess"
    )
    
    return TextArenaEnv(
        game=game,
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        parser=parser,
        system_prompt="You are playing Wordle. Think through your strategy, then make a guess.",
        feedback_fn=render_wordle_feedback,
        seed=seed,
    )

Expected Format

Models should respond with XML-formatted guesses:
<think>
Based on the feedback:
- 'A' is in the word but wrong position
- 'E' is not in the word
- 'S' is in the word and correct position

I'll try "STAIN" next.
</think>

<guess>
STAIN
</guess>

Performance Optimization

TextArenaEnv includes memory optimization for parallel rollouts:
shared_memo = TextArenaEnv.build_shared_memo(ta_env)
This shares immutable data (like English dictionary word lists) across environment copies, saving ~38MB and ~120ms per rollout. This is handled automatically.

Game-Specific Notes

Wordle

  • Words are randomly selected from the TextArena word list
  • Default max turns: 6
  • Reward is based on number of guesses (fewer is better)
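To make the "fewer is better" shaping concrete, here is one plausible way such a reward could be computed. This is an illustration of the idea only, not TextArena's actual formula:

```python
def wordle_shaped_reward(won: bool, guesses: int, max_turns: int = 6) -> float:
    """Illustrative shaping only: full credit for a first-guess win,
    linearly less credit for later wins, zero for a loss."""
    if not won:
        return 0.0
    return (max_turns - guesses + 1) / max_turns

print(wordle_shaped_reward(True, 1))   # 1.0
print(wordle_shaped_reward(True, 6))   # ~0.167
print(wordle_shaped_reward(False, 6))  # 0.0
```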

TwentyQuestions

  • Model asks yes/no questions to guess the object
  • Limited to 20 questions
  • Reward for correct guess within question limit

Chess

  • Moves in algebraic notation (e.g., “e2e4”)
  • Game state includes board representation
  • Reward based on game outcome

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Game reward (task-specific) |
| num_turns | Number of turns taken |
| format_reward | XML format compliance (if parser used) |

Best Practices

When wrapping new TextArena games, investigate the source code to understand the observation format. Many games return full state rather than turn-level diffs.
  • Use feedback_fn - Transform full-state observations to incremental feedback
  • Test locally first - Try a few games manually to understand difficulty
  • Validate parsing - Ensure your parser extracts the right fields
  • Custom prompts - Game-specific instructions improve performance
  • Seed consistency - Use same seed for reproducible experiments

Troubleshooting

NLTK Download Errors

TextArena uses NLTK for word games, and the environment downloads the required corpora automatically. If downloads still fail (for example, behind a proxy or in an offline environment), fetch them manually:
import nltk
nltk.download('words', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

Invalid Moves

If the model makes invalid moves (e.g., non-existent words in Wordle):
  • Improve the system prompt with game rules
  • Add examples of valid moves in few-shot prompts
  • Use a more capable model

Memory Issues

For large-scale parallel rollouts:
  • The environment automatically shares immutable data
  • If still seeing issues, reduce num_train_examples
  • Consider running evaluation in batches
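Batching can be as simple as slicing the example indices into ranges and evaluating each range separately. This is a generic sketch, independent of the verifiers API:

```python
def batches(n_examples: int, batch_size: int):
    """Yield (start, end) index ranges covering n_examples."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

print(list(batches(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```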

Examples

See the gem-wordle example in the Verifiers repository for a complete implementation.
