Multi-turn environments enable back-and-forth interaction between the model and the environment. They’re perfect for games, simulations, debugging tasks, and any scenario where the model needs multiple attempts or receives feedback after each action.
Overview
MultiTurnEnv implements the core rollout loop used by all Verifiers environments (even SingleTurnEnv is just a MultiTurnEnv with max_turns=1). Each rollout follows this pattern:
- Initialize state — `setup_state()` prepares per-rollout resources
- Loop until done:
  - Get prompt messages (initial prompt or previous conversation + environment response)
  - Get model response
  - Check stop conditions — exit if any `@vf.stop` method returns True
- Render completion — assemble the final conversation into `state["completion"]`
- Cleanup — run all `@vf.cleanup` methods
The Rollout Loop
Here’s the core structure of a multi-turn rollout:
```python
class MultiTurnEnv(vf.Environment):
    async def rollout(self, input, client, model, sampling_args):
        state = await self.init_state(input, client, model, sampling_args)
        try:
            state = await self.setup_state(state)          # 1. Initialize
            while not await self.is_completed(state):      # 2. Loop
                prompt_messages = await self.get_prompt_messages(state)
                response = await self.get_model_response(state, prompt_messages)
                await self.add_model_response(state, prompt_messages, response)
            await self.render_completion(state)            # 3. Finalize
            return state
        finally:
            await self._cleanup(state)                     # 4. Cleanup
```
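Stripped of the library plumbing, the same control flow can be sketched as a plain loop. This is a standalone illustration with hypothetical names, not the Verifiers API:

```python
# Minimal sketch of the rollout control flow: loop until a stop condition
# fires or the turn budget runs out, then render the final completion.
# (Hypothetical helper, not the Verifiers implementation.)
def run_rollout(get_response, stop_conditions, max_turns=10):
    state = {"trajectory": [], "completion": None}
    while len(state["trajectory"]) < max_turns:
        response = get_response(state["trajectory"])   # model turn
        state["trajectory"].append(response)           # record the turn
        if any(stop(state) for stop in stop_conditions):
            break                                      # a stop condition fired
    state["completion"] = list(state["trajectory"])    # render completion
    return state
```

The real implementation is async and also interleaves environment responses, but the skeleton — loop, check stops, finalize — is the same.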
To build a custom multi-turn environment, you override specific methods:

- `env_response()` — Required. Defines how the environment responds after each model turn
- `setup_state()` — Optional. Initializes per-rollout resources
- `@vf.stop` methods — Optional. Define custom stop conditions
- `@vf.cleanup` methods — Optional. Clean up resources after each rollout
Building a Custom Environment
Let’s build a simple number guessing game:
Define the Environment Class
```python
import verifiers as vf
import random

class NumberGuessingEnv(vf.MultiTurnEnv):
    def __init__(self, max_turns: int = 10, **kwargs):
        super().__init__(max_turns=max_turns, **kwargs)
```
Initialize Per-Rollout State
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def setup_state(self, state: vf.State) -> vf.State:
        # Pick a random number for this rollout
        state["target_number"] = random.randint(1, 100)
        state["attempts"] = 0
        return await super().setup_state(state)
```
Implement Environment Response
The env_response() method defines what happens after each model turn:
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        """Process the guess and return feedback."""
        # Extract the guess from the model's response
        last_message = messages[-1]["content"]
        try:
            guess = int(last_message.strip())
        except ValueError:
            return [{"role": "user", "content": "Please provide a number."}]

        state["attempts"] += 1
        target = state["target_number"]
        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": f"Correct! The number was {target}."}]
        elif guess < target:
            return [{"role": "user", "content": "Too low. Try again."}]
        else:
            return [{"role": "user", "content": "Too high. Try again."}]
```
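The feedback logic is easy to exercise in isolation. Here it is extracted as a pure function (an illustrative sketch; `guess_feedback` is not part of the environment class):

```python
# The guessing-game feedback logic from env_response, as a pure function.
def guess_feedback(raw_guess: str, target: int) -> str:
    try:
        guess = int(raw_guess.strip())
    except ValueError:
        return "Please provide a number."
    if guess == target:
        return f"Correct! The number was {target}."
    elif guess < target:
        return "Too low. Try again."
    return "Too high. Try again."
```

Keeping game logic in pure functions like this makes the environment itself a thin async wrapper that is easier to unit-test.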
Add Stop Conditions
Define when the rollout should end:
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)
```
Built-in stop conditions:

- `has_error` — stops if `state["error"]` is set
- `max_turns_reached` — stops after `max_turns` iterations
- `prompt_too_long` — stops if the prompt exceeds the model's context length
Create Dataset and Rubric
```python
from datasets import Dataset

def load_environment():
    # Each row is one game instance
    dataset = Dataset.from_list([
        {"prompt": [{"role": "user", "content": "Guess a number between 1 and 100."}]}
        for _ in range(100)
    ])

    # Reward functions
    async def won_game(state) -> float:
        return 1.0 if state.get("won", False) else 0.0

    async def efficiency_bonus(state) -> float:
        if not state.get("won", False):
            return 0.0
        attempts = state.get("attempts", 10)
        return max(0.0, 1.0 - (attempts / 10))  # Bonus for fewer attempts

    rubric = vf.Rubric(
        funcs=[won_game, efficiency_bonus],
        weights=[1.0, 0.5],
    )

    return NumberGuessingEnv(dataset=dataset, rubric=rubric, max_turns=10)
```
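To see what those weights mean in practice, here is the reward arithmetic as a standalone sketch, assuming the rubric combines reward values as a weighted sum (the helper name is illustrative):

```python
# Sketch of the combined reward under weights (1.0, 0.5), assuming the
# rubric computes a weighted sum of the individual reward functions.
def total_reward(won: bool, attempts: int, weights=(1.0, 0.5)) -> float:
    won_game = 1.0 if won else 0.0
    efficiency_bonus = max(0.0, 1.0 - attempts / 10) if won else 0.0
    return weights[0] * won_game + weights[1] * efficiency_bonus
```

A win in 4 attempts scores 1.0 + 0.5 × 0.6 = 1.3, while any loss scores 0.0 regardless of attempts.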
Real Example: Wordle
Let’s examine the wordle environment from the repository:
environments/wordle/wordle.py
```python
import re

import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

DEFAULT_SYSTEM_PROMPT = """You are a competitive game player. \
Make sure you read the game instructions carefully, and always follow the required format.
In each turn, think step-by-step, then give your guess inside <guess>...</guess> tags."""

def wordle_feedback_fn(observation: str) -> str:
    """Extract just the latest feedback from the game state."""
    latest_observation = observation.split("[GAME]")[-1].strip()
    if "Feedback:" in latest_observation:
        return latest_observation.split("Feedback:")[-1]
    else:
        return latest_observation

def correct_answer(parser, completion, answer, **kwargs) -> float:
    """Whether the guess is *exactly* correct."""
    guess = parser.parse_answer(completion)
    return 1.0 if guess == "[" + answer + "]" else 0.0

def length_bonus(parser, completion, answer, **kwargs) -> float:
    """Bonus for shorter correct solutions."""
    assistant_messages = parser.get_assistant_messages(completion)
    guesses = [x for x in assistant_messages if re.search(r"<guess>.*</guess>", x["content"])]
    is_correct = correct_answer(parser, completion, answer, **kwargs)
    return is_correct / (len(guesses) or 1)

def load_environment(
    num_train_examples: int = 2000,
    num_eval_examples: int = 20,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    seed: int = 0,
    **kwargs,
):
    parser = vf.XMLParser(fields=["guess"], answer_field="guess")
    rubric = vf.Rubric(parser=parser)
    rubric.add_reward_func(correct_answer)
    rubric.add_reward_func(length_bonus)

    return TextArenaEnv(
        game="Wordle-v0",
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        feedback_fn=wordle_feedback_fn,
        seed=seed,
        system_prompt=system_prompt,
        parser=parser,
        rubric=rubric,
        **kwargs,
    )
```
Key features:

- Wraps a TextArena game environment
- Uses `XMLParser` to extract guesses from structured output
- A custom `feedback_fn` cleans up the game state for the model
- Multiple reward functions: correctness + efficiency bonus
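The `length_bonus` reward decays with the number of guesses: a correct solve in n guesses earns 1/n, so a one-guess solve earns the full bonus. As standalone arithmetic (the helper name here is illustrative, not part of the environment):

```python
# Core arithmetic of length_bonus: correctness scaled by 1/num_guesses,
# with a guard so zero recorded guesses doesn't divide by zero.
def length_bonus_value(is_correct: float, num_guesses: int) -> float:
    return is_correct / (num_guesses or 1)
```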
Advanced Patterns
Custom Stop Conditions
Control when rollouts end with @vf.stop decorators:
```python
class MyGameEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)

    @vf.stop
    async def game_lost(self, state: vf.State) -> bool:
        return state.get("lives", 3) <= 0

    @vf.stop(priority=10)  # Check this first
    async def answer_submitted(self, state: vf.State) -> bool:
        completion = state.get("completion", [])
        if not completion:
            return False
        return "FINAL ANSWER:" in completion[-1].get("content", "")
```
Priority ordering (higher runs first) lets you check cheap conditions before expensive ones.
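The priority mechanism amounts to sorting conditions before evaluating them and short-circuiting on the first hit. A standalone sketch (hypothetical helper, not the Verifiers internals):

```python
# Priority-ordered stop checks: higher-priority predicates run first, and
# evaluation short-circuits at the first predicate that returns True.
def should_stop(state, conditions):
    """conditions: list of (priority, predicate) pairs."""
    for _, predicate in sorted(conditions, key=lambda c: -c[0]):
        if predicate(state):
            return True
    return False
```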
Early Termination from env_response
Signal completion directly from the environment response:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        if check_game_over(state):
            final_message = [
                {"role": "user", "content": f"Game over! Final score: {state['score']}"}
            ]
            state["final_env_response"] = final_message
            return final_message
        # Normal game continues...
        return process_turn(messages, state)
```
Setting state["final_env_response"] bypasses the model response loop and terminates immediately.
Cleanup and Resource Management
Use decorators for proper resource cleanup:
```python
class MyGameEnv(vf.MultiTurnEnv):
    @vf.cleanup
    async def save_game_log(self, state: vf.State):
        """Called after each rollout completes."""
        await log_game_result(state["game_id"], state["score"])

    @vf.teardown
    async def close_connections(self):
        """Called once when the environment shuts down."""
        await self.db_connection.close()
```
Important: Cleanup methods should be idempotent (safe to call multiple times) and handle errors gracefully. This ensures correct behavior when rollouts are cancelled or interrupted.
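A common way to make cleanup idempotent is a guard flag, so the underlying resource is released at most once no matter how many times cleanup runs. A standalone sketch (synchronous for brevity; names hypothetical):

```python
# Idempotent cleanup via a guard flag: repeated close() calls are no-ops.
class GameSession:
    def __init__(self):
        self.closed = False
        self.close_count = 0  # tracks real releases, for illustration

    def close(self):
        if self.closed:        # already cleaned up: safe no-op
            return
        self.closed = True
        self.close_count += 1  # release the real resource here
```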
Custom Message Assembly
Override get_prompt_messages() for non-linear conversations:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def get_prompt_messages(self, state: vf.State) -> vf.Messages:
        if len(state["trajectory"]) == 0:
            # First turn: return the initial prompt
            return state["prompt"]

        # Subsequent turns: reconstruct the conversation with game state
        messages = []
        messages.append({"role": "system", "content": self.system_prompt})
        for turn in state["trajectory"]:
            messages.extend(turn["completion"])

        # Add the environment response
        env_response = await self.env_response(messages, state)
        messages.extend(env_response)
        return messages
```
Trajectory Tracking
Add metadata to each turn:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def add_trajectory_step(self, state: vf.State, trajectory_step):
        """Add custom metadata to each turn."""
        trajectory_step["extras"]["board_state"] = state["board"].copy()
        trajectory_step["extras"]["valid_moves"] = state["valid_moves"]
        await super().add_trajectory_step(state, trajectory_step)
```
Error Handling
Verifiers provides a hierarchy of error types under vf.Error:
```python
vf.ModelError           # Model interaction errors
vf.OverlongPromptError  # Prompt exceeds context length
vf.ToolError            # Tool-related errors
vf.InfraError           # Infrastructure errors (e.g., sandbox)
```
When a vf.Error is raised during a rollout:
- It's caught automatically
- It's stored in `state["error"]`
- The built-in `has_error` stop condition triggers
- The rollout terminates gracefully
Example:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        try:
            result = await self.external_api.call(messages)
            return [{"role": "user", "content": result}]
        except ExternalAPIError as e:
            raise vf.InfraError(f"API call failed: {e}") from e
```
Monitor Rubrics
Track environment-specific metrics automatically:
```python
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.average_score)
        self.add_metric(self.total_moves)

    async def average_score(self, state: vf.State) -> float:
        turns = len(state["trajectory"])
        total_score = state.get("score", 0)
        return total_score / max(turns, 1)

    async def total_moves(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

class MyGameEnv(vf.MultiTurnEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_rubric(MyMonitorRubric())
```
MultiTurnEnv automatically tracks num_turns for all multi-turn environments.
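The `average_score` metric reduces to a per-turn mean guarded against zero turns; as plain arithmetic (illustrative helper, outside the rubric class):

```python
# Per-turn mean of the accumulated score, guarded against zero turns.
def average_score(total_score: float, num_turns: int) -> float:
    return total_score / max(num_turns, 1)
```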
Testing Your Environment
Install the environment and run a quick test:

```bash
prime env install my-game-env
prime eval run my-game-env -m gpt-4.1-mini -n 5 -r 3
```

Example output:

```
Loading environment: my-game-env
Running 5 examples × 3 rollouts = 15 total rollouts
Progress: ████████████████████ 15/15 (100%)

Results:
  Reward: 0.67 ± 0.21
  won_game: 0.67 ± 0.47
  efficiency_bonus: 0.23 ± 0.18
  num_turns: 6.2 ± 2.1
```

For verbose output:

```bash
prime eval run my-game-env -m gpt-4.1-mini -n 2 -v
```

This shows detailed logs including:

- Model requests and responses
- Environment responses
- State updates
- Stop condition checks

To save results along with custom state columns:

```bash
prime eval run my-game-env -m gpt-4.1-mini -n 10 -s -C "attempts,won,target_number"
```

This saves results to ./environments/my_game_env/outputs/evals/ including the requested state columns.
Common Pitfalls
Don’t override rollout() — The base implementation handles the core loop correctly. Override specific methods like env_response(), setup_state(), and stop conditions instead.
Return new messages, don’t mutate — env_response() should return a list of new messages to append, not modify existing messages.
Make cleanup idempotent — Cleanup methods may be called multiple times or when resources are in unexpected states. Handle errors gracefully.
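The "return new messages" pitfall above can be made concrete: the environment builds a fresh list to append, leaving the existing history untouched (a standalone illustration with hypothetical names):

```python
# Correct pattern: env_response builds and returns a NEW list of messages.
# It never mutates the conversation history it receives.
def env_feedback(messages, note):
    # Good: return a fresh list; do not call messages.append(...) here
    return [{"role": "user", "content": note}]

history = [{"role": "assistant", "content": "42"}]
new_messages = env_feedback(history, "Too low. Try again.")
```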
Next Steps