Multi-turn environments enable back-and-forth interaction between the model and the environment. They’re perfect for games, simulations, debugging tasks, and any scenario where the model needs multiple attempts or receives feedback after each action.

Overview

MultiTurnEnv implements the core rollout loop used by all Verifiers environments (even SingleTurnEnv is just a MultiTurnEnv with max_turns=1). Each rollout follows this pattern:
  1. Initialize state — setup_state() prepares per-rollout resources
  2. Loop until done:
    • Get prompt messages (initial prompt or previous conversation + environment response)
    • Get model response
    • Check stop conditions — exit if any @vf.stop method returns True
  3. Render completion — assemble final conversation into state["completion"]
  4. Cleanup — run all @vf.cleanup methods

The Rollout Loop

Here’s the core structure of a multi-turn rollout:
class MultiTurnEnv(vf.Environment):
    async def rollout(self, input, client, model, sampling_args):
        state = await self.init_state(input, client, model, sampling_args)
        
        try:
            state = await self.setup_state(state)  # 1. Initialize
            
            while not await self.is_completed(state):  # 2. Loop
                prompt_messages = await self.get_prompt_messages(state)
                response = await self.get_model_response(state, prompt_messages)
                await self.add_model_response(state, prompt_messages, response)
            
            await self.render_completion(state)  # 3. Finalize
            return state
        finally:
            await self._cleanup(state)  # 4. Cleanup
To build a custom multi-turn environment, you override specific methods:
  • env_response() — Required. Define how the environment responds after each model turn
  • setup_state() — Optional. Initialize per-rollout resources
  • @vf.stop methods — Optional. Define custom stop conditions
  • @vf.cleanup methods — Optional. Cleanup resources after each rollout

Building a Custom Environment

Let’s build a simple number guessing game:
1. Define the Environment Class
import verifiers as vf
import random

class NumberGuessingEnv(vf.MultiTurnEnv):
    def __init__(self, max_turns: int = 10, **kwargs):
        super().__init__(max_turns=max_turns, **kwargs)
2. Initialize Per-Rollout State
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def setup_state(self, state: vf.State) -> vf.State:
        # Pick a random number for this rollout
        state["target_number"] = random.randint(1, 100)
        state["attempts"] = 0
        return await super().setup_state(state)
3. Implement Environment Response
The env_response() method defines what happens after each model turn:
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        """Process the guess and return feedback."""
        # Extract the guess from the model's response
        last_message = messages[-1]["content"]
        
        try:
            guess = int(last_message.strip())
        except ValueError:
            return [{"role": "user", "content": "Please provide a number."}]
        
        state["attempts"] += 1
        target = state["target_number"]
        
        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": f"Correct! The number was {target}."}]
        elif guess < target:
            return [{"role": "user", "content": "Too low. Try again."}]
        else:
            return [{"role": "user", "content": "Too high. Try again."}]
4. Add Stop Conditions
Define when the rollout should end:
class NumberGuessingEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)
Built-in stop conditions:
  • has_error — stops if state["error"] is set
  • max_turns_reached — stops after max_turns iterations
  • prompt_too_long — stops if prompt exceeds model context
5. Create Dataset and Rubric
from datasets import Dataset

def load_environment():
    # Each row is one game instance
    dataset = Dataset.from_list([
        {"prompt": [{"role": "user", "content": "Guess a number between 1 and 100."}]}
        for _ in range(100)
    ])
    
    # Reward functions
    async def won_game(state) -> float:
        return 1.0 if state.get("won", False) else 0.0
    
    async def efficiency_bonus(state) -> float:
        if not state.get("won", False):
            return 0.0
        attempts = state.get("attempts", 10)
        return max(0.0, 1.0 - (attempts / 10))  # Bonus for fewer attempts
    
    rubric = vf.Rubric(
        funcs=[won_game, efficiency_bonus],
        weights=[1.0, 0.5]
    )
    
    return NumberGuessingEnv(dataset=dataset, rubric=rubric, max_turns=10)
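As a sanity check on this reward design, here is a standalone sketch (plain Python, independent of verifiers) showing that a bisection strategy always wins within seven guesses on 1–100, so an optimal player earns the full win reward plus a positive efficiency bonus:

```python
def play_binary_search(target: int, low: int = 1, high: int = 100) -> int:
    """Guess by bisection against the feedback rules above; returns attempts used."""
    attempts = 0
    while True:
        guess = (low + high) // 2
        attempts += 1
        if guess == target:
            return attempts
        if guess < target:   # "Too low" feedback
            low = guess + 1
        else:                # "Too high" feedback
            high = guess - 1

def efficiency_bonus(attempts: int) -> float:
    """Same formula as the reward function above, for a won game."""
    return max(0.0, 1.0 - (attempts / 10))

worst = max(play_binary_search(t) for t in range(1, 101))
print(worst)  # 7
```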
    

Real Example: Wordle

Let’s examine the wordle environment from the repository:
environments/wordle/wordle.py
import re
import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

DEFAULT_SYSTEM_PROMPT = """You are a competitive game player. \
Make sure you read the game instructions carefully, and always follow the required format.

In each turn, think step-by-step, then give your guess inside <guess>...</guess> tags."""

def wordle_feedback_fn(observation: str) -> str:
    """Extract just the latest feedback from the game state."""
    latest_observation = observation.split("[GAME]")[-1].strip()
    if "Feedback:" in latest_observation:
        return latest_observation.split("Feedback:")[-1]
    else:
        return latest_observation

def correct_answer(parser, completion, answer, **kwargs) -> float:
    """Whether the guess is *exactly* correct."""
    guess = parser.parse_answer(completion)
    return 1.0 if guess == "[" + answer + "]" else 0.0

def length_bonus(parser, completion, answer, **kwargs) -> float:
    """Bonus for shorter correct solutions."""
    assistant_messages = parser.get_assistant_messages(completion)
    guesses = [x for x in assistant_messages if re.search(r"<guess>.*</guess>", x["content"])]
    is_correct = correct_answer(parser, completion, answer, **kwargs)
    return is_correct / (len(guesses) or 1)

def load_environment(
    num_train_examples: int = 2000,
    num_eval_examples: int = 20,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    seed: int = 0,
    **kwargs,
):
    parser = vf.XMLParser(fields=["guess"], answer_field="guess")
    
    rubric = vf.Rubric(parser=parser)
    rubric.add_reward_func(correct_answer)
    rubric.add_reward_func(length_bonus)
    
    return TextArenaEnv(
        game="Wordle-v0",
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        feedback_fn=wordle_feedback_fn,
        seed=seed,
        system_prompt=system_prompt,
        parser=parser,
        rubric=rubric,
        **kwargs,
    )

Key features:
  • Wraps a TextArena game environment
  • Uses XMLParser to extract guesses from structured output
  • Custom feedback_fn cleans up the game state for the model
  • Multiple reward functions: correctness + length bonus
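Because wordle_feedback_fn is pure string processing, it is easy to verify in isolation. The observation text below is illustrative (the exact TextArena format may differ):

```python
def wordle_feedback_fn(observation: str) -> str:
    """Extract just the latest feedback from the game state (copied from above)."""
    latest_observation = observation.split("[GAME]")[-1].strip()
    if "Feedback:" in latest_observation:
        return latest_observation.split("Feedback:")[-1]
    else:
        return latest_observation

# Made-up observation roughly in the shape TextArena produces
obs = "[GAME] You submitted [crane].\nFeedback: c r a n e -> X G X X X"
print(wordle_feedback_fn(obs).strip())  # c r a n e -> X G X X X
```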

Advanced Patterns

Custom Stop Conditions

Control when rollouts end with @vf.stop decorators:
class MyGameEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)
    
    @vf.stop
    async def game_lost(self, state: vf.State) -> bool:
        return state.get("lives", 3) <= 0
    
    @vf.stop(priority=10)  # Check this first
    async def answer_submitted(self, state: vf.State) -> bool:
        completion = state.get("completion", [])
        if not completion:
            return False
        return "FINAL ANSWER:" in completion[-1].get("content", "")

Priority ordering (higher runs first) lets you check cheap conditions before expensive ones.
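The ordering semantics can be pictured with a framework-free sketch (an illustration of the concept, not verifiers internals):

```python
# Hypothetical registry of (name, priority, check) triples, mimicking @vf.stop.
checks = [
    ("game_won", 0, lambda state: state.get("won", False)),
    ("answer_submitted", 10, lambda state: "FINAL ANSWER:" in state.get("last_content", "")),
]

def first_triggered(state: dict):
    # Higher priority runs first; short-circuit on the first True check.
    for name, _priority, check in sorted(checks, key=lambda c: c[1], reverse=True):
        if check(state):
            return name
    return None

print(first_triggered({"won": True, "last_content": "FINAL ANSWER: 42"}))  # answer_submitted
```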

Early Termination from env_response

Signal completion directly from the environment response:
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        if check_game_over(state):
            final_message = [
                {"role": "user", "content": f"Game over! Final score: {state['score']}"}
            ]
            state["final_env_response"] = final_message
            return final_message
        
        # Normal game continues...
        return process_turn(messages, state)

Setting state["final_env_response"] bypasses the model response loop and terminates immediately.

Cleanup and Resource Management

Use decorators for proper resource cleanup:
class MyGameEnv(vf.MultiTurnEnv):
    @vf.cleanup
    async def save_game_log(self, state: vf.State):
        """Called after each rollout completes."""
        await log_game_result(state["game_id"], state["score"])
    
    @vf.teardown
    async def close_connections(self):
        """Called once when environment shuts down."""
        await self.db_connection.close()

Important: Cleanup methods should be idempotent (safe to call multiple times) and handle errors gracefully. This ensures correct behavior when rollouts are cancelled or interrupted.
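One simple way to get idempotence is to pop the resource out of state before releasing it, so a second invocation finds nothing to do. A minimal sketch (the connection class is a stand-in, not a verifiers API):

```python
class FakeConnection:
    """Stand-in resource that counts how many times it was closed."""
    def __init__(self):
        self.close_count = 0
    def close(self):
        self.close_count += 1

def release_connection(state: dict) -> None:
    # pop() removes the resource, so repeat calls become no-ops
    conn = state.pop("connection", None)
    if conn is not None:
        conn.close()

state = {"connection": FakeConnection()}
conn = state["connection"]
release_connection(state)
release_connection(state)  # safe: already released
print(conn.close_count)  # 1
```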

Custom Message Assembly

Override get_prompt_messages() for non-linear conversations:
class MyGameEnv(vf.MultiTurnEnv):
    async def get_prompt_messages(self, state: vf.State) -> vf.Messages:
        if len(state["trajectory"]) == 0:
            # First turn: return initial prompt
            return state["prompt"]
        
        # Subsequent turns: reconstruct conversation with game state
        messages = []
        messages.append({"role": "system", "content": self.system_prompt})
        
        for turn in state["trajectory"]:
            messages.extend(turn["completion"])
        
        # Add environment response
        env_response = await self.env_response(messages, state)
        messages.extend(env_response)
        
        return messages

Trajectory Tracking

Add metadata to each turn:
class MyGameEnv(vf.MultiTurnEnv):
    async def add_trajectory_step(self, state: vf.State, trajectory_step):
        """Add custom metadata to each turn."""
        trajectory_step["extras"]["board_state"] = state["board"].copy()
        trajectory_step["extras"]["valid_moves"] = state["valid_moves"]
        await super().add_trajectory_step(state, trajectory_step)

Error Handling

Verifiers provides a hierarchy of error types under vf.Error:
vf.ModelError           # Model interaction errors
vf.OverlongPromptError  # Prompt exceeds context length
vf.ToolError            # Tool-related errors
vf.InfraError           # Infrastructure errors (e.g., sandbox)

When a vf.Error is raised during a rollout:
  1. It’s caught automatically
  2. It’s stored in state["error"]
  3. The built-in has_error stop condition triggers
  4. The rollout terminates gracefully
Example:
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        try:
            result = await self.external_api.call(messages)
            return [{"role": "user", "content": result}]
        except ExternalAPIError as e:
            raise vf.InfraError(f"API call failed: {e}") from e

Monitor Rubrics

Track environment-specific metrics automatically:
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.average_score)
        self.add_metric(self.total_moves)
    
    async def average_score(self, state: vf.State) -> float:
        turns = len(state["trajectory"])
        total_score = state.get("score", 0)
        return total_score / max(turns, 1)
    
    async def total_moves(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

class MyGameEnv(vf.MultiTurnEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_rubric(MyMonitorRubric())

MultiTurnEnv automatically tracks num_turns for all multi-turn environments.

Testing Your Environment

1. Install and run a quick test
prime env install my-game-env
prime eval run my-game-env -m gpt-4.1-mini -n 5 -r 3

2. Check metrics
Expected output:
Loading environment: my-game-env
Running 5 examples × 3 rollouts = 15 total rollouts
Progress: ████████████████████ 15/15 (100%)

Results:
  Reward: 0.67 ± 0.21
  won_game: 0.67 ± 0.47
  efficiency_bonus: 0.23 ± 0.18
  num_turns: 6.2 ± 2.1

3. Debug with verbose mode
prime eval run my-game-env -m gpt-4.1-mini -n 2 -v

Shows detailed logs including:
  • Model requests and responses
  • Environment responses
  • State updates
  • Stop condition checks

4. Save detailed results
prime eval run my-game-env -m gpt-4.1-mini -n 10 -s -C "attempts,won,target_number"

Saves results to ./environments/my_game_env/outputs/evals/ including custom state columns.

Common Pitfalls

Don’t override rollout() — The base implementation handles the core loop correctly. Override specific methods like env_response(), setup_state(), and stop conditions instead.
Return new messages, don’t mutate — env_response() should return a list of new messages to append, not modify existing messages.
Make cleanup idempotent — Cleanup methods may be called multiple times or when resources are in unexpected states. Handle errors gracefully.
