Multi-turn environments enable back-and-forth interaction between the model and the environment. They’re perfect for games, simulations, debugging tasks, and any scenario where the model needs multiple attempts or receives feedback after each action.
Overview
MultiTurnEnv implements the core rollout loop used by all Verifiers environments (even SingleTurnEnv is just a MultiTurnEnv with max_turns=1). Each rollout follows this pattern:
- Initialize state — `setup_state()` prepares per-rollout resources
- Loop until done:
  - Get prompt messages (initial prompt or previous conversation + environment response)
  - Get model response
  - Check stop conditions — exit if any `@vf.stop` method returns True
- Render completion — assemble the final conversation into `state["completion"]`
- Cleanup — run all `@vf.cleanup` methods
The Rollout Loop
Here’s the core structure of a multi-turn rollout:
```python
class MultiTurnEnv(vf.Environment):
    async def rollout(self, input, client, model, sampling_args):
        state = await self.init_state(input, client, model, sampling_args)
        try:
            state = await self.setup_state(state)          # 1. Initialize
            while not await self.is_completed(state):      # 2. Loop
                prompt_messages = await self.get_prompt_messages(state)
                response = await self.get_model_response(state, prompt_messages)
                await self.add_model_response(state, prompt_messages, response)
            await self.render_completion(state)            # 3. Finalize
            return state
        finally:
            await self._cleanup(state)                     # 4. Cleanup
```
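Stripped of the library plumbing, the same control flow can be sketched as a plain loop. This is a standalone illustration with hypothetical names, not the Verifiers API:

```python
# Minimal sketch of the rollout control flow: loop until a stop condition
# fires or the turn budget runs out, then render the final completion.
# (Hypothetical helper, not the Verifiers implementation.)
def run_rollout(get_response, stop_conditions, max_turns=10):
    state = {"trajectory": [], "completion": None}
    while len(state["trajectory"]) < max_turns:
        response = get_response(state["trajectory"])   # model turn
        state["trajectory"].append(response)           # record the turn
        if any(stop(state) for stop in stop_conditions):
            break                                      # a stop condition fired
    state["completion"] = list(state["trajectory"])    # render completion
    return state
```

The real implementation is async and also interleaves environment responses, but the skeleton — loop, check stops, finalize — is the same.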
To build a custom multi-turn environment, you override specific methods:

- `env_response()` — Required. Defines how the environment responds after each model turn
- `setup_state()` — Optional. Initializes per-rollout resources
- `@vf.stop` methods — Optional. Define custom stop conditions
- `@vf.cleanup` methods — Optional. Clean up resources after each rollout
Building a Custom Environment
Let’s build a simple number guessing game:
Define the Environment Class
```python
import verifiers as vf
import random

class NumberGuessingEnv(vf.MultiTurnEnv):
    def __init__(self, max_turns: int = 10, **kwargs):
        super().__init__(max_turns=max_turns, **kwargs)
```
Initialize Per-Rollout State
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def setup_state(self, state: vf.State) -> vf.State:
        # Pick a random number for this rollout
        state["target_number"] = random.randint(1, 100)
        state["attempts"] = 0
        return await super().setup_state(state)
```
Implement Environment Response
The env_response() method defines what happens after each model turn:
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        """Process the guess and return feedback."""
        # Extract the guess from the model's response
        last_message = messages[-1]["content"]
        try:
            guess = int(last_message.strip())
        except ValueError:
            return [{"role": "user", "content": "Please provide a number."}]

        state["attempts"] += 1
        target = state["target_number"]
        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": f"Correct! The number was {target}."}]
        elif guess < target:
            return [{"role": "user", "content": "Too low. Try again."}]
        else:
            return [{"role": "user", "content": "Too high. Try again."}]
```
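The feedback logic is easy to exercise in isolation. Here it is extracted as a pure function (an illustrative sketch; `guess_feedback` is not part of the environment class):

```python
# The guessing-game feedback logic from env_response, as a pure function.
def guess_feedback(raw_guess: str, target: int) -> str:
    try:
        guess = int(raw_guess.strip())
    except ValueError:
        return "Please provide a number."
    if guess == target:
        return f"Correct! The number was {target}."
    elif guess < target:
        return "Too low. Try again."
    return "Too high. Try again."
```

Keeping game logic in pure functions like this makes the environment itself a thin async wrapper that is easier to unit-test.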
Add Stop Conditions
Define when the rollout should end:
```python
class NumberGuessingEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)
```
Built-in stop conditions:

- `has_error` — stops if `state["error"]` is set
- `max_turns_reached` — stops after `max_turns` iterations
- `prompt_too_long` — stops if the prompt exceeds the model's context length
Create Dataset and Rubric
```python
from datasets import Dataset

def load_environment():
    # Each row is one game instance
    dataset = Dataset.from_list([
        {"prompt": [{"role": "user", "content": "Guess a number between 1 and 100."}]}
        for _ in range(100)
    ])

    # Reward functions
    async def won_game(state) -> float:
        return 1.0 if state.get("won", False) else 0.0

    async def efficiency_bonus(state) -> float:
        if not state.get("won", False):
            return 0.0
        attempts = state.get("attempts", 10)
        return max(0.0, 1.0 - (attempts / 10))  # Bonus for fewer attempts

    rubric = vf.Rubric(
        funcs=[won_game, efficiency_bonus],
        weights=[1.0, 0.5],
    )

    return NumberGuessingEnv(dataset=dataset, rubric=rubric, max_turns=10)
```
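To see what those weights mean in practice, here is the reward arithmetic as a standalone sketch, assuming the rubric combines reward values as a weighted sum (the helper name is illustrative):

```python
# Sketch of the combined reward under weights (1.0, 0.5), assuming the
# rubric computes a weighted sum of the individual reward functions.
def total_reward(won: bool, attempts: int, weights=(1.0, 0.5)) -> float:
    won_game = 1.0 if won else 0.0
    efficiency_bonus = max(0.0, 1.0 - attempts / 10) if won else 0.0
    return weights[0] * won_game + weights[1] * efficiency_bonus
```

A win in 4 attempts scores 1.0 + 0.5 × 0.6 = 1.3, while any loss scores 0.0 regardless of attempts.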
Real Example: Wordle
Let’s examine the wordle environment from the repository:
environments/wordle/wordle.py
```python
import re

import verifiers as vf
from verifiers.envs.integrations.textarena_env import TextArenaEnv

DEFAULT_SYSTEM_PROMPT = """You are a competitive game player. \
Make sure you read the game instructions carefully, and always follow the required format.
In each turn, think step-by-step, then give your guess inside <guess>...</guess> tags."""

def wordle_feedback_fn(observation: str) -> str:
    """Extract just the latest feedback from the game state."""
    latest_observation = observation.split("[GAME]")[-1].strip()
    if "Feedback:" in latest_observation:
        return latest_observation.split("Feedback:")[-1]
    else:
        return latest_observation

def correct_answer(parser, completion, answer, **kwargs) -> float:
    """Whether the guess is *exactly* correct."""
    guess = parser.parse_answer(completion)
    return 1.0 if guess == "[" + answer + "]" else 0.0

def length_bonus(parser, completion, answer, **kwargs) -> float:
    """Bonus for shorter correct solutions."""
    assistant_messages = parser.get_assistant_messages(completion)
    guesses = [x for x in assistant_messages if re.search(r"<guess>.*</guess>", x["content"])]
    is_correct = correct_answer(parser, completion, answer, **kwargs)
    return is_correct / (len(guesses) or 1)

def load_environment(
    num_train_examples: int = 2000,
    num_eval_examples: int = 20,
    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
    seed: int = 0,
    **kwargs,
):
    parser = vf.XMLParser(fields=["guess"], answer_field="guess")
    rubric = vf.Rubric(parser=parser)
    rubric.add_reward_func(correct_answer)
    rubric.add_reward_func(length_bonus)

    return TextArenaEnv(
        game="Wordle-v0",
        num_train_examples=num_train_examples,
        num_eval_examples=num_eval_examples,
        feedback_fn=wordle_feedback_fn,
        seed=seed,
        system_prompt=system_prompt,
        parser=parser,
        rubric=rubric,
        **kwargs,
    )
```
Key features:

- Wraps a TextArena game environment
- Uses `XMLParser` to extract guesses from structured output
- A custom `feedback_fn` cleans up the game state for the model
- Multiple reward functions: correctness + efficiency bonus
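The `length_bonus` reward decays with the number of guesses: a correct solve in n guesses earns 1/n, so a one-guess solve earns the full bonus. As standalone arithmetic (the helper name here is illustrative, not part of the environment):

```python
# Core arithmetic of length_bonus: correctness scaled by 1/num_guesses,
# with a guard so zero recorded guesses doesn't divide by zero.
def length_bonus_value(is_correct: float, num_guesses: int) -> float:
    return is_correct / (num_guesses or 1)
```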
Advanced Patterns
Custom Stop Conditions
Control when rollouts end with @vf.stop decorators:
```python
class MyGameEnv(vf.MultiTurnEnv):
    @vf.stop
    async def game_won(self, state: vf.State) -> bool:
        return state.get("won", False)

    @vf.stop
    async def game_lost(self, state: vf.State) -> bool:
        return state.get("lives", 3) <= 0

    @vf.stop(priority=10)  # Check this first
    async def answer_submitted(self, state: vf.State) -> bool:
        completion = state.get("completion", [])
        if not completion:
            return False
        return "FINAL ANSWER:" in completion[-1].get("content", "")
```
Priority ordering (higher runs first) lets you check cheap conditions before expensive ones.
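The priority mechanism amounts to sorting conditions before evaluating them and short-circuiting on the first hit. A standalone sketch (hypothetical helper, not the Verifiers internals):

```python
# Priority-ordered stop checks: higher-priority predicates run first, and
# evaluation short-circuits at the first predicate that returns True.
def should_stop(state, conditions):
    """conditions: list of (priority, predicate) pairs."""
    for _, predicate in sorted(conditions, key=lambda c: -c[0]):
        if predicate(state):
            return True
    return False
```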
Early Termination from env_response
Signal completion directly from the environment response:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        if check_game_over(state):
            final_message = [
                {"role": "user", "content": f"Game over! Final score: {state['score']}"}
            ]
            state["final_env_response"] = final_message
            return final_message
        # Normal game continues...
        return process_turn(messages, state)
```
Setting state["final_env_response"] bypasses the model response loop and terminates immediately.
Cleanup and Resource Management
Use decorators for proper resource cleanup:
```python
class MyGameEnv(vf.MultiTurnEnv):
    @vf.cleanup
    async def save_game_log(self, state: vf.State):
        """Called after each rollout completes."""
        await log_game_result(state["game_id"], state["score"])

    @vf.teardown
    async def close_connections(self):
        """Called once when the environment shuts down."""
        await self.db_connection.close()
```
Important: Cleanup methods should be idempotent (safe to call multiple times) and handle errors gracefully. This ensures correct behavior when rollouts are cancelled or interrupted.
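A common way to make cleanup idempotent is a guard flag, so the underlying resource is released at most once no matter how many times cleanup runs. A standalone sketch (synchronous for brevity; names hypothetical):

```python
# Idempotent cleanup via a guard flag: repeated close() calls are no-ops.
class GameSession:
    def __init__(self):
        self.closed = False
        self.close_count = 0  # tracks real releases, for illustration

    def close(self):
        if self.closed:        # already cleaned up: safe no-op
            return
        self.closed = True
        self.close_count += 1  # release the real resource here
```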
Custom Message Assembly
Override get_prompt_messages() for non-linear conversations:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def get_prompt_messages(self, state: vf.State) -> vf.Messages:
        if len(state["trajectory"]) == 0:
            # First turn: return the initial prompt
            return state["prompt"]

        # Subsequent turns: reconstruct the conversation with game state
        messages = []
        messages.append({"role": "system", "content": self.system_prompt})
        for turn in state["trajectory"]:
            messages.extend(turn["completion"])

        # Add the environment response
        env_response = await self.env_response(messages, state)
        messages.extend(env_response)
        return messages
```
Trajectory Tracking
Add metadata to each turn:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def add_trajectory_step(self, state: vf.State, trajectory_step):
        """Add custom metadata to each turn."""
        trajectory_step["extras"]["board_state"] = state["board"].copy()
        trajectory_step["extras"]["valid_moves"] = state["valid_moves"]
        await super().add_trajectory_step(state, trajectory_step)
```
Error Handling
Verifiers provides a hierarchy of error types under vf.Error:
```python
vf.ModelError           # Model interaction errors
vf.OverlongPromptError  # Prompt exceeds context length
vf.ToolError            # Tool-related errors
vf.InfraError           # Infrastructure errors (e.g., sandbox)
```
When a vf.Error is raised during a rollout:
- It's caught automatically
- It's stored in `state["error"]`
- The built-in `has_error` stop condition triggers
- The rollout terminates gracefully
Example:
```python
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        try:
            result = await self.external_api.call(messages)
            return [{"role": "user", "content": result}]
        except ExternalAPIError as e:
            raise vf.InfraError(f"API call failed: {e}") from e
```
Monitor Rubrics
Track environment-specific metrics automatically:
```python
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.average_score)
        self.add_metric(self.total_moves)

    async def average_score(self, state: vf.State) -> float:
        turns = len(state["trajectory"])
        total_score = state.get("score", 0)
        return total_score / max(turns, 1)

    async def total_moves(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

class MyGameEnv(vf.MultiTurnEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_rubric(MyMonitorRubric())
```
MultiTurnEnv automatically tracks num_turns for all multi-turn environments.
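The `average_score` metric reduces to a per-turn mean guarded against zero turns; as plain arithmetic (illustrative helper, outside the rubric class):

```python
# Per-turn mean of the accumulated score, guarded against zero turns.
def average_score(total_score: float, num_turns: int) -> float:
    return total_score / max(num_turns, 1)
```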
Testing Your Environment
Install the environment and run a quick test:

```bash
prime env install my-game-env
prime eval run my-game-env -m gpt-4.1-mini -n 5 -r 3
```

Example output:

```
Loading environment: my-game-env
Running 5 examples × 3 rollouts = 15 total rollouts
Progress: ████████████████████ 15/15 (100%)

Results:
  Reward: 0.67 ± 0.21
  won_game: 0.67 ± 0.47
  efficiency_bonus: 0.23 ± 0.18
  num_turns: 6.2 ± 2.1
```

For verbose output:

```bash
prime eval run my-game-env -m gpt-4.1-mini -n 2 -v
```

This shows detailed logs including:

- Model requests and responses
- Environment responses
- State updates
- Stop condition checks

To save results along with custom state columns:

```bash
prime eval run my-game-env -m gpt-4.1-mini -n 10 -s -C "attempts,won,target_number"
```

This saves results to ./environments/my_game_env/outputs/evals/ including the requested state columns.
Common Pitfalls
Don’t override rollout() — The base implementation handles the core loop correctly. Override specific methods like env_response(), setup_state(), and stop conditions instead.
Return new messages, don’t mutate — env_response() should return a list of new messages to append, not modify existing messages.
Make cleanup idempotent — Cleanup methods may be called multiple times or when resources are in unexpected states. Handle errors gracefully.
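The "return new messages" pitfall above can be made concrete: the environment builds a fresh list to append, leaving the existing history untouched (a standalone illustration with hypothetical names):

```python
# Correct pattern: env_response builds and returns a NEW list of messages.
# It never mutates the conversation history it receives.
def env_feedback(messages, note):
    # Good: return a fresh list; do not call messages.append(...) here
    return [{"role": "user", "content": note}]

history = [{"role": "assistant", "content": "42"}]
new_messages = env_feedback(history, "Too low. Try again.")
```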
Next Steps