Overview

Environments are the core abstraction in Verifiers that define how language models interact with tasks. Each environment orchestrates the full lifecycle of a rollout: loading data, managing model interactions, executing tools or game logic, and computing rewards.
import verifiers as vf

def load_environment():
    # dataset and rubric are constructed elsewhere in the module
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
        system_prompt="You are a helpful assistant."
    )

Environment Hierarchy

All environments inherit from the abstract Environment base class and implement a rollout() method. The class hierarchy provides progressively more specialized interaction patterns:
Environment (abstract base)
├── SingleTurnEnv (single response Q&A)
└── MultiTurnEnv (multi-turn interactions)
    ├── ToolEnv (stateless tool calling)
    │   ├── StatefulToolEnv (tools with per-rollout state)
    │   │   ├── SandboxEnv (containerized bash execution)
    │   │   │   └── PythonEnv (persistent Python REPL)
    │   │   └── CliAgentEnv (custom agent code in sandboxes)
    │   └── MCPEnv (MCP server integration)
    └── Custom environments (games, simulations, etc.)

Environment Types

SingleTurnEnv

The simplest environment for single-response tasks where the model generates one completion per prompt.
import verifiers as vf
from datasets import Dataset

dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": "What is 2+2?"}], "answer": "4"},
    {"prompt": [{"role": "user", "content": "What is 3*5?"}], "answer": "15"},
])

async def correct_answer(completion, answer) -> float:
    # completion is a chat message list; the last entry is the model's reply
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

rubric = vf.Rubric(funcs=[correct_answer])
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
Key characteristics:
  • One model response per rollout
  • No environment feedback loop
  • Perfect for Q&A, classification, or completion tasks
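A quick way to sanity-check such an environment is a small synchronous eval, assuming an OpenAI-compatible client (the model name is a placeholder):
from openai import AsyncOpenAI

client = AsyncOpenAI()  # any OpenAI-compatible endpoint
results = env.evaluate_sync(client=client, model="gpt-4", num_examples=2)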

MultiTurnEnv

Enables multi-turn interactions where the environment responds after each model turn. Subclasses must implement env_response().
class MyGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
        """Generate environment feedback after each model turn."""
        parsed = self.parser.parse(messages[-1]["content"])
        action = parsed.action
        result = self.process_action(action, state)  # game-specific logic
        return [{"role": "user", "content": result}]
Built-in stop conditions:
  • has_error - Stops on any vf.Error in state["error"]
  • prompt_too_long - Stops if prompt exceeds model context length
  • max_turns_reached - Stops after max_turns iterations
  • has_final_env_response - Stops when state["final_env_response"] is set
Constructor parameters:
MultiTurnEnv(
    dataset: Dataset,
    rubric: Rubric,
    max_turns: int = -1,  # -1 means unlimited
    **kwargs
)

ToolEnv

Adds tool calling capabilities with stateless Python functions. Tools are automatically converted to OpenAI-compatible schemas.
async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.
    
    Args:
        expression: A mathematical expression to evaluate (e.g. "2 + 2 * 3")
    
    Returns:
        The result of the evaluation.
    """
    try:
        # NOTE: eval is unsafe on untrusted input; acceptable only in a demo
        result = eval(expression)
        return str(result)
    except Exception as e:
        return f"Error: {e}"

env = vf.ToolEnv(
    dataset=dataset,
    tools=[calculate],
    rubric=rubric,
    max_turns=10
)
Tool schema extraction:
  • Function name → tool name
  • Type hints → parameter types
  • Docstring → tool description and parameter descriptions
Stop behavior:
  • Stops when model responds without tool calls (built-in no_tools_called condition)
  • Configurable error handling via stop_errors parameter
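For the calculate tool above, the extracted schema looks roughly like the following (illustrative; exact descriptions depend on how the docstring is parsed):
{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a mathematical expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A mathematical expression to evaluate (e.g. \"2 + 2 * 3\")"
                }
            },
            "required": ["expression"]
        }
    }
}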

StatefulToolEnv

For tools that require per-rollout state (sandbox IDs, database connections, session handles).
class MySandboxEnv(vf.StatefulToolEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Register tool with hidden argument
        self.add_tool(self.run_code, args_to_skip=["session_id"])
    
    async def setup_state(self, state, **kwargs):
        # Initialize per-rollout resources
        state["session_id"] = await create_session()
        return await super().setup_state(state, **kwargs)
    
    def update_tool_args(self, tool_name, tool_args, messages, state, **kwargs):
        # Inject state into tool calls
        if tool_name == "run_code":
            tool_args["session_id"] = state["session_id"]
        return tool_args
    
    async def run_code(self, code: str, session_id: str) -> str:
        """Execute code in the sandbox."""
        return await execute_in_session(session_id, code)
Pattern:
  1. Add tools with args_to_skip for hidden parameters
  2. Initialize state in setup_state()
  3. Inject state values in update_tool_args()
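Once those pieces are in place, construction looks like any other tool environment (illustrative values):
env = MySandboxEnv(dataset=dataset, rubric=rubric, max_turns=8)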

SandboxEnv

Provides containerized bash execution using Prime Intellect’s Sandboxes.
env = vf.SandboxEnv(
    dataset=dataset,
    rubric=rubric,
    sandbox_name="my-sandbox",
    docker_image="python:3.11-slim",
    start_command="tail -f /dev/null",
    cpu_cores=2,
    memory_gb=4,
    disk_size_gb=10,
    timeout_minutes=60,
    timeout_per_command_seconds=30,
    environment_vars={"API_KEY": "..."},
    labels=["experiment-1", "math-tasks"],  # optional categorization
)
Built-in tool:
  • bash(command: str) - Execute shell commands in the sandbox
Lifecycle:
  • Sandboxes are created in setup_state() (one per rollout)
  • Destroyed by cleanup handlers after each rollout
  • Put all setup logic in start_command; sandbox startup is not awaited until the first tool call (see the sketch below)
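For example, package installation can be folded into start_command so it overlaps with the start of the rollout (illustrative):
env = vf.SandboxEnv(
    dataset=dataset,
    rubric=rubric,
    docker_image="python:3.11-slim",
    # do setup work in start_command, then keep the container alive
    start_command="pip install requests && tail -f /dev/null",
)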

PythonEnv

Extends SandboxEnv with a persistent Python REPL.
env = vf.PythonEnv(
    dataset=dataset,
    rubric=rubric,
    packages=["numpy", "pandas"],  # auto-installed in sandbox
)
Built-in tool:
  • python(code: str) - Execute Python code in the persistent REPL

MCPEnv

Integrates with MCP (Model Context Protocol) servers.
mcp_servers = [
    {
        "name": "fetch",
        "command": "uvx",
        "args": ["mcp-server-fetch"],
    },
]

env = vf.MCPEnv(
    mcp_servers=mcp_servers,
    dataset=dataset,
    rubric=rubric,
)
Features:
  • Automatically discovers and exposes MCP server tools
  • Manages server lifecycle
  • Supports multiple concurrent MCP servers

Base Environment Class

Constructor Parameters

All environments accept these common parameters:
Environment(
    dataset: Dataset | DatasetBuilder | None = None,
    eval_dataset: Dataset | DatasetBuilder | None = None,
    system_prompt: str | None = None,
    few_shot: Messages | None = None,
    parser: Parser | None = None,
    rubric: Rubric | None = None,
    sampling_args: SamplingArgs | None = None,
    max_workers: int = 512,
    env_id: str | None = None,
    env_args: dict | None = None,
    max_seq_len: int | None = None,
    score_rollouts: bool = True,
    pass_threshold: float = 0.5,
)
Key parameters:
  • dataset / eval_dataset - Training and evaluation datasets (can be DatasetBuilder for lazy loading)
  • system_prompt - Prepended to all prompts as a system message
  • few_shot - Example messages inserted after system prompt
  • parser - For extracting structured output (e.g., vf.XMLParser)
  • rubric - Reward functions and scoring logic
  • sampling_args - Default generation parameters (temperature, top_p, etc.)
  • max_seq_len - Maximum sequence length for tokenization
  • score_rollouts - Whether to score rollouts (disable for pure generation)
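A sketch combining several of these parameters (the tag names, sampling values, and XMLParser fields argument follow common usage; treat them as assumptions):
env = vf.SingleTurnEnv(
    dataset=dataset,
    rubric=rubric,
    system_prompt="Answer inside <answer> tags.",
    parser=vf.XMLParser(fields=["answer"]),
    sampling_args={"temperature": 0.7, "max_tokens": 1024},
)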

Core Methods

Generation

from pathlib import Path

# Asynchronous generation
results = await env.generate(
    inputs=dataset,
    client=client,
    model="gpt-4",
    sampling_args={"temperature": 0.7},
    max_concurrent=10,
    save_results=True,
    results_path=Path("./results")
)

# Synchronous wrapper
results = env.generate_sync(inputs=dataset, client=client, model="gpt-4")
Returns: GenerateOutputs with outputs (list of RolloutOutput) and metadata
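To inspect the result, iterate over outputs; the reward attribute shown here is an assumption about RolloutOutput's fields:
for rollout in results.outputs:
    print(rollout.reward)  # assumed field on RolloutOutput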

Evaluation

# Evaluate on eval_dataset
results = await env.evaluate(
    client=client,
    model="gpt-4",
    num_examples=100,
    rollouts_per_example=4,
    save_results=True
)

# Synchronous wrapper
results = env.evaluate_sync(client=client, model="gpt-4", num_examples=10)

Dataset Access

# Get datasets (triggers lazy loading if using DatasetBuilder)
train_ds = env.get_dataset(n=100, seed=42)
eval_ds = env.get_eval_dataset(n=50)

Environment Groups

EnvGroup combines multiple environments for multi-task training:
math_env = vf.SingleTurnEnv(dataset=math_data, rubric=math_rubric)
code_env = vf.ToolEnv(dataset=code_data, tools=[execute_code], rubric=code_rubric)
reasoning_env = MyGameEnv(dataset=reasoning_data, rubric=reasoning_rubric)  # a MultiTurnEnv subclass (see above)

combined = vf.EnvGroup(
    envs=[math_env, code_env, reasoning_env],
    env_names=["math", "code", "reasoning"],  # optional
)
Behavior:
  • Concatenates all sub-environment datasets
  • Routes each rollout to the appropriate environment via the task column
  • Aggregates metrics across all environments
Environment groups are particularly useful for curriculum learning and multi-task RL training where you want to train a single model across diverse task types.
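Concretely, each row in the concatenated dataset carries a task value matching one of env_names; a row destined for the math environment would look roughly like:
{"prompt": [{"role": "user", "content": "What is 2+2?"}], "answer": "4", "task": "math"}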

Advanced Customization

Custom Stop Conditions

Define custom termination logic with the @vf.stop decorator:
class MyEnv(vf.MultiTurnEnv):
    @vf.stop(priority=10)  # Higher priority runs first
    async def answer_submitted(self, state: vf.State) -> bool:
        completion = state.get("completion", [])
        if not completion:
            return False
        return "FINAL ANSWER:" in completion[-1].get("content", "")

Resource Management

Use setup_state() together with the lifecycle decorators for per-rollout and environment-level resource management:
class MyEnv(vf.MultiTurnEnv):
    async def setup_state(self, state: vf.State, **kwargs) -> vf.State:
        """Per-rollout initialization."""
        state["game_id"] = await create_game()  # illustrative helper
        return await super().setup_state(state, **kwargs)
    
    @vf.cleanup
    async def save_game_log(self, state: vf.State):
        """Per-rollout cleanup."""
        await save_log(state["game_id"])
    
    @vf.teardown
    async def close_connections(self):
        """Environment-level teardown."""
        await self.db.close()
Cleanup methods must be idempotent (safe to call multiple times) and handle errors gracefully to ensure cleanup completes even when resources are in unexpected states.
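A minimal sketch of an idempotent cleanup handler (delete_sandbox is a hypothetical helper):
class MySandboxEnv(vf.MultiTurnEnv):
    @vf.cleanup
    async def destroy_sandbox(self, state: vf.State):
        # pop() makes repeated invocations a no-op
        sandbox_id = state.pop("sandbox_id", None)
        if sandbox_id is None:
            return
        try:
            await delete_sandbox(sandbox_id)  # hypothetical helper
        except Exception:
            pass  # tolerate already-destroyed resources so teardown completes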

Signaling Early Termination

Set state["final_env_response"] to end the rollout with a final environment message, skipping any further model turns:
async def env_response(self, messages: vf.Messages, state: vf.State) -> vf.Messages:
    if check_game_over(state):
        final_msg = [{"role": "user", "content": f"Game over! Score: {state['score']}"}]
        state["final_env_response"] = final_msg
        return final_msg
    # Normal response logic...

Integration Examples

TextArena Integration

Wrapper for text-based game environments:
env = vf.TextArenaEnv(
    game_name="rock_paper_scissors",
    num_players=2,
    dataset=dataset,
    rubric=rubric
)

ReasoningGym Integration

Procedural reasoning tasks:
from verifiers.envs.integrations import DatasetSpec

env = vf.ReasoningGymEnv(
    dataset_spec=DatasetSpec(
        name="sorting",
        num_samples=100,
        difficulty="hard"
    ),
    rubric=rubric
)

Browser Automation

Browserbase integration with DOM or vision-based control:
# DOM mode (natural language browser control)
env = vf.BrowserEnv(
    mode="dom",
    dataset=dataset,
    rubric=rubric
)

# CUA mode (coordinate-based vision control)
env = vf.BrowserEnv(
    mode="cua",
    use_sandbox=True,  # auto-deploy CUA server in sandbox
    dataset=dataset,
    rubric=rubric
)
See the Integrations and Experimental Environments section in the main environments guide for more details on third-party integrations.