# ToolEnv

Environment for tasks where the model can call Python functions as tools.
## Overview

ToolEnv enables LLMs to call Python functions with all arguments exposed to the model. Key features:

- **Stateless tools**: Each tool call is independent and idempotent
- **Automatic schema generation**: Function signatures and docstrings are converted to tool definitions
- **Error handling**: Configurable error formatting and stop-on-error behavior
- **Tool metrics**: Automatic tracking of tool call counts

For tools requiring per-rollout state (e.g., sandbox IDs, database connections), use `StatefulToolEnv` instead.
## Inheritance

```
Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv
```
Constructor
ToolEnv(
tools: list[Callable] | None = None,
max_turns: int = 10,
error_formatter: Callable[[Exception], str] = lambda e: f"{e}",
stop_errors: list[type[Exception]] | None = None,
**kwargs
)
### Parameters

- `tools` (`list[Callable] | None`, default `None`): List of Python functions to expose as tools. Function signatures and docstrings are used to generate tool schemas.
- `max_turns` (`int`, default `10`): Maximum number of turns before stopping.
- `error_formatter` (`Callable[[Exception], str]`, default `lambda e: f"{e}"`): Function to format exceptions into error messages shown to the model.
- `stop_errors` (`list[type[Exception]] | None`, default `None`): List of exception types that should stop the rollout (raising `ToolParseError` or `ToolCallError`).

All other parameters are inherited from `MultiTurnEnv`.
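To make the `error_formatter` contract concrete, here is a minimal pure-Python sketch. The `run_tool` helper is hypothetical (not part of the library); it only illustrates that the formatter receives the raised exception and returns the string shown to the model:

```python
# Hypothetical helper mimicking how a tool error reaches the model;
# not the library's actual code.
def run_tool(tool, args: dict, error_formatter=lambda e: f"{e}") -> str:
    """Run a tool; on failure, return the formatted error string."""
    try:
        return str(tool(**args))
    except Exception as e:
        return error_formatter(e)

def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def friendly(e: Exception) -> str:
    return f"Error: {e}. Check your arguments and try again."

ok = run_tool(divide, {"a": 10, "b": 2})            # "5.0"
err = run_tool(divide, {"a": 1, "b": 0}, friendly)  # formatted error string
```

With the default formatter, the model would see the bare exception text; a custom formatter lets you add recovery hints.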
## Core Methods

### call_tool

```python
async def call_tool(
    tool_name: str,
    tool_args: dict,
    tool_call_id: str,
    **kwargs
) -> ToolMessage
```

Execute a tool and return the result as a `ToolMessage`. Override to customize tool execution.

- `tool_name` (`str`): Name of the tool to call.
- `tool_args` (`dict`): Arguments parsed from the model's tool call.
- `tool_call_id` (`str`): Unique ID for this tool call.

**Returns:** `ToolMessage` - Message containing the tool result or an error.
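Conceptually, this method looks up the named function, invokes it with the parsed arguments, and wraps the outcome as a tool message. A simplified pure-Python sketch of that dispatch (the `dispatch` name and the dict shape mirror OpenAI-style tool messages and are illustrative, not the library's internals):

```python
# Illustrative stand-in for call_tool's dispatch logic.
def dispatch(tools: dict, tool_name: str, tool_args: dict, tool_call_id: str) -> dict:
    """Run the named tool; capture any exception as the message content."""
    try:
        content = str(tools[tool_name](**tool_args))
    except Exception as e:
        content = f"Error: {e}"
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}

def add(a: float, b: float) -> float:
    return a + b

msg = dispatch({"add": add}, "add", {"a": 2.0, "b": 3.0}, "call_1")
# msg["content"] == "5.0"
```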
### env_response

```python
async def env_response(
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> vf.Messages
```

Process tool calls from the model's response. Implemented by `ToolEnv` - do not override unless you need custom behavior.

- `messages` (`vf.Messages`): Conversation history including the model's tool calls.

**Returns:** `vf.Messages` - List of `ToolMessage` objects with results.
### add_tool

```python
def add_tool(tool: Callable)
```

Dynamically add a tool to the environment.

- `tool` (`Callable`): Python function to add as a tool.

### remove_tool

```python
def remove_tool(tool: Callable)
```

Remove a tool from the environment.

- `tool` (`Callable`): Python function to remove.
## Stop Conditions

```python
@vf.stop
async def no_tools_called(state: vf.State) -> bool
```

Stops if the model's last message was an assistant message with no tool calls.

Inherits all stop conditions from `MultiTurnEnv`.
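The underlying check is simple; a sketch over OpenAI-style message dicts (the function name and message shape are illustrative, not the library's implementation):

```python
def last_turn_has_no_tool_calls(messages: list[dict]) -> bool:
    """True when the last message is an assistant turn without tool calls."""
    if not messages:
        return False
    last = messages[-1]
    return last.get("role") == "assistant" and not last.get("tool_calls")

chat = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "5"},
    {"role": "assistant", "content": "The answer is 5."},
]
# The final assistant turn makes no tool calls, so the rollout would stop here.
```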
## Built-in Rubric

ToolEnv includes `ToolMonitorRubric`, which tracks:

- `total_tool_calls`: Total number of tool calls made
- `{tool_name}_calls`: Number of calls to each specific tool
## Example Usage

### Basic Calculator

```python
import verifiers as vf

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def load_environment():
    # Create dataset
    dataset = vf.Environment.make_dataset(
        [
            {"question": "What is (10 + 5) * 3?", "answer": "45"},
            {"question": "What is 100 / 4?", "answer": "25"},
        ]
    )

    def correct_answer(answer: str, completion: vf.Messages) -> float:
        """Check if the final answer matches the expected answer."""
        completion_text = str(completion)
        return 1.0 if answer in completion_text else 0.0

    return vf.ToolEnv(
        tools=[add, multiply, divide],
        dataset=dataset,
        rubric=vf.Rubric(correct_answer),
        system_prompt="Use the available tools to solve the math problem.",
        max_turns=5,
    )

# Usage (inside an async context)
env = load_environment()
results = await env.evaluate(
    client=vf.ClientConfig(provider="openai", api_key="sk-..."),
    model="gpt-4",
    num_examples=2,
)
print(f"Accuracy: {results['metadata']['avg_reward']}")
print(f"Avg tool calls: {results['metadata']['avg_total_tool_calls']}")
```
### With Error Handling

```python
import verifiers as vf

class DivisionError(Exception):
    """Custom error for division problems."""
    pass

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise DivisionError("Cannot divide by zero")
    return a / b

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "What is 10 / 0?", "answer": "error"}]
    )

    def error_formatter(e: Exception) -> str:
        """Format errors for the model."""
        if isinstance(e, DivisionError):
            return "Error: Division by zero is not allowed."
        return f"Error: {str(e)}"

    def handles_error(completion: vf.Messages) -> float:
        """Reward if the model acknowledges the error."""
        text = str(completion).lower()
        return 1.0 if "error" in text or "cannot" in text else 0.0

    return vf.ToolEnv(
        tools=[divide],
        dataset=dataset,
        rubric=vf.Rubric(handles_error),
        error_formatter=error_formatter,
        # Don't stop on DivisionError; let the model handle it
        stop_errors=[],  # Empty list = no errors cause a stop
        max_turns=3,
    )
```
### With Stop Errors

```python
import verifiers as vf

class CriticalError(Exception):
    pass

def risky_operation(value: int) -> str:
    if value < 0:
        raise CriticalError("Negative values not allowed")
    return f"Result: {value * 2}"

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "Process the value -5"}]
    )

    return vf.ToolEnv(
        tools=[risky_operation],
        dataset=dataset,
        rubric=vf.Rubric(lambda completion: 0.0),
        # Stop the rollout immediately if CriticalError occurs
        stop_errors=[CriticalError],
        max_turns=5,
    )

# When CriticalError is raised, the rollout stops and
# state["error"] contains a ToolCallError
```
### Database Queries

```python
import verifiers as vf
import sqlite3

def query_users(name: str) -> list[dict]:
    """Query users by name."""
    # Stateless query (creates a new connection each time)
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM users WHERE name LIKE ?", (f"%{name}%",))
    results = [{"id": row[0], "name": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def query_orders(user_id: int) -> list[dict]:
    """Query orders for a user."""
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM orders WHERE user_id = ?", (user_id,))
    results = [{"id": row[0], "total": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "How many orders does user 'Alice' have?", "answer": "3"},
        ]
    )

    def correct_count(answer: str, completion: vf.Messages) -> float:
        return 1.0 if answer in str(completion) else 0.0

    return vf.ToolEnv(
        tools=[query_users, query_orders],
        dataset=dataset,
        rubric=vf.Rubric(correct_count),
        system_prompt="Use the database tools to answer questions.",
        max_turns=10,
    )
```
### API Clients

```python
import verifiers as vf
import httpx

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    # Stateless API call
    response = httpx.get(f"https://api.weather.com/v1/current?city={city}")
    return response.json()

def get_forecast(city: str, days: int = 3) -> dict:
    """Get weather forecast for a city."""
    response = httpx.get(
        f"https://api.weather.com/v1/forecast?city={city}&days={days}"
    )
    return response.json()

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "Will it rain in London tomorrow?", "answer": "yes"},
        ]
    )

    def mentions_rain(answer: str, completion: vf.Messages) -> float:
        text = str(completion).lower()
        answer_lower = answer.lower()
        return 1.0 if answer_lower in text else 0.0

    return vf.ToolEnv(
        tools=[get_weather, get_forecast],
        dataset=dataset,
        rubric=vf.Rubric(mentions_rain),
        max_turns=5,
    )
```
### Dynamic Tool Management

```python
import verifiers as vf

def base_tool() -> str:
    return "base"

# Assumes `dataset` and `reward_fn` are defined elsewhere
env = vf.ToolEnv(
    tools=[base_tool],
    dataset=dataset,
    rubric=vf.Rubric(reward_fn),
)

# Add a tool dynamically
def new_tool(x: int) -> int:
    """New tool added at runtime."""
    return x * 2

env.add_tool(new_tool)

# Remove a tool
env.remove_tool(base_tool)
```
## Tool Schemas

Tools are automatically converted to schemas using function signatures and docstrings:

```python
def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query.

    Args:
        query: Search query string
        max_results: Maximum number of results to return
    """
    return ["result1", "result2"]
```

This generates the schema:

```json
{
  "name": "search",
  "description": "Search for documents matching the query.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "Search query string"},
      "max_results": {"type": "integer", "description": "Maximum number of results to return", "default": 10}
    },
    "required": ["query"]
  }
}
```
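For intuition, this kind of schema generation can be sketched with the standard `inspect` module. This is a simplified illustration, not the library's actual implementation: it handles only basic annotation types and skips parsing per-parameter descriptions from the docstring's Args section.

```python
import inspect

# Map Python annotations to JSON Schema type names (basic types only)
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def make_schema(fn) -> dict:
    """Build a tool schema from a function's signature and docstring."""
    sig = inspect.signature(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        prop = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is not inspect.Parameter.empty:
            prop["default"] = param.default  # has a default, so not required
        else:
            required.append(name)
        props[name] = prop
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip().split("\n")[0],
        "parameters": {"type": "object", "properties": props, "required": required},
    }

def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query."""
    return ["result1", "result2"]

schema = make_schema(search)
# schema["parameters"]["required"] == ["query"]
```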
## Common Patterns

### Keep Tools Stateless

All tool calls should be independent:

```python
def good_tool(x: int) -> int:
    # No shared state, idempotent
    return x * 2

# Avoid global state
state = {}

def bad_tool(x: int) -> int:
    state["count"] = state.get("count", 0) + 1  # Bad!
    return x * state["count"]
```

For stateful tools, use `StatefulToolEnv`.
### Custom Error Messages

Format errors to guide the model:

```python
def error_formatter(e: Exception) -> str:
    if isinstance(e, ValueError):
        return f"Invalid input: {e}. Please provide a valid number."
    elif isinstance(e, KeyError):
        return f"Key not found: {e}. Available keys: X, Y, Z."
    return f"Error: {e}"

env = vf.ToolEnv(
    tools=[...],
    error_formatter=error_formatter,
    ...
)
```
### Efficiency Rewards

Use the metrics tracked by `ToolMonitorRubric` in reward functions, for example to reward fewer tool calls:

```python
def efficiency_reward(state: vf.State) -> float:
    """Reward fewer tool calls."""
    metrics = state["metrics"]
    num_calls = metrics.get("total_tool_calls", 0)
    if state["reward"] == 1.0:  # Correct answer
        return 1.0 / (1 + num_calls)  # Fewer calls = higher reward
    return 0.0
```
## When to Use

Use ToolEnv for:

- Stateless function calling (calculators, converters, queries)
- API clients (each call is independent)
- Read-only database queries
- File reading operations
- Any idempotent tool

Use StatefulToolEnv for:

- Tools requiring per-rollout state (sandbox IDs, sessions)
- Database transactions
- File writing in isolated environments
- Any tool where state must persist across calls
## See Also