ToolEnv

Environment for tasks where the model can call Python functions as tools.

Overview

ToolEnv enables LLMs to call Python functions with all arguments exposed to the model. Key features:
  • Stateless tools: Each tool call is independent and idempotent
  • Automatic schema generation: Function signatures are converted to tool definitions
  • Error handling: Configurable error formatting and stop-on-error behavior
  • Tool metrics: Automatic tracking of tool call counts
For tools requiring per-rollout state (e.g., sandbox IDs, database connections), use StatefulToolEnv instead.

Inheritance

Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv

Constructor

ToolEnv(
    tools: list[Callable] | None = None,
    max_turns: int = 10,
    error_formatter: Callable[[Exception], str] = lambda e: f"{e}",
    stop_errors: list[type[Exception]] | None = None,
    **kwargs
)

Parameters

tools
list[Callable] | None
List of Python functions to expose as tools. Function signatures and docstrings are used to generate tool schemas.
max_turns
int
default: 10
Maximum number of turns before stopping.
error_formatter
Callable[[Exception], str]
default: lambda e: f"{e}"
Function to format exceptions into error messages shown to the model.
stop_errors
list[type[Exception]] | None
List of exception types that should stop the rollout. When one of these is raised, the rollout ends and the error is recorded as a ToolParseError or ToolCallError.
All other parameters are inherited from MultiTurnEnv.

Core Methods

call_tool

async def call_tool(
    tool_name: str,
    tool_args: dict,
    tool_call_id: str,
    **kwargs
) -> ToolMessage
Execute a tool and return the result as a ToolMessage. Override to customize tool execution.
tool_name
str
Name of the tool to call.
tool_args
dict
Arguments parsed from the model’s tool call.
tool_call_id
str
Unique ID for this tool call.
Returns: ToolMessage - Message containing tool result or error.
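The lookup-and-format flow can be pictured with a plain-Python sketch (hypothetical helper names; the real implementation lives inside ToolEnv and returns a proper ToolMessage object rather than a dict):

```python
# Illustrative sketch of tool dispatch: look up the tool by name, call it
# with the parsed arguments, and route any exception through the
# error_formatter so the model sees a readable error string.
def dispatch_tool(tools, tool_name, tool_args, error_formatter=lambda e: f"{e}"):
    registry = {fn.__name__: fn for fn in tools}
    try:
        if tool_name not in registry:
            raise ValueError(f"Unknown tool: {tool_name}")
        result = registry[tool_name](**tool_args)
        return {"role": "tool", "content": str(result)}
    except Exception as e:
        return {"role": "tool", "content": error_formatter(e)}

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

print(dispatch_tool([add], "add", {"a": 2.0, "b": 3.0}))
# {'role': 'tool', 'content': '5.0'}
print(dispatch_tool([add], "nope", {}))
# {'role': 'tool', 'content': 'Unknown tool: nope'}
```

Note that errors are returned to the model as messages rather than raised, unless the exception type is listed in stop_errors.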

env_response

async def env_response(
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> vf.Messages
Process tool calls from the model’s response. Implemented by ToolEnv - do not override unless you need custom behavior.
messages
vf.Messages
Conversation history including model’s tool calls.
state
vf.State
Current rollout state.
Returns: vf.Messages - List of ToolMessage objects with results.

add_tool

def add_tool(tool: Callable)
Dynamically add a tool to the environment.
tool
Callable
Python function to add as a tool.

remove_tool

def remove_tool(tool: Callable)
Remove a tool from the environment.
tool
Callable
Python function to remove.

Stop Conditions

no_tools_called

@vf.stop
async def no_tools_called(state: vf.State) -> bool
Stops if the model’s last message was an assistant message with no tool calls. Inherits all stop conditions from MultiTurnEnv.
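The check itself is simple; a stand-alone Python version of the predicate over OpenAI-style message dicts (an illustration, not the library's actual code) might look like:

```python
# Illustrative version of the no_tools_called predicate: the rollout stops
# once the assistant replies without requesting any tool calls.
def last_turn_had_no_tool_calls(messages: list[dict]) -> bool:
    if not messages:
        return False
    last = messages[-1]
    return last.get("role") == "assistant" and not last.get("tool_calls")

print(last_turn_had_no_tool_calls([{"role": "assistant", "content": "The answer is 45."}]))
# True
```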

Built-in Rubric

ToolEnv includes ToolMonitorRubric which tracks:
  • total_tool_calls: Total number of tool calls made
  • {tool_name}_calls: Number of calls to each specific tool
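These metrics amount to a tally over the assistant's tool calls. A simplified stand-alone sketch of the bookkeeping (ToolMonitorRubric computes this for you automatically):

```python
from collections import Counter

# Simplified sketch of the tracked metrics: count calls per tool name plus
# an overall total, over OpenAI-style assistant messages.
def tool_call_metrics(messages: list[dict]) -> dict:
    counts = Counter()
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            counts[f"{call['function']['name']}_calls"] += 1
    metrics = dict(counts)
    metrics["total_tool_calls"] = sum(counts.values())
    return metrics

messages = [
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "add"}}, {"function": {"name": "multiply"}}]},
    {"role": "assistant", "tool_calls": [{"function": {"name": "add"}}]},
]
print(tool_call_metrics(messages))
# {'add_calls': 2, 'multiply_calls': 1, 'total_tool_calls': 3}
```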

Example Usage

Basic Calculator

import verifiers as vf

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def load_environment():
    # Create dataset
    dataset = vf.Environment.make_dataset(
        [
            {"question": "What is (10 + 5) * 3?", "answer": "45"},
            {"question": "What is 100 / 4?", "answer": "25"},
        ]
    )
    
    def correct_answer(answer: str, completion: vf.Messages) -> float:
        """Check if final answer matches expected answer."""
        completion_text = str(completion)
        return 1.0 if answer in completion_text else 0.0
    
    return vf.ToolEnv(
        tools=[add, multiply, divide],
        dataset=dataset,
        rubric=vf.Rubric(correct_answer),
        system_prompt="Use the available tools to solve the math problem.",
        max_turns=5
    )

# Usage (evaluate is a coroutine, so await it inside an async context)
env = load_environment()
results = await env.evaluate(
    client=vf.ClientConfig(provider="openai", api_key="sk-..."),
    model="gpt-4",
    num_examples=2
)

print(f"Accuracy: {results['metadata']['avg_reward']}")
print(f"Avg tool calls: {results['metadata']['avg_total_tool_calls']}")

With Error Handling

import verifiers as vf

class DivisionError(Exception):
    """Custom error for division problems."""
    pass

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise DivisionError("Cannot divide by zero")
    return a / b

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "What is 10 / 0?", "answer": "error"}]
    )
    
    def error_formatter(e: Exception) -> str:
        """Format errors for the model."""
        if isinstance(e, DivisionError):
            return "Error: Division by zero is not allowed."
        return f"Error: {str(e)}"
    
    def handles_error(completion: vf.Messages) -> float:
        """Reward if model acknowledges the error."""
        text = str(completion).lower()
        return 1.0 if "error" in text or "cannot" in text else 0.0
    
    return vf.ToolEnv(
        tools=[divide],
        dataset=dataset,
        rubric=vf.Rubric(handles_error),
        error_formatter=error_formatter,
        # Don't stop on DivisionError, let model handle it
        stop_errors=[],  # Empty list = no errors cause stop
        max_turns=3
    )

With Stop Errors

import verifiers as vf

class CriticalError(Exception):
    pass

def risky_operation(value: int) -> str:
    if value < 0:
        raise CriticalError("Negative values not allowed")
    return f"Result: {value * 2}"

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "Process the value -5"}]
    )
    
    return vf.ToolEnv(
        tools=[risky_operation],
        dataset=dataset,
        rubric=vf.Rubric(lambda completion: 0.0),
        # Stop rollout immediately if CriticalError occurs
        stop_errors=[CriticalError],
        max_turns=5
    )

# When CriticalError is raised, the rollout stops and
# state["error"] contains a ToolCallError

Database Query Tools

import verifiers as vf
import sqlite3

def query_users(name: str) -> list[dict]:
    """Query users by name."""
    # Stateless query (creates new connection each time)
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM users WHERE name LIKE ?", (f"%{name}%",))
    results = [{"id": row[0], "name": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def query_orders(user_id: int) -> list[dict]:
    """Query orders for a user."""
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM orders WHERE user_id = ?", (user_id,))
    results = [{"id": row[0], "total": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "How many orders does user 'Alice' have?", "answer": "3"},
        ]
    )
    
    def correct_count(answer: str, completion: vf.Messages) -> float:
        return 1.0 if answer in str(completion) else 0.0
    
    return vf.ToolEnv(
        tools=[query_users, query_orders],
        dataset=dataset,
        rubric=vf.Rubric(correct_count),
        system_prompt="Use the database tools to answer questions.",
        max_turns=10
    )

API Client Tools

import verifiers as vf
import httpx

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    # Stateless API call; params= handles URL encoding
    response = httpx.get("https://api.weather.com/v1/current", params={"city": city})
    return response.json()

def get_forecast(city: str, days: int = 3) -> dict:
    """Get weather forecast for a city."""
    response = httpx.get(
        "https://api.weather.com/v1/forecast",
        params={"city": city, "days": days},
    )
    return response.json()

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "Will it rain in London tomorrow?", "answer": "yes"},
        ]
    )
    
    def mentions_rain(answer: str, completion: vf.Messages) -> float:
        text = str(completion).lower()
        answer_lower = answer.lower()
        return 1.0 if answer_lower in text else 0.0
    
    return vf.ToolEnv(
        tools=[get_weather, get_forecast],
        dataset=dataset,
        rubric=vf.Rubric(mentions_rain),
        max_turns=5
    )

Dynamic Tool Addition

import verifiers as vf

def base_tool() -> str:
    return "base"

# dataset and reward_fn are assumed to be defined as in earlier examples
env = vf.ToolEnv(
    tools=[base_tool],
    dataset=dataset,
    rubric=vf.Rubric(reward_fn)
)

# Add tool dynamically
def new_tool(x: int) -> int:
    """New tool added at runtime."""
    return x * 2

env.add_tool(new_tool)

# Remove tool
env.remove_tool(base_tool)

Tool Schema Generation

Tools are automatically converted to schemas using their function signatures and docstrings:
def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query.
    
    Args:
        query: Search query string
        max_results: Maximum number of results to return
    """
    return ["result1", "result2"]
Generated schema:
{
  "name": "search",
  "description": "Search for documents matching the query.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "Search query string"},
      "max_results": {"type": "integer", "description": "Maximum number of results to return", "default": 10}
    },
    "required": ["query"]
  }
}
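To give a rough idea of how such a generator can work, here is a simplified stand-alone sketch built on the standard library's inspect module (the real generator also parses Args: entries from the docstring to fill in per-parameter descriptions, which this sketch omits):

```python
import inspect

# Map Python annotations to JSON Schema type names (simplified subset)
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def basic_schema(fn) -> dict:
    """Build a minimal tool schema from a function's signature and docstring."""
    props, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        prop = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default -> required parameter
        else:
            prop["default"] = param.default
        props[name] = prop
    return {
        "name": fn.__name__,
        # First docstring line becomes the tool description
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        "parameters": {"type": "object", "properties": props, "required": required},
    }

def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query."""
    return ["result1", "result2"]

print(basic_schema(search)["parameters"]["required"])
# ['query']
```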

Common Patterns

Stateless Tools Only

All tool calls should be independent:
def good_tool(x: int) -> int:
    # No shared state, idempotent
    return x * 2

# Avoid global state
state = {}
def bad_tool(x: int) -> int:
    state["count"] = state.get("count", 0) + 1  # Bad!
    return x * state["count"]
For stateful tools, use StatefulToolEnv.

Custom Error Messages

Format errors to guide the model:
def error_formatter(e: Exception) -> str:
    if isinstance(e, ValueError):
        return f"Invalid input: {e}. Please provide a valid number."
    elif isinstance(e, KeyError):
        return f"Key not found: {e}. Available keys: X, Y, Z."
    return f"Error: {e}"

env = vf.ToolEnv(
    tools=[...],
    error_formatter=error_formatter,
    ...
)

Reward Based on Tool Usage

def efficiency_reward(state: vf.State) -> float:
    """Reward fewer tool calls."""
    metrics = state["metrics"]
    num_calls = metrics.get("total_tool_calls", 0)
    if state["reward"] == 1.0:  # Correct answer
        return 1.0 / (1 + num_calls)  # Fewer calls = higher reward
    return 0.0

When to Use

Use ToolEnv for:
  • Stateless function calling (calculators, converters, queries)
  • API clients (each call is independent)
  • Read-only database queries
  • File reading operations
  • Any idempotent tool
Use StatefulToolEnv for:
  • Tools requiring per-rollout state (sandbox IDs, sessions)
  • Database transactions
  • File writing in isolated environments
  • Any tool where state must persist across calls

See Also

  • StatefulToolEnv - tools that need per-rollout state
  • MultiTurnEnv - base class providing turn handling and stop conditions