# ToolEnv

Environment for tasks where the model can call Python functions as tools.
## Overview

ToolEnv enables LLMs to call Python functions with all arguments exposed to the model. Key features:

- **Stateless tools**: Each tool call is independent and idempotent
- **Automatic schema generation**: Function signatures and docstrings are converted to tool definitions
- **Error handling**: Configurable error formatting and stop-on-error behavior
- **Tool metrics**: Automatic tracking of tool call counts

For tools requiring per-rollout state (e.g., sandbox IDs, database connections), use `StatefulToolEnv` instead.
## Inheritance

```
Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv
```
Constructor
ToolEnv(
tools: list[Callable] | None = None,
max_turns: int = 10,
error_formatter: Callable[[Exception], str] = lambda e: f"{e}",
stop_errors: list[type[Exception]] | None = None,
**kwargs
)
### Parameters

- `tools` (`list[Callable] | None`, default `None`): List of Python functions to expose as tools. Function signatures and docstrings are used to generate tool schemas.
- `max_turns` (`int`, default `10`): Maximum number of turns before stopping.
- `error_formatter` (`Callable[[Exception], str]`, default `lambda e: f"{e}"`): Function to format exceptions into error messages shown to the model.
- `stop_errors` (`list[type[Exception]] | None`, default `None`): List of exception types that should stop the rollout (raising `ToolParseError` or `ToolCallError`).

All other parameters are inherited from `MultiTurnEnv`.
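To make the `error_formatter` contract concrete, here is a minimal pure-Python sketch. The `run_tool` helper is hypothetical (not part of the library); it only illustrates that the formatter receives the raised exception and returns the string shown to the model:

```python
# Hypothetical helper mimicking how a tool error reaches the model;
# not the library's actual code.
def run_tool(tool, args: dict, error_formatter=lambda e: f"{e}") -> str:
    """Run a tool; on failure, return the formatted error string."""
    try:
        return str(tool(**args))
    except Exception as e:
        return error_formatter(e)

def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def friendly(e: Exception) -> str:
    return f"Error: {e}. Check your arguments and try again."

ok = run_tool(divide, {"a": 10, "b": 2})            # "5.0"
err = run_tool(divide, {"a": 1, "b": 0}, friendly)  # formatted error string
```

With the default formatter, the model would see the bare exception text; a custom formatter lets you add recovery hints.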
## Core Methods

### call_tool

```python
async def call_tool(
    tool_name: str,
    tool_args: dict,
    tool_call_id: str,
    **kwargs
) -> ToolMessage
```

Execute a tool and return the result as a `ToolMessage`. Override to customize tool execution.

- `tool_name` (`str`): Name of the tool to call.
- `tool_args` (`dict`): Arguments parsed from the model's tool call.
- `tool_call_id` (`str`): Unique ID for this tool call.

**Returns:** `ToolMessage` - Message containing the tool result or an error.
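Conceptually, this method looks up the named function, invokes it with the parsed arguments, and wraps the outcome as a tool message. A simplified pure-Python sketch of that dispatch (the `dispatch` name and the dict shape mirror OpenAI-style tool messages and are illustrative, not the library's internals):

```python
# Illustrative stand-in for call_tool's dispatch logic.
def dispatch(tools: dict, tool_name: str, tool_args: dict, tool_call_id: str) -> dict:
    """Run the named tool; capture any exception as the message content."""
    try:
        content = str(tools[tool_name](**tool_args))
    except Exception as e:
        content = f"Error: {e}"
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}

def add(a: float, b: float) -> float:
    return a + b

msg = dispatch({"add": add}, "add", {"a": 2.0, "b": 3.0}, "call_1")
# msg["content"] == "5.0"
```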
### env_response

```python
async def env_response(
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> vf.Messages
```

Process tool calls from the model's response. Implemented by `ToolEnv` - do not override unless you need custom behavior.

- `messages` (`vf.Messages`): Conversation history including the model's tool calls.

**Returns:** `vf.Messages` - List of `ToolMessage` objects with results.
### add_tool

```python
def add_tool(tool: Callable)
```

Dynamically add a tool to the environment.

- `tool` (`Callable`): Python function to add as a tool.

### remove_tool

```python
def remove_tool(tool: Callable)
```

Remove a tool from the environment.

- `tool` (`Callable`): Python function to remove.
## Stop Conditions

```python
@vf.stop
async def no_tools_called(state: vf.State) -> bool
```

Stops if the model's last message was an assistant message with no tool calls.

Inherits all stop conditions from `MultiTurnEnv`.
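The underlying check is simple; a sketch over OpenAI-style message dicts (the function name and message shape are illustrative, not the library's implementation):

```python
def last_turn_has_no_tool_calls(messages: list[dict]) -> bool:
    """True when the last message is an assistant turn without tool calls."""
    if not messages:
        return False
    last = messages[-1]
    return last.get("role") == "assistant" and not last.get("tool_calls")

chat = [
    {"role": "user", "content": "What is 2 + 3?"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "5"},
    {"role": "assistant", "content": "The answer is 5."},
]
# The final assistant turn makes no tool calls, so the rollout would stop here.
```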
## Built-in Rubric

ToolEnv includes `ToolMonitorRubric`, which tracks:

- `total_tool_calls`: Total number of tool calls made
- `{tool_name}_calls`: Number of calls to each specific tool
## Example Usage

### Basic Calculator

```python
import verifiers as vf

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def load_environment():
    # Create dataset
    dataset = vf.Environment.make_dataset(
        [
            {"question": "What is (10 + 5) * 3?", "answer": "45"},
            {"question": "What is 100 / 4?", "answer": "25"},
        ]
    )

    def correct_answer(answer: str, completion: vf.Messages) -> float:
        """Check if the final answer matches the expected answer."""
        completion_text = str(completion)
        return 1.0 if answer in completion_text else 0.0

    return vf.ToolEnv(
        tools=[add, multiply, divide],
        dataset=dataset,
        rubric=vf.Rubric(correct_answer),
        system_prompt="Use the available tools to solve the math problem.",
        max_turns=5,
    )

# Usage (inside an async context)
env = load_environment()
results = await env.evaluate(
    client=vf.ClientConfig(provider="openai", api_key="sk-..."),
    model="gpt-4",
    num_examples=2,
)
print(f"Accuracy: {results['metadata']['avg_reward']}")
print(f"Avg tool calls: {results['metadata']['avg_total_tool_calls']}")
```
### With Error Handling

```python
import verifiers as vf

class DivisionError(Exception):
    """Custom error for division problems."""
    pass

def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    if b == 0:
        raise DivisionError("Cannot divide by zero")
    return a / b

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "What is 10 / 0?", "answer": "error"}]
    )

    def error_formatter(e: Exception) -> str:
        """Format errors for the model."""
        if isinstance(e, DivisionError):
            return "Error: Division by zero is not allowed."
        return f"Error: {str(e)}"

    def handles_error(completion: vf.Messages) -> float:
        """Reward if the model acknowledges the error."""
        text = str(completion).lower()
        return 1.0 if "error" in text or "cannot" in text else 0.0

    return vf.ToolEnv(
        tools=[divide],
        dataset=dataset,
        rubric=vf.Rubric(handles_error),
        error_formatter=error_formatter,
        # Don't stop on DivisionError; let the model handle it
        stop_errors=[],  # Empty list = no errors cause a stop
        max_turns=3,
    )
```
### With Stop Errors

```python
import verifiers as vf

class CriticalError(Exception):
    pass

def risky_operation(value: int) -> str:
    if value < 0:
        raise CriticalError("Negative values not allowed")
    return f"Result: {value * 2}"

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"question": "Process the value -5"}]
    )

    return vf.ToolEnv(
        tools=[risky_operation],
        dataset=dataset,
        rubric=vf.Rubric(lambda completion: 0.0),
        # Stop the rollout immediately if CriticalError occurs
        stop_errors=[CriticalError],
        max_turns=5,
    )

# When CriticalError is raised, the rollout stops and
# state["error"] contains a ToolCallError
```
### Database Queries

```python
import verifiers as vf
import sqlite3

def query_users(name: str) -> list[dict]:
    """Query users by name."""
    # Stateless query (creates a new connection each time)
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM users WHERE name LIKE ?", (f"%{name}%",))
    results = [{"id": row[0], "name": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def query_orders(user_id: int) -> list[dict]:
    """Query orders for a user."""
    conn = sqlite3.connect("users.db")
    cursor = conn.execute("SELECT * FROM orders WHERE user_id = ?", (user_id,))
    results = [{"id": row[0], "total": row[1]} for row in cursor.fetchall()]
    conn.close()
    return results

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "How many orders does user 'Alice' have?", "answer": "3"},
        ]
    )

    def correct_count(answer: str, completion: vf.Messages) -> float:
        return 1.0 if answer in str(completion) else 0.0

    return vf.ToolEnv(
        tools=[query_users, query_orders],
        dataset=dataset,
        rubric=vf.Rubric(correct_count),
        system_prompt="Use the database tools to answer questions.",
        max_turns=10,
    )
```
### API Clients

```python
import verifiers as vf
import httpx

def get_weather(city: str) -> dict:
    """Get current weather for a city."""
    # Stateless API call
    response = httpx.get(f"https://api.weather.com/v1/current?city={city}")
    return response.json()

def get_forecast(city: str, days: int = 3) -> dict:
    """Get weather forecast for a city."""
    response = httpx.get(
        f"https://api.weather.com/v1/forecast?city={city}&days={days}"
    )
    return response.json()

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"question": "Will it rain in London tomorrow?", "answer": "yes"},
        ]
    )

    def mentions_rain(answer: str, completion: vf.Messages) -> float:
        text = str(completion).lower()
        answer_lower = answer.lower()
        return 1.0 if answer_lower in text else 0.0

    return vf.ToolEnv(
        tools=[get_weather, get_forecast],
        dataset=dataset,
        rubric=vf.Rubric(mentions_rain),
        max_turns=5,
    )
```
### Dynamic Tool Management

```python
import verifiers as vf

def base_tool() -> str:
    return "base"

# Assumes `dataset` and `reward_fn` are defined elsewhere
env = vf.ToolEnv(
    tools=[base_tool],
    dataset=dataset,
    rubric=vf.Rubric(reward_fn),
)

# Add a tool dynamically
def new_tool(x: int) -> int:
    """New tool added at runtime."""
    return x * 2

env.add_tool(new_tool)

# Remove a tool
env.remove_tool(base_tool)
```
## Tool Schemas

Tools are automatically converted to schemas using function signatures and docstrings:

```python
def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query.

    Args:
        query: Search query string
        max_results: Maximum number of results to return
    """
    return ["result1", "result2"]
```

This generates the schema:

```json
{
  "name": "search",
  "description": "Search for documents matching the query.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "Search query string"},
      "max_results": {"type": "integer", "description": "Maximum number of results to return", "default": 10}
    },
    "required": ["query"]
  }
}
```
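For intuition, this kind of schema generation can be sketched with the standard `inspect` module. This is a simplified illustration, not the library's actual implementation: it handles only basic annotation types and skips parsing per-parameter descriptions from the docstring's Args section.

```python
import inspect

# Map Python annotations to JSON Schema type names (basic types only)
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def make_schema(fn) -> dict:
    """Build a tool schema from a function's signature and docstring."""
    sig = inspect.signature(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        prop = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is not inspect.Parameter.empty:
            prop["default"] = param.default  # has a default, so not required
        else:
            required.append(name)
        props[name] = prop
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip().split("\n")[0],
        "parameters": {"type": "object", "properties": props, "required": required},
    }

def search(query: str, max_results: int = 10) -> list[str]:
    """Search for documents matching the query."""
    return ["result1", "result2"]

schema = make_schema(search)
# schema["parameters"]["required"] == ["query"]
```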
## Common Patterns

### Keep Tools Stateless

All tool calls should be independent:

```python
def good_tool(x: int) -> int:
    # No shared state, idempotent
    return x * 2

# Avoid global state
state = {}

def bad_tool(x: int) -> int:
    state["count"] = state.get("count", 0) + 1  # Bad!
    return x * state["count"]
```

For stateful tools, use `StatefulToolEnv`.
### Custom Error Messages

Format errors to guide the model:

```python
def error_formatter(e: Exception) -> str:
    if isinstance(e, ValueError):
        return f"Invalid input: {e}. Please provide a valid number."
    elif isinstance(e, KeyError):
        return f"Key not found: {e}. Available keys: X, Y, Z."
    return f"Error: {e}"

env = vf.ToolEnv(
    tools=[...],
    error_formatter=error_formatter,
    ...
)
```
### Efficiency Rewards

Use the metrics tracked by `ToolMonitorRubric` in reward functions, for example to reward fewer tool calls:

```python
def efficiency_reward(state: vf.State) -> float:
    """Reward fewer tool calls."""
    metrics = state["metrics"]
    num_calls = metrics.get("total_tool_calls", 0)
    if state["reward"] == 1.0:  # Correct answer
        return 1.0 / (1 + num_calls)  # Fewer calls = higher reward
    return 0.0
```
## When to Use

Use ToolEnv for:

- Stateless function calling (calculators, converters, queries)
- API clients (each call is independent)
- Read-only database queries
- File reading operations
- Any idempotent tool

Use StatefulToolEnv for:

- Tools requiring per-rollout state (sandbox IDs, sessions)
- Database transactions
- File writing in isolated environments
- Any tool where state must persist across calls
## See Also