CliAgentEnv
An environment for executing full agent implementations inside sandboxes, intercepting their API calls to control model interactions.
CliAgentEnv is experimental: its API may change in breaking ways in future releases.
Overview
CliAgentEnv enables running arbitrary agent code (Python scripts, Node.js apps, etc.) in isolated sandbox containers. It:
- Intercepts the agent’s API requests via an HTTP proxy
- Translates each intercepted request into a turn of Verifiers’ multi-turn rollout loop
- Manages sandbox lifecycle and resource provisioning
- Monitors agent process completion and timeouts
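The request-to-turn translation can be pictured as a handoff between two coroutines. The following is only a conceptual sketch using `asyncio`, not the actual implementation: `fake_model`, `rollout_loop`, and `agent` are hypothetical stand-ins, and the real environment receives requests over HTTP rather than through an in-process queue.

```python
import asyncio


async def fake_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for the model call made by the rollout loop."""
    return f"echo: {messages[-1]['content']}"


async def rollout_loop(requests: asyncio.Queue, max_turns: int) -> int:
    """Treat each intercepted request as one turn and answer it."""
    turns = 0
    while turns < max_turns:
        item = await requests.get()
        if item is None:  # sentinel: the agent process has exited
            break
        messages, reply = item
        reply.set_result(await fake_model(messages))
        turns += 1
    return turns


async def agent(requests: asyncio.Queue) -> list[str]:
    """Hypothetical agent; each loop iteration models one API request."""
    loop = asyncio.get_running_loop()
    answers = []
    for prompt in ["first", "second"]:
        reply = loop.create_future()
        await requests.put(([{"role": "user", "content": prompt}], reply))
        answers.append(await reply)  # blocks until the rollout loop responds
    await requests.put(None)  # signal that the agent is done
    return answers


async def main() -> tuple[int, list[str]]:
    requests: asyncio.Queue = asyncio.Queue()
    turns, answers = await asyncio.gather(
        rollout_loop(requests, max_turns=10), agent(requests)
    )
    return turns, answers


turns, answers = asyncio.run(main())
```

The agent never talks to a model directly here: every response it sees is produced by the loop on the other side of the queue, which is the property the interception proxy is meant to provide.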
Inheritance
Environment
└── MultiTurnEnv
    └── CliAgentEnv (with SandboxMixin)
Constructor
CliAgentEnv(
    run_command: str,
    interception_port: int = 8765,
    interception_url: str | None = None,
    max_turns: int = -1,
    timeout_seconds: float = 3600.0,
    poll_interval: float = 5.0,
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    environment_vars: dict[str, str] | None = None,
    **kwargs
)
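run_command
str
Command to start the agent inside the sandbox (e.g., "python agent.py").
interception_port
int
default:8765
Local port for the HTTP interception server.
interception_url
str | None
default:None
Optional external URL for interception. If None, uses Prime Tunnel.
max_turns
int
default:-1
Maximum API calls per rollout. -1 for unlimited.
timeout_seconds
float
default:3600.0
Rollout timeout in seconds.
poll_interval
float
default:5.0
Interval for checking agent completion.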
docker_image
str
default:"python:3.11-slim"
Docker image for the sandbox.
start_command
str
default:"tail -f /dev/null"
Initial command to keep sandbox alive.
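cpu_cores
int
default:1
CPU cores allocated to sandbox.
memory_gb
int
default:2
Memory in GB allocated to sandbox.
disk_size_gb
int
default:5
Disk size in GB for sandbox.
gpu_count
int
default:0
Number of GPUs allocated to sandbox.
environment_vars
dict[str, str] | None
default:None
Custom environment variables for the sandbox.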
How It Works
1. Interception Setup
2. Environment Variables
The agent receives:
OPENAI_BASE_URL=<tunnel_url>/rollout/<rollout_id>/v1
OPENAI_MODEL=<model_name>
OPENAI_TIMEOUT=600
OPENAI_REQUEST_TIMEOUT=600
HTTPX_TIMEOUT=600
# Plus any custom vars from environment_vars parameter
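An agent that does not use the OpenAI SDK can read the same variables itself. The sketch below is a minimal illustration: `resolve_endpoint` is a hypothetical helper, the example values are made up, and the `/chat/completions` suffix is the path the OpenAI SDK appends to its base URL.

```python
def resolve_endpoint(env: dict[str, str]) -> tuple[str, str, float]:
    """Return (chat completions URL, model name, timeout in seconds)."""
    base_url = env["OPENAI_BASE_URL"].rstrip("/")
    model = env["OPENAI_MODEL"]
    # Fall back to 600 seconds, matching the value listed above.
    timeout = float(env.get("OPENAI_TIMEOUT", "600"))
    return f"{base_url}/chat/completions", model, timeout


# Hypothetical values standing in for what the sandbox would provide.
example_env = {
    "OPENAI_BASE_URL": "https://tunnel.example/rollout/abc123/v1",
    "OPENAI_MODEL": "example-model",
    "OPENAI_TIMEOUT": "600",
}
url, model, timeout = resolve_endpoint(example_env)
```

Inside the sandbox, the same helper could be called as `resolve_endpoint(dict(os.environ))`.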
3. Lifecycle
- Setup: Create sandbox, start tunnel, upload assets
- Execution: Launch agent via background job
- Interception: Handle each API request as a turn
- Completion: Detect agent exit or timeout
- Cleanup: Destroy sandbox, stop tunnel
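The ordering of these phases can be sketched with an async context manager. This is an illustration only, assuming cleanup is meant to run even if execution fails. `sandbox_session` and `rollout` are hypothetical names, and the phases are recorded as strings instead of performing real sandbox operations.

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def sandbox_session():
    """Hypothetical wrapper pairing setup with guaranteed cleanup."""
    events.append("setup")  # create sandbox, start tunnel, upload assets
    try:
        yield
    finally:
        events.append("cleanup")  # destroy sandbox, stop tunnel


async def rollout(agent_turns: int) -> None:
    async with sandbox_session():
        events.append("execution")  # launch the agent as a background job
        for _ in range(agent_turns):
            events.append("interception")  # one API request handled as one turn
        events.append("completion")  # agent exit or timeout detected


asyncio.run(rollout(agent_turns=2))
```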
Example Usage
Basic Python Agent
import verifiers as vf
from pathlib import Path

def load_environment():
    # Dataset with tasks
    dataset = vf.Environment.make_dataset([
        {"question": "Write a function to reverse a string"},
        {"question": "Debug this code: print('hello'"},
    ])

    def task_success(completion: vf.Messages) -> float:
        """Reward based on agent completing task."""
        return 1.0 if len(completion) > 0 else 0.0

    return vf.CliAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(task_success),
        docker_image="python:3.11",
        max_turns=10,
        timeout_seconds=300,
    )

# Agent code (agent.py)
"""
import os
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_BASE_URL from env

instruction = os.environ.get("HARBOR_INSTRUCTION_PATH")
if instruction:
    with open(instruction) as f:
        task = f.read()
else:
    task = "Complete the coding task"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": task}]
)
print(response.choices[0].message.content)
"""
With Custom Asset Upload
import verifiers as vf
from pathlib import Path

class CustomAgentEnv(vf.CliAgentEnv):
    async def post_sandbox_setup(self, state: vf.State) -> None:
        """Upload agent code and dependencies after sandbox creation."""
        sandbox_id = state["sandbox_id"]

        # Upload agent script
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/agent.py",
            "./agent_code/main.py"
        )

        # Upload requirements
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/requirements.txt",
            "./agent_code/requirements.txt"
        )

        # Install dependencies
        await self.sandbox_client.execute_command(
            sandbox_id,
            "pip install -r requirements.txt",
            working_dir="/app"
        )

def load_environment():
    dataset = vf.Environment.make_dataset([{"question": "Task 1"}])
    return CustomAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )
Per-Task Docker Images
import verifiers as vf

class MultiImageEnv(vf.CliAgentEnv):
    async def get_docker_image(self, state: vf.State) -> str:
        """Select Docker image based on task."""
        task_type = state.get("info", {}).get("language")
        if task_type == "python":
            return "python:3.11"
        elif task_type == "node":
            return "node:20-slim"
        else:
            return self.docker_image  # Fallback

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"question": "Python task", "info": {"language": "python"}},
        {"question": "Node task", "info": {"language": "node"}},
    ])
    return MultiImageEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )
Key Methods
build_env_vars
async def build_env_vars(self, state: vf.State) -> dict[str, str]
Build environment variables for the sandbox. Override to add custom variables:
class CustomEnv(vf.CliAgentEnv):
    async def build_env_vars(self, state: vf.State) -> dict[str, str]:
        env_vars = await super().build_env_vars(state)
        env_vars["CUSTOM_VAR"] = "value"
        env_vars["TASK_ID"] = state.get("task", "")
        return env_vars
post_sandbox_setup
async def post_sandbox_setup(self, state: vf.State) -> None
Hook called after sandbox creation but before agent starts. Use for file uploads, dependency installation, etc.
post_rollout
async def post_rollout(self, state: vf.State) -> None
Hook called after agent completes but before sandbox destruction. Use for extracting artifacts or computing rewards that require sandbox access:
class ArtifactEnv(vf.CliAgentEnv):
    async def post_rollout(self, state: vf.State) -> None:
        sandbox_id = state["sandbox_id"]
        # Read the agent's output file while the sandbox still exists
        result = await self.sandbox_client.execute_command(
            sandbox_id,
            "cat /app/output.json",
            working_dir="/app"
        )
        state["agent_output"] = result.stdout
State Keys
CliAgentEnv adds these state keys:
Unique identifier for this rollout.
Full interception URL with rollout ID.
Whether the agent process has finished.
Captured stdout from agent.
Captured stderr from agent.
Whether agent exceeded timeout.
Stop Conditions
Rollout stops when:
- Agent process exits (detected via background job polling)
- timeout_seconds is exceeded
- max_turns is reached
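As a rough illustration, the three conditions can be combined into a single check. `stop_reason` is a hypothetical helper, and the order in which the conditions are tested is an assumption. Only the conditions themselves and the meaning of `max_turns=-1` come from this page.

```python
from typing import Optional


def stop_reason(
    agent_exited: bool,
    elapsed_seconds: float,
    timeout_seconds: float,
    turns: int,
    max_turns: int,
) -> Optional[str]:
    """Return why the rollout should stop, or None to keep going."""
    if agent_exited:
        return "agent_exited"
    if elapsed_seconds > timeout_seconds:
        return "timeout"
    # max_turns == -1 means unlimited, so the turn check is skipped.
    if max_turns != -1 and turns >= max_turns:
        return "max_turns"
    return None
```

For example, `stop_reason(False, 12.0, 300.0, 10, 10)` returns `"max_turns"`, while the same call with `max_turns=-1` returns `None`.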
Error Handling
- Sandbox creation failure: Raises SandboxCreationError
- Tunnel failure: Raises TunnelError with frpc logs
- Agent timeout: Sets state["agent_timed_out"] = True
- Infrastructure errors: Sets state["error"] to InfraError instance
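Because timeouts and infrastructure errors are reported through state rather than raised, code that consumes rollouts may want to inspect those keys explicitly. The following sketch uses a plain dict in place of `vf.State`. `summarize_outcome` is a hypothetical helper, and checking `error` before `agent_timed_out` is an arbitrary choice.

```python
def summarize_outcome(state: dict) -> str:
    """Classify a finished rollout from the state keys described above."""
    if state.get("error") is not None:
        return f"infra_error: {state['error']}"
    if state.get("agent_timed_out"):
        return "timed_out"
    return "ok"


# Plain dicts standing in for rollout state (hypothetical values).
outcomes = [
    summarize_outcome({"agent_timed_out": False}),
    summarize_outcome({"agent_timed_out": True}),
    summarize_outcome({"error": "sandbox unreachable"}),
]
```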
Debugging
Enable detailed logging:
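import logging
import os

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("verifiers.envs.experimental.cli_agent_env")
logger.setLevel(logging.DEBUG)

# Also enable httpx logs for tunnel debugging.
# Older httpx releases read HTTPX_LOG_LEVEL; newer ones use the standard "httpx" logger.
os.environ["HTTPX_LOG_LEVEL"] = "DEBUG"
logging.getLogger("httpx").setLevel(logging.DEBUG)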
Limitations
- Streaming: Agent must use non-streaming API calls (streaming synthesis is WIP)
- Tool calls: Agent can use tools, but schemas are normalized to OpenAI format
- Timeouts: Long-running agents may hit sandbox timeout limits