
CliAgentEnv

An environment for executing full agent implementations inside sandboxes, intercepting their API calls to control model interactions.
CliAgentEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

CliAgentEnv enables running arbitrary agent code (Python scripts, Node.js apps, etc.) in isolated sandbox containers. It:
  • Intercepts the agent’s API requests via HTTP proxy
  • Translates requests to Verifiers’ multi-turn rollout loop
  • Manages sandbox lifecycle and resource provisioning
  • Monitors agent process completion and timeouts

Inheritance

Environment
└── MultiTurnEnv
    └── CliAgentEnv (with SandboxMixin)

Constructor

CliAgentEnv(
    run_command: str,
    interception_port: int = 8765,
    interception_url: str | None = None,
    max_turns: int = -1,
    timeout_seconds: float = 3600.0,
    poll_interval: float = 5.0,
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    environment_vars: dict[str, str] | None = None,
    **kwargs
)
run_command (str, required)
Command to start the agent inside the sandbox (e.g., "python agent.py").

interception_port (int, default: 8765)
Local port for the HTTP interception server.

interception_url (str | None, default: None)
Optional external URL for interception. If None, uses Prime Tunnel.

max_turns (int, default: -1)
Maximum API calls per rollout; -1 for unlimited.

timeout_seconds (float, default: 3600.0)
Rollout timeout in seconds.

poll_interval (float, default: 5.0)
Interval in seconds between checks for agent completion.

docker_image (str, default: "python:3.11-slim")
Docker image for the sandbox.

start_command (str, default: "tail -f /dev/null")
Initial command that keeps the sandbox alive.

cpu_cores (int, default: 1)
CPU cores allocated to the sandbox.

memory_gb (int, default: 2)
Memory in GB allocated to the sandbox.

disk_size_gb (int, default: 5)
Disk size in GB for the sandbox.

gpu_count (int, default: 0)
Number of GPUs allocated to the sandbox.

timeout_minutes (int, default: 60)
Sandbox timeout in minutes.

environment_vars (dict[str, str] | None, default: None)
Custom environment variables for the sandbox.

How It Works

1. Interception Setup

CliAgentEnv starts a local HTTP server on interception_port and exposes it to the sandbox, via Prime Tunnel unless an explicit interception_url is provided. The agent's OpenAI-compatible API requests are routed through this endpoint, where each request is intercepted and mapped onto a turn of the multi-turn rollout loop.

2. Environment Variables

The agent receives:
OPENAI_BASE_URL=<tunnel_url>/rollout/<rollout_id>/v1
OPENAI_MODEL=<model_name>
OPENAI_TIMEOUT=600
OPENAI_REQUEST_TIMEOUT=600
HTTPX_TIMEOUT=600
# Plus any custom vars from environment_vars parameter
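The per-rollout base URL follows the pattern shown above: the tunnel URL plus a /rollout/&lt;rollout_id&gt;/v1 suffix. A hypothetical helper illustrating that composition (build_base_url is not a function exposed by CliAgentEnv):

```python
def build_base_url(tunnel_url: str, rollout_id: str) -> str:
    # Mirrors the OPENAI_BASE_URL pattern: <tunnel_url>/rollout/<rollout_id>/v1
    return f"{tunnel_url.rstrip('/')}/rollout/{rollout_id}/v1"
```

Because the rollout ID is embedded in the URL, the interception server can attribute each incoming API request to the correct concurrent rollout.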

3. Lifecycle

  1. Setup: Create sandbox, start tunnel, upload assets
  2. Execution: Launch agent via background job
  3. Interception: Handle each API request as a turn
  4. Completion: Detect agent exit or timeout
  5. Cleanup: Destroy sandbox, stop tunnel
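The completion-detection step amounts to a polling loop governed by poll_interval and timeout_seconds. A minimal sketch of that loop (illustrative only; wait_for_agent and its injectable clock/sleep parameters are not part of the CliAgentEnv API):

```python
import time


def wait_for_agent(
    check_done,
    timeout_seconds: float = 3600.0,
    poll_interval: float = 5.0,
    clock=time.monotonic,
    sleep=time.sleep,
) -> bool:
    """Poll check_done() until it returns True or the timeout elapses.

    Sketch of steps 2-4 above; clock and sleep are injectable so the
    loop can be exercised in tests without real waiting.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if check_done():
            return True  # agent process exited
        sleep(poll_interval)
    return False  # agent timed out
```

In the real environment the check is a background-job status query against the sandbox rather than a local callback.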

Example Usage

Basic Python Agent

import verifiers as vf
from pathlib import Path

def load_environment():
    # Dataset with tasks
    dataset = vf.Environment.make_dataset([
        {"question": "Write a function to reverse a string"},
        {"question": "Debug this code: print('hello'"},
    ])
    
    def task_success(completion: vf.Messages) -> float:
        """Reward based on agent completing task."""
        return 1.0 if len(completion) > 0 else 0.0
    
    return vf.CliAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(task_success),
        docker_image="python:3.11",
        max_turns=10,
        timeout_seconds=300,
    )

# Agent code (agent.py)
"""
import os
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_BASE_URL from env

instruction = os.environ.get("HARBOR_INSTRUCTION_PATH")
if instruction:
    with open(instruction) as f:
        task = f.read()
else:
    task = "Complete the coding task"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": task}]
)

print(response.choices[0].message.content)
"""

With Custom Asset Upload

import verifiers as vf
from pathlib import Path

class CustomAgentEnv(vf.CliAgentEnv):
    async def post_sandbox_setup(self, state: vf.State) -> None:
        """Upload agent code and dependencies after sandbox creation."""
        sandbox_id = state["sandbox_id"]
        
        # Upload agent script
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/agent.py",
            "./agent_code/main.py"
        )
        
        # Upload requirements
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/requirements.txt",
            "./agent_code/requirements.txt"
        )
        
        # Install dependencies
        await self.sandbox_client.execute_command(
            sandbox_id,
            "pip install -r requirements.txt",
            working_dir="/app"
        )

def load_environment():
    dataset = vf.Environment.make_dataset([{"question": "Task 1"}])
    
    return CustomAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )

Per-Task Docker Images

import verifiers as vf

class MultiImageEnv(vf.CliAgentEnv):
    async def get_docker_image(self, state: vf.State) -> str:
        """Select Docker image based on task."""
        task_type = state.get("info", {}).get("language")
        
        if task_type == "python":
            return "python:3.11"
        elif task_type == "node":
            return "node:20-slim"
        else:
            return self.docker_image  # Fallback

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"question": "Python task", "info": {"language": "python"}},
        {"question": "Node task", "info": {"language": "node"}},
    ])
    
    return MultiImageEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )

Key Methods

build_env_vars

async def build_env_vars(self, state: vf.State) -> dict[str, str]
Build environment variables for the sandbox. Override to add custom variables:
class CustomEnv(vf.CliAgentEnv):
    async def build_env_vars(self, state: vf.State) -> dict[str, str]:
        env_vars = await super().build_env_vars(state)
        env_vars["CUSTOM_VAR"] = "value"
        env_vars["TASK_ID"] = state.get("task", "")
        return env_vars

post_sandbox_setup

async def post_sandbox_setup(self, state: vf.State) -> None
Hook called after sandbox creation but before agent starts. Use for file uploads, dependency installation, etc.

post_rollout

async def post_rollout(self, state: vf.State) -> None
Hook called after agent completes but before sandbox destruction. Use for extracting artifacts or computing rewards that require sandbox access:
class ArtifactEnv(vf.CliAgentEnv):
    async def post_rollout(self, state: vf.State):
        sandbox_id = state["sandbox_id"]
        
        # Download agent output
        result = await self.sandbox_client.execute_command(
            sandbox_id,
            "cat /app/output.json",
            working_dir="/app"
        )
        
        state["agent_output"] = result.stdout

State Keys

CliAgentEnv adds these state keys:
rollout_id
str
Unique identifier for this rollout.
sandbox_id
str
Prime Sandbox ID.
interception_base_url
str
Full interception URL with rollout ID.
agent_completed
bool
Whether the agent process has finished.
agent_exit_code
int
Agent process exit code.
agent_stdout
str
Captured stdout from agent.
agent_stderr
str
Captured stderr from agent.
agent_timed_out
bool
Whether agent exceeded timeout.
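These keys are available to hooks and reward functions. For example, a reward based on the agent's exit status might look like the following (exit_code_reward is an illustrative name; how it is wired into a Rubric is up to you):

```python
def exit_code_reward(state: dict) -> float:
    """Return 1.0 when the agent exited cleanly, else 0.0.

    Reads the CliAgentEnv state keys listed above; this helper itself
    is a sketch, not part of the verifiers API.
    """
    if state.get("agent_timed_out"):
        return 0.0
    return 1.0 if state.get("agent_exit_code") == 0 else 0.0
```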

Stop Conditions

Rollout stops when:
  1. Agent process exits (detected via background job polling)
  2. timeout_seconds is exceeded
  3. max_turns is reached
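Taken together, the three checks reduce to something like this sketch (should_stop is not a real method, and the actual implementation may differ):

```python
def should_stop(
    state: dict,
    elapsed_seconds: float,
    turns_taken: int,
    timeout_seconds: float = 3600.0,
    max_turns: int = -1,
) -> bool:
    """Illustrative stop check mirroring the three conditions above."""
    if state.get("agent_completed"):  # 1. agent process exited
        return True
    if elapsed_seconds >= timeout_seconds:  # 2. rollout timeout
        return True
    if max_turns != -1 and turns_taken >= max_turns:  # 3. turn limit
        return True
    return False
```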

Error Handling

  • Sandbox creation failure: Raises SandboxCreationError
  • Tunnel failure: Raises TunnelError with frpc logs
  • Agent timeout: Sets state["agent_timed_out"] = True
  • Infrastructure errors: Sets state["error"] to InfraError instance
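Since infrastructure errors and timeouts both land in state, a reward function or post-hook can guard on them before inspecting agent output. A minimal sketch (rollout_failed is a hypothetical helper, not part of the API):

```python
def rollout_failed(state: dict) -> bool:
    """True if the rollout hit an infrastructure error or the agent timed out."""
    return state.get("error") is not None or bool(state.get("agent_timed_out"))
```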

Debugging

Enable detailed logging:
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("verifiers.envs.experimental.cli_agent_env")
logger.setLevel(logging.DEBUG)

# Also enable httpx logs for tunnel debugging
import os
os.environ["HTTPX_LOG_LEVEL"] = "DEBUG"

Limitations

  • Streaming: Agent must use non-streaming API calls (streaming synthesis is WIP)
  • Tool calls: Agent can use tools, but schemas are normalized to OpenAI format
  • Timeouts: Long-running agents may hit sandbox timeout limits
