
CliAgentEnv

An environment for executing full agent implementations inside sandboxes, intercepting their API calls to control model interactions.
CliAgentEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

CliAgentEnv enables running arbitrary agent code (Python scripts, Node.js apps, etc.) in isolated sandbox containers. It:
  • Intercepts the agent’s API requests via HTTP proxy
  • Translates requests to Verifiers’ multi-turn rollout loop
  • Manages sandbox lifecycle and resource provisioning
  • Monitors agent process completion and timeouts

Inheritance

Environment
└── MultiTurnEnv
    └── CliAgentEnv (with SandboxMixin)

Constructor

CliAgentEnv(
    run_command: str,
    interception_port: int = 8765,
    interception_url: str | None = None,
    max_turns: int = -1,
    timeout_seconds: float = 3600.0,
    poll_interval: float = 5.0,
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    environment_vars: dict[str, str] | None = None,
    **kwargs
)
run_command (str, required)
Command to start the agent inside the sandbox (e.g., "python agent.py").

interception_port (int, default: 8765)
Local port for the HTTP interception server.

interception_url (str | None, default: None)
Optional external URL for interception. If None, uses Prime Tunnel.

max_turns (int, default: -1)
Maximum API calls per rollout; -1 for unlimited.

timeout_seconds (float, default: 3600.0)
Rollout timeout in seconds.

poll_interval (float, default: 5.0)
Interval in seconds between checks for agent completion.

docker_image (str, default: "python:3.11-slim")
Docker image for the sandbox.

start_command (str, default: "tail -f /dev/null")
Initial command that keeps the sandbox alive.

cpu_cores (int, default: 1)
CPU cores allocated to the sandbox.

memory_gb (int, default: 2)
Memory in GB allocated to the sandbox.

disk_size_gb (int, default: 5)
Disk size in GB for the sandbox.

gpu_count (int, default: 0)
Number of GPUs allocated to the sandbox.

timeout_minutes (int, default: 60)
Sandbox timeout in minutes.

environment_vars (dict[str, str] | None, default: None)
Custom environment variables for the sandbox.

How It Works

1. Interception Setup

CliAgentEnv starts a local HTTP server on interception_port and exposes it to the sandbox, via Prime Tunnel unless an explicit interception_url is provided. The agent's OpenAI-compatible API requests are routed through this endpoint, where each request is intercepted and mapped onto a turn of the multi-turn rollout loop.

2. Environment Variables

The agent receives:
OPENAI_BASE_URL=<tunnel_url>/rollout/<rollout_id>/v1
OPENAI_MODEL=<model_name>
OPENAI_TIMEOUT=600
OPENAI_REQUEST_TIMEOUT=600
HTTPX_TIMEOUT=600
# Plus any custom vars from environment_vars parameter
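The per-rollout base URL follows the pattern shown above: the tunnel URL plus a /rollout/&lt;rollout_id&gt;/v1 suffix. A hypothetical helper illustrating that composition (build_base_url is not a function exposed by CliAgentEnv):

```python
def build_base_url(tunnel_url: str, rollout_id: str) -> str:
    # Mirrors the OPENAI_BASE_URL pattern: <tunnel_url>/rollout/<rollout_id>/v1
    return f"{tunnel_url.rstrip('/')}/rollout/{rollout_id}/v1"
```

Because the rollout ID is embedded in the URL, the interception server can attribute each incoming API request to the correct concurrent rollout.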

3. Lifecycle

  1. Setup: Create sandbox, start tunnel, upload assets
  2. Execution: Launch agent via background job
  3. Interception: Handle each API request as a turn
  4. Completion: Detect agent exit or timeout
  5. Cleanup: Destroy sandbox, stop tunnel
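The completion-detection step amounts to a polling loop governed by poll_interval and timeout_seconds. A minimal sketch of that loop (illustrative only; wait_for_agent and its injectable clock/sleep parameters are not part of the CliAgentEnv API):

```python
import time


def wait_for_agent(
    check_done,
    timeout_seconds: float = 3600.0,
    poll_interval: float = 5.0,
    clock=time.monotonic,
    sleep=time.sleep,
) -> bool:
    """Poll check_done() until it returns True or the timeout elapses.

    Sketch of steps 2-4 above; clock and sleep are injectable so the
    loop can be exercised in tests without real waiting.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if check_done():
            return True  # agent process exited
        sleep(poll_interval)
    return False  # agent timed out
```

In the real environment the check is a background-job status query against the sandbox rather than a local callback.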

Example Usage

Basic Python Agent

import verifiers as vf
from pathlib import Path

def load_environment():
    # Dataset with tasks
    dataset = vf.Environment.make_dataset([
        {"question": "Write a function to reverse a string"},
        {"question": "Debug this code: print('hello'"},
    ])
    
    def task_success(completion: vf.Messages) -> float:
        """Reward based on agent completing task."""
        return 1.0 if len(completion) > 0 else 0.0
    
    return vf.CliAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(task_success),
        docker_image="python:3.11",
        max_turns=10,
        timeout_seconds=300,
    )

# Agent code (agent.py)
"""
import os
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_BASE_URL from env

instruction = os.environ.get("HARBOR_INSTRUCTION_PATH")
if instruction:
    with open(instruction) as f:
        task = f.read()
else:
    task = "Complete the coding task"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": task}]
)

print(response.choices[0].message.content)
"""

With Custom Asset Upload

import verifiers as vf
from pathlib import Path

class CustomAgentEnv(vf.CliAgentEnv):
    async def post_sandbox_setup(self, state: vf.State) -> None:
        """Upload agent code and dependencies after sandbox creation."""
        sandbox_id = state["sandbox_id"]
        
        # Upload agent script
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/agent.py",
            "./agent_code/main.py"
        )
        
        # Upload requirements
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/requirements.txt",
            "./agent_code/requirements.txt"
        )
        
        # Install dependencies
        await self.sandbox_client.execute_command(
            sandbox_id,
            "pip install -r requirements.txt",
            working_dir="/app"
        )

def load_environment():
    dataset = vf.Environment.make_dataset([{"question": "Task 1"}])
    
    return CustomAgentEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )

Per-Task Docker Images

import verifiers as vf

class MultiImageEnv(vf.CliAgentEnv):
    async def get_docker_image(self, state: vf.State) -> str:
        """Select Docker image based on task."""
        task_type = state.get("info", {}).get("language")
        
        if task_type == "python":
            return "python:3.11"
        elif task_type == "node":
            return "node:20-slim"
        else:
            return self.docker_image  # Fallback

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"question": "Python task", "info": {"language": "python"}},
        {"question": "Node task", "info": {"language": "node"}},
    ])
    
    return MultiImageEnv(
        run_command="python /app/agent.py",
        dataset=dataset,
        rubric=vf.Rubric(lambda **kw: 1.0),
    )

Key Methods

build_env_vars

async def build_env_vars(self, state: vf.State) -> dict[str, str]
Build environment variables for the sandbox. Override to add custom variables:
class CustomEnv(vf.CliAgentEnv):
    async def build_env_vars(self, state: vf.State) -> dict[str, str]:
        env_vars = await super().build_env_vars(state)
        env_vars["CUSTOM_VAR"] = "value"
        env_vars["TASK_ID"] = state.get("task", "")
        return env_vars

post_sandbox_setup

async def post_sandbox_setup(self, state: vf.State) -> None
Hook called after sandbox creation but before agent starts. Use for file uploads, dependency installation, etc.

post_rollout

async def post_rollout(self, state: vf.State) -> None
Hook called after agent completes but before sandbox destruction. Use for extracting artifacts or computing rewards that require sandbox access:
class ArtifactEnv(vf.CliAgentEnv):
    async def post_rollout(self, state: vf.State):
        sandbox_id = state["sandbox_id"]
        
        # Download agent output
        result = await self.sandbox_client.execute_command(
            sandbox_id,
            "cat /app/output.json",
            working_dir="/app"
        )
        
        state["agent_output"] = result.stdout

State Keys

CliAgentEnv adds these state keys:
rollout_id
str
Unique identifier for this rollout.
sandbox_id
str
Prime Sandbox ID.
interception_base_url
str
Full interception URL with rollout ID.
agent_completed
bool
Whether the agent process has finished.
agent_exit_code
int
Agent process exit code.
agent_stdout
str
Captured stdout from agent.
agent_stderr
str
Captured stderr from agent.
agent_timed_out
bool
Whether agent exceeded timeout.
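These keys are available to hooks and reward functions. For example, a reward based on the agent's exit status might look like the following (exit_code_reward is an illustrative name; how it is wired into a Rubric is up to you):

```python
def exit_code_reward(state: dict) -> float:
    """Return 1.0 when the agent exited cleanly, else 0.0.

    Reads the CliAgentEnv state keys listed above; this helper itself
    is a sketch, not part of the verifiers API.
    """
    if state.get("agent_timed_out"):
        return 0.0
    return 1.0 if state.get("agent_exit_code") == 0 else 0.0
```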

Stop Conditions

Rollout stops when:
  1. Agent process exits (detected via background job polling)
  2. timeout_seconds is exceeded
  3. max_turns is reached
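Taken together, the three checks reduce to something like this sketch (should_stop is not a real method, and the actual implementation may differ):

```python
def should_stop(
    state: dict,
    elapsed_seconds: float,
    turns_taken: int,
    timeout_seconds: float = 3600.0,
    max_turns: int = -1,
) -> bool:
    """Illustrative stop check mirroring the three conditions above."""
    if state.get("agent_completed"):  # 1. agent process exited
        return True
    if elapsed_seconds >= timeout_seconds:  # 2. rollout timeout
        return True
    if max_turns != -1 and turns_taken >= max_turns:  # 3. turn limit
        return True
    return False
```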

Error Handling

  • Sandbox creation failure: Raises SandboxCreationError
  • Tunnel failure: Raises TunnelError with frpc logs
  • Agent timeout: Sets state["agent_timed_out"] = True
  • Infrastructure errors: Sets state["error"] to InfraError instance
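Since infrastructure errors and timeouts both land in state, a reward function or post-hook can guard on them before inspecting agent output. A minimal sketch (rollout_failed is a hypothetical helper, not part of the API):

```python
def rollout_failed(state: dict) -> bool:
    """True if the rollout hit an infrastructure error or the agent timed out."""
    return state.get("error") is not None or bool(state.get("agent_timed_out"))
```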

Debugging

Enable detailed logging:
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("verifiers.envs.experimental.cli_agent_env")
logger.setLevel(logging.DEBUG)

# Also enable httpx logs for tunnel debugging
import os
os.environ["HTTPX_LOG_LEVEL"] = "DEBUG"

Limitations

  • Streaming: Agent must use non-streaming API calls (streaming synthesis is WIP)
  • Tool calls: Agent can use tools, but schemas are normalized to OpenAI format
  • Timeouts: Long-running agents may hit sandbox timeout limits
