SandboxEnv

Environment for executing commands in isolated Docker containers using Prime Sandboxes.

Overview

SandboxEnv provides isolated container execution with:
  • Container isolation: Each rollout gets its own Docker container
  • Resource control: Configurable CPU, memory, disk, and GPU allocation
  • Persistent containers: Containers persist across multiple commands within a rollout
  • Automatic cleanup: Containers are destroyed after rollout completion
  • Retry logic: Built-in exponential backoff for transient failures

Inheritance

Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv
            └── SandboxEnv
                └── PythonEnv

Constructor

SandboxEnv(
    sandbox_name: str = "sandbox-env",
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    timeout_per_command_seconds: int = 30,
    environment_vars: dict[str, str] | None = None,
    team_id: str | None = None,
    advanced_configs: AdvancedConfigs | None = None,
    labels: list[str] | None = None,
    max_retries: int = 5,
    base_delay: float = 0.5,
    backoff_factor: float = 2.0,
    max_backoff_seconds: float = 30.0,
    jitter: float = 1e-3,
    stop_errors: list[type[Exception]] | None = None,
    sandbox_client_max_workers: int = 50,
    sandbox_client_max_connections: int = 100,
    sandbox_client_max_keepalive_connections: int = 50,
    **kwargs
)

Parameters

sandbox_name
str
default:"sandbox-env"
Name prefix for created sandboxes.
docker_image
str
default:"python:3.11-slim"
Docker image to use for the sandbox container.
start_command
str
default:"tail -f /dev/null"
Command to run when the container starts. Use tail -f /dev/null to keep the container alive for interactive commands.
cpu_cores
int
default:"1"
Number of CPU cores to allocate.
memory_gb
int
default:"2"
Memory allocation in GB.
disk_size_gb
int
default:"5"
Disk space allocation in GB.
gpu_count
int
default:"0"
Number of GPUs to allocate.
timeout_minutes
int
default:"60"
Maximum lifetime of the sandbox container in minutes.
timeout_per_command_seconds
int
default:"30"
Timeout in seconds for each individual command execution.
environment_vars
dict[str, str] | None
Environment variables to set in the container.
team_id
str | None
Prime Sandboxes team identifier.
advanced_configs
AdvancedConfigs | None
Advanced sandbox configuration options from prime-sandboxes.
labels
list[str] | None
Labels to attach to the sandbox for organization.
max_retries
int
default:"5"
Maximum number of retry attempts for sandbox operations.
base_delay
float
default:"0.5"
Initial delay in seconds for exponential backoff.
backoff_factor
float
default:"2.0"
Multiplier for exponential backoff delays.
max_backoff_seconds
float
default:"30.0"
Maximum delay between retries, in seconds.
jitter
float
default:"1e-3"
Random jitter added to retry delays to prevent thundering herd.
stop_errors
list[type[Exception]] | None
default:"[vf.SandboxError]"
Exception types that should stop the rollout immediately. The constructor default is None, in which case [vf.SandboxError] is used.
sandbox_client_max_workers
int
default:"50"
Maximum number of worker threads for sandbox client.
sandbox_client_max_connections
int
default:"100"
Maximum number of HTTP connections.
sandbox_client_max_keepalive_connections
int
default:"50"
Maximum number of keepalive HTTP connections.
All other parameters are inherited from StatefulToolEnv.
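The retry parameters above interact roughly as follows. This is a minimal sketch of one common exponential-backoff formula, not the library's actual implementation: the helper name retry_delays, the rng argument, and the way jitter is applied (a small uniform offset added after capping) are all assumptions made for illustration.

```python
from __future__ import annotations

import random


def retry_delays(
    max_retries: int = 5,
    base_delay: float = 0.5,
    backoff_factor: float = 2.0,
    max_backoff_seconds: float = 30.0,
    jitter: float = 1e-3,
    rng: random.Random | None = None,
) -> list[float]:
    """Hypothetical helper: delay in seconds before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Grow the delay geometrically, then cap it at max_backoff_seconds.
        delay = min(base_delay * backoff_factor**attempt, max_backoff_seconds)
        # Assumption: jitter is a uniform random offset in [0, jitter].
        delays.append(delay + rng.uniform(0.0, jitter))
    return delays


# With jitter disabled the schedule is deterministic.
print(retry_delays(jitter=0.0))
```

Under these assumptions, the default settings give delays of 0.5, 1, 2, 4, and 8 seconds, so max_backoff_seconds only takes effect with more retries or a larger backoff_factor.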

Tools

bash

async def bash(
    command: str,
    working_dir: str | None = None
) -> str
Execute a bash command in the sandbox container.
command
str
Bash command to execute.
working_dir
str | None
Working directory for command execution. Defaults to the container’s default directory.
Returns: str - Combined stdout and stderr output. Output format:
  • stdout content
  • stderr content prefixed with “stderr:” (if any)
  • (no output) if command produced no output
  • Error: Command timed out after Ns (where N is the timeout in seconds) if the command times out
The sandbox_id and sandbox_state parameters are hidden from the model and injected automatically via update_tool_args().
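The output rules listed above can be sketched as a small formatting function. This is an illustration only: format_bash_output is a hypothetical name, and the exact whitespace handling and the placement of the stderr: prefix on its own line are assumptions, not confirmed library behavior.

```python
def format_bash_output(stdout: str, stderr: str) -> str:
    """Hypothetical helper mirroring the documented output rules."""
    stdout = stdout.strip()
    stderr = stderr.strip()
    parts = []
    if stdout:
        parts.append(stdout)
    if stderr:
        # Assumption: the prefix sits on its own line before the stderr text.
        parts.append(f"stderr:\n{stderr}")
    # A command that printed nothing still yields a readable message.
    return "\n".join(parts) if parts else "(no output)"
```

Returning a placeholder instead of an empty string means the model always receives a non-empty tool result, even for silent commands such as mkdir.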

Core Methods

setup_state

async def setup_state(state: vf.State, **kwargs) -> vf.State
Create a sandbox container for this rollout. Override to customize initialization. State keys added:
  • state["sandbox_id"]: Unique sandbox identifier
  • state["sandbox_state"]: Sandbox metadata (ready status, timing)

get_sandbox_request

def get_sandbox_request(state: vf.State) -> CreateSandboxRequest
Return the sandbox creation request for this rollout. Override to customize per-state configuration.
state
vf.State
Current rollout state.
Returns: CreateSandboxRequest - Sandbox configuration.

post_rollout

async def post_rollout(state: vf.State)
Run custom logic after rollout completes but before sandbox destruction. Override to cache results from the sandbox into state.
state
vf.State
Final rollout state (can be modified).

update_tool_args

def update_tool_args(
    tool_name: str,
    tool_args: dict[str, Any],
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> dict[str, Any]
Inject sandbox_id, sandbox_state, and working_dir into bash tool calls. Implemented by SandboxEnv.
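The injection step can be pictured with plain dictionaries. The sketch below is a simplified stand-in, not the SandboxEnv implementation: inject_sandbox_args is a hypothetical name, state is modelled as an ordinary dict, and working_dir is left untouched because this page does not say where its injected value comes from.

```python
from typing import Any


def inject_sandbox_args(
    tool_name: str, tool_args: dict[str, Any], state: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical stand-in for update_tool_args, using plain dicts."""
    if tool_name != "bash":
        # Other tools receive their arguments unchanged.
        return tool_args
    # Copy so the model-provided arguments are not mutated in place.
    updated = dict(tool_args)
    updated["sandbox_id"] = state["sandbox_id"]
    updated["sandbox_state"] = state["sandbox_state"]
    return updated


state = {"sandbox_id": "sbx-123", "sandbox_state": {"ready": True}}
print(inject_sandbox_args("bash", {"command": "ls"}, state))
```

The model only ever supplies command (and optionally working_dir); the rollout-specific values are merged in just before the tool runs.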

Cleanup Methods

destroy_sandbox

@vf.cleanup
async def destroy_sandbox(state: vf.State)
Delete the sandbox after rollout completion. Runs automatically as a cleanup handler.

teardown_sandboxes

@vf.teardown
async def teardown_sandboxes()
Delete all remaining sandboxes during environment shutdown. Runs automatically on exit.

bulk_delete_sandboxes

async def bulk_delete_sandboxes(global_ids: list[str]) -> None
Delete multiple sandboxes by their global IDs in a single operation.
global_ids
list[str]
List of sandbox IDs to delete.

State Management

SandboxEnv adds sandbox-specific state:
state["sandbox_id"] = str           # Unique sandbox identifier
state["sandbox_state"] = {
    "ready": bool,                  # Whether sandbox is ready for commands
    "ready_wait_time": float,       # Time spent waiting for creation
    "command_execution_times": list # Duration of each command in seconds
}

Built-in Rubric

SandboxEnv includes SandboxMonitorRubric which tracks:
  • sandbox_ready_wait_time: Time spent waiting for sandbox creation
  • sandbox_command_execution_time: Average command execution time
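As a rough illustration of how the average command time could be derived from the state keys described earlier, consider the sketch below. The function name average_command_time and the choice to return 0.0 when no commands were run are assumptions; the actual SandboxMonitorRubric may differ.

```python
def average_command_time(state: dict) -> float:
    """Hypothetical metric: mean of command_execution_times, in seconds."""
    times = state.get("sandbox_state", {}).get("command_execution_times", [])
    # Assumption: a rollout that ran no commands reports 0.0.
    return sum(times) / len(times) if times else 0.0


example_state = {
    "sandbox_id": "sbx-123",
    "sandbox_state": {
        "ready": True,
        "ready_wait_time": 4.2,
        "command_execution_times": [1.0, 3.0],
    },
}
print(average_command_time(example_state))
```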

Example Usage

Basic File Operations

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {
                "task": "Create a file called hello.txt with 'Hello, World!'",
            },
        ]
    )
    
    def file_created(completion: vf.Messages) -> float:
        """Check if file creation was mentioned."""
        text = str(completion).lower()
        return 1.0 if "hello.txt" in text else 0.0
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(file_created),
        system_prompt="Use bash commands to complete the task.",
        docker_image="ubuntu:22.04",
        max_turns=5
    )

Custom Docker Image

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Install and run nginx"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Set up and configure the web server.",
        docker_image="ubuntu:22.04",
        cpu_cores=2,
        memory_gb=4,
        max_turns=10
    )

With Environment Variables

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Use the API_KEY environment variable to authenticate"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Access the API using the provided credentials.",
        environment_vars={
            "API_KEY": "test-key-123",
            "API_URL": "https://api.example.com"
        },
        max_turns=5
    )

GPU-Enabled Sandbox

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Run a CUDA program to verify GPU access"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Verify GPU availability and run CUDA code.",
        docker_image="nvidia/cuda:12.0.0-base-ubuntu22.04",
        gpu_count=1,
        memory_gb=8,
        max_turns=10
    )

Custom Start Command

import verifiers as vf

def load_environment():
    # Start a web server on container startup
    start_cmd = """
    bash -c '
    apt-get update && apt-get install -y python3-pip
    pip3 install flask
    python3 -c "from flask import Flask; app = Flask(__name__); app.run(port=8000)" &
    tail -f /dev/null
    '
    """
    
    dataset = vf.Environment.make_dataset(
        [{"task": "Test the Flask server running on port 8000"}]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Interact with the running Flask server.",
        start_command=start_cmd,
        max_startup_wait_seconds=60,
        max_turns=10
    )

Per-Task Sandbox Configuration

import verifiers as vf
from prime_sandboxes import CreateSandboxRequest

class CustomSandboxEnv(vf.SandboxEnv):
    def get_sandbox_request(self, state: vf.State) -> CreateSandboxRequest:
        """Customize sandbox based on task requirements."""
        task = state["input"]
        
        # Use GPU for ML tasks
        if "machine learning" in task.get("description", "").lower():
            request = self.sandbox_request.model_copy()
            request.gpu_count = 1
            request.memory_gb = 8
            return request
        
        # Default config for other tasks
        return self.sandbox_request.model_copy()

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"description": "Train a machine learning model"},
        {"description": "Process text files"},
    ])
    
    return CustomSandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0)
    )

Caching Sandbox Results

import verifiers as vf

class ResultCachingEnv(vf.SandboxEnv):
    async def post_rollout(self, state: vf.State):
        """Cache results before sandbox is destroyed."""
        # Read files from sandbox and cache in state
        sandbox_id = state["sandbox_id"]
        sandbox_state = state["sandbox_state"]
        
        # Get file contents
        result = await self.bash(
            "cat /tmp/results.json",
            sandbox_id=sandbox_id,
            sandbox_state=sandbox_state
        )
        
        # Cache in state for reward functions
        state["cached_results"] = result

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Generate results.json with analysis"}]
    )
    
    def check_results(state: vf.State) -> float:
        """Use cached results in reward function."""
        results = state.get("cached_results", "")
        return 1.0 if "analysis" in results else 0.0
    
    return ResultCachingEnv(
        dataset=dataset,
        rubric=vf.Rubric(check_results)
    )

With Retry Configuration

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Flaky network operation"}]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        # Custom retry settings
        max_retries=10,
        base_delay=1.0,
        backoff_factor=3.0,
        max_backoff_seconds=60.0,
        max_turns=5
    )

Error Handling

Error Types

  • SandboxCreationError: Failed to create sandbox
  • SandboxNotReadyError: Sandbox failed to become ready
  • vf.SandboxError: Base class for sandbox errors
SandboxCreationError and SandboxNotReadyError both inherit from vf.SandboxError, which is included in stop_errors by default.
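Because the specific errors derive from a common base, listing only the base class in stop_errors is enough to catch all of them. The sketch below uses locally defined stand-in classes rather than the real vf exceptions, and should_stop is a hypothetical helper; it assumes matching is done with isinstance.

```python
class SandboxError(Exception):
    """Stand-in for vf.SandboxError."""


class SandboxCreationError(SandboxError):
    """Stand-in: failed to create sandbox."""


class SandboxNotReadyError(SandboxError):
    """Stand-in: sandbox failed to become ready."""


def should_stop(error: Exception, stop_errors=(SandboxError,)) -> bool:
    """Hypothetical check: does this error match any stop_errors type?"""
    # isinstance matches subclasses, so the base class covers both children.
    return isinstance(error, tuple(stop_errors))


print(should_stop(SandboxCreationError("quota exceeded")))
print(should_stop(ValueError("unrelated")))
```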

Command Timeouts

Commands that exceed timeout_per_command_seconds return an error message:
"Error: Command timed out after 30s"
The timeout is logged but does not raise an exception, allowing the model to retry or adjust.
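The pattern of turning a timeout into a returned message, rather than an exception, can be sketched with asyncio alone. Here run_with_timeout and slow_command are hypothetical names, and a sleeping coroutine stands in for a real sandbox command; how SandboxEnv enforces the timeout internally is not shown on this page.

```python
import asyncio


async def run_with_timeout(coro, timeout_seconds: float) -> str:
    """Hypothetical wrapper: return an error string instead of raising."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The caller (here, the model) sees this text as the tool result.
        return f"Error: Command timed out after {timeout_seconds}s"


async def slow_command() -> str:
    await asyncio.sleep(1.0)  # Stand-in for a long-running command.
    return "done"


print(asyncio.run(run_with_timeout(slow_command(), 0.05)))
```

Since the message comes back as an ordinary tool result, the model can respond to it in the next turn, for example by running a cheaper command or splitting the work.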

Implementation Details

Lazy Initialization

Sandbox creation starts in setup_state(), but setup_state() does not wait for the container to become ready:
  1. Container creation is queued asynchronously
  2. First bash() call awaits container readiness
  3. Subsequent calls execute immediately
This lets container provisioning overlap with the rest of rollout setup.
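The three steps above can be modelled with an asyncio task that is started early and awaited on first use. This is a toy model under stated assumptions: LazySandbox, its methods, and the fixed sandbox ID are hypothetical, and no real containers are involved.

```python
import asyncio


class LazySandbox:
    """Toy model of the lazy pattern; not the SandboxEnv implementation."""

    def __init__(self) -> None:
        self.creations = 0
        self._ready = None

    async def _create(self) -> str:
        await asyncio.sleep(0.01)  # Stand-in for container provisioning.
        self.creations += 1
        return "sandbox-1"

    def setup(self) -> None:
        # Queue creation without waiting; must be called inside an event loop.
        self._ready = asyncio.create_task(self._create())

    async def bash(self, command: str) -> str:
        # The first call waits for provisioning; awaiting a finished task
        # afterwards returns its stored result without waiting again.
        sandbox_id = await self._ready
        return f"[{sandbox_id}] ran: {command}"


async def demo() -> tuple:
    sandbox = LazySandbox()
    sandbox.setup()
    outputs = [await sandbox.bash("echo 1"), await sandbox.bash("echo 2")]
    return outputs, sandbox.creations


print(asyncio.run(demo()))
```

Two commands run, but the creation coroutine executes only once, which is the property the lazy scheme relies on.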

Cleanup Guarantees

Sandboxes are cleaned up via multiple mechanisms:
  1. @vf.cleanup handler runs after each rollout
  2. @vf.teardown handler runs on environment shutdown
  3. Sandboxes auto-destroy after timeout_minutes

Bulk Operations

Use bulk_delete_sandboxes() to delete multiple sandboxes efficiently:
env = vf.SandboxEnv(...)
# ... create sandboxes ...
await env.bulk_delete_sandboxes(["sandbox-1", "sandbox-2", "sandbox-3"])
Batches of up to 100 sandboxes are deleted per API call.
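The batching behavior can be illustrated with a simple chunking helper. batch_ids is a hypothetical function written for this example; only the batch size of 100 comes from the text above, and the real method issues API calls rather than returning lists.

```python
def batch_ids(ids: list[str], batch_size: int = 100) -> list[list[str]]:
    """Hypothetical helper: split IDs into consecutive batches."""
    return [ids[i : i + batch_size] for i in range(0, len(ids), batch_size)]


sandbox_ids = [f"sandbox-{n}" for n in range(250)]
# 250 IDs would need three calls under this scheme.
print([len(batch) for batch in batch_ids(sandbox_ids)])
```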

When to Use

Use SandboxEnv for:
  • Code execution in isolated environments
  • System administration tasks
  • File system operations
  • Multi-language environments
  • Security-sensitive operations
Use PythonEnv for:
  • Python-specific REPL workflows
  • Persistent Python state across executions
  • Scientific computing tasks
Use StatefulToolEnv directly for:
  • Non-sandbox stateful resources (databases, APIs)
  • Custom state injection patterns
