SandboxEnv

Environment for executing commands in isolated Docker containers using Prime Sandboxes.

Overview

SandboxEnv provides isolated container execution with:

Container isolation: Each rollout gets its own Docker container
Resource control: Configurable CPU, memory, disk, and GPU allocation
Persistent containers: Containers persist across multiple commands within a rollout
Automatic cleanup: Containers are destroyed after rollout completion
Retry logic: Built-in exponential backoff for transient failures

Inheritance

Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv
            └── SandboxEnv
                └── PythonEnv

Constructor

SandboxEnv(
    sandbox_name: str = "sandbox-env",
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    timeout_per_command_seconds: int = 30,
    environment_vars: dict[str, str] | None = None,
    team_id: str | None = None,
    advanced_configs: AdvancedConfigs | None = None,
    labels: list[str] | None = None,
    max_retries: int = 5,
    base_delay: float = 0.5,
    backoff_factor: float = 2.0,
    max_backoff_seconds: float = 30.0,
    jitter: float = 1e-3,
    stop_errors: list[type[Exception]] | None = None,
    sandbox_client_max_workers: int = 50,
    sandbox_client_max_connections: int = 100,
    sandbox_client_max_keepalive_connections: int = 50,
    **kwargs
)

Parameters

sandbox_name

str

default:"sandbox-env"

Name prefix for created sandboxes.

docker_image

str

default:"python:3.11-slim"

Docker image to use for the sandbox container.

start_command

str

default:"tail -f /dev/null"

Command to run when the container starts. Use tail -f /dev/null to keep the container alive for interactive commands.

cpu_cores

int

default:"1"

Number of CPU cores to allocate.

memory_gb

int

default:"2"

Memory allocation in GB.

disk_size_gb

int

default:"5"

Disk space allocation in GB.

gpu_count

int

default:"0"

Number of GPUs to allocate.

timeout_minutes

int

default:"60"

Maximum lifetime of the sandbox container in minutes.

timeout_per_command_seconds

int

default:"30"

Timeout for individual command executions.

environment_vars

dict[str, str] | None

Environment variables to set in the container.

team_id

str | None

Prime Sandboxes team identifier.

advanced_configs

AdvancedConfigs | None

Advanced sandbox configuration options from prime-sandboxes.

labels

list[str] | None

Labels to attach to the sandbox for organization.

max_retries

int

default:"5"

Maximum number of retry attempts for sandbox operations.

base_delay

float

default:"0.5"

Initial delay in seconds for exponential backoff.

backoff_factor

float

default:"2.0"

Multiplier for exponential backoff delays.

max_backoff_seconds

float

default:"30.0"

Maximum delay between retries.

jitter

float

default:"1e-3"

Random jitter added to retry delays to prevent thundering herd.

stop_errors

list[type[Exception]] | None

default:"[vf.SandboxError]"

Exception types that should stop the rollout immediately.

sandbox_client_max_workers

int

default:"50"

Maximum number of worker threads for the sandbox client.

sandbox_client_max_connections

int

default:"100"

Maximum number of HTTP connections.

sandbox_client_max_keepalive_connections

int

default:"50"

Maximum number of keepalive HTTP connections.

All other parameters are inherited from StatefulToolEnv.

Tools

bash

async def bash(
    command: str,
    working_dir: str | None = None
) -> str

Execute a bash command in the sandbox container.

command

str

Bash command to execute.

working_dir

str | None

Working directory for command execution. Defaults to the container’s default directory.

Returns: str - Combined stdout and stderr output. Output format:

stdout content
stderr content prefixed with “stderr:” (if any)
(no output) if the command produced no output
Error: Command timed out after Ns if the command timed out, where N is timeout_per_command_seconds

The sandbox_id and sandbox_state parameters are hidden from the model and injected automatically via update_tool_args().
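
The output format listed above can be sketched as a small formatter. This is an illustration only, not the SandboxEnv implementation: the helper names format_command_output and format_timeout are hypothetical, and the newline separator and whitespace trimming are assumptions.

```python
def format_command_output(stdout: str, stderr: str) -> str:
    """Hypothetical helper: combine stdout and stderr in the documented format."""
    parts = []
    if stdout.strip():
        parts.append(stdout.rstrip("\n"))
    if stderr.strip():
        # Label stderr so the two streams can be told apart (separator is assumed).
        parts.append("stderr:\n" + stderr.rstrip("\n"))
    # Fall back to a placeholder when both streams are empty.
    return "\n".join(parts) if parts else "(no output)"


def format_timeout(timeout_seconds: int) -> str:
    """Hypothetical helper: message returned when a command times out."""
    return f"Error: Command timed out after {timeout_seconds}s"
```

Returning a string in every case, including timeouts, means the model always receives a tool result it can react to.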

Core Methods

setup_state

async def setup_state(state: vf.State, **kwargs) -> vf.State

Create a sandbox container for this rollout. Override to customize initialization. State keys added:

state["sandbox_id"]: Unique sandbox identifier
state["sandbox_state"]: Sandbox metadata (ready status, timing)

get_sandbox_request

def get_sandbox_request(state: vf.State) -> CreateSandboxRequest

Return the sandbox creation request for this rollout. Override to customize per-state configuration.

state

vf.State

Current rollout state.

Returns: CreateSandboxRequest - Sandbox configuration.

post_rollout

async def post_rollout(state: vf.State)

Run custom logic after rollout completes but before sandbox destruction. Override to cache results from the sandbox into state.

state

vf.State

Final rollout state (can be modified).

update_tool_args

def update_tool_args(
    tool_name: str,
    tool_args: dict[str, Any],
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> dict[str, Any]

Inject sandbox_id, sandbox_state, and working_dir into bash tool calls. Implemented by SandboxEnv.
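
As a rough sketch of the injection step, the function below adds the hidden arguments to a bash call using plain dictionaries. The name inject_sandbox_args and the sample IDs are hypothetical, and working_dir handling is left out because the text above does not say where its value comes from.

```python
from typing import Any


def inject_sandbox_args(
    tool_name: str,
    tool_args: dict[str, Any],
    state: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical stand-in for update_tool_args(): add hidden arguments."""
    if tool_name != "bash":
        return tool_args  # other tools are passed through unchanged
    updated = dict(tool_args)  # copy so the model's original call is untouched
    updated["sandbox_id"] = state["sandbox_id"]
    updated["sandbox_state"] = state["sandbox_state"]
    return updated


# The model only supplies "command"; the rest comes from rollout state.
state = {"sandbox_id": "sbx-123", "sandbox_state": {"ready": True}}
call = inject_sandbox_args("bash", {"command": "ls"}, state)
```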

Cleanup Methods

destroy_sandbox

@vf.cleanup
async def destroy_sandbox(state: vf.State)

Delete the sandbox after rollout completion. Runs automatically as a cleanup handler.

teardown_sandboxes

@vf.teardown
async def teardown_sandboxes()

Delete all remaining sandboxes during environment shutdown. Runs automatically on exit.

bulk_delete_sandboxes

async def bulk_delete_sandboxes(global_ids: list[str]) -> None

Delete multiple sandboxes by their global IDs in a single operation.

global_ids

list[str]

List of sandbox IDs to delete.

State Management

SandboxEnv adds sandbox-specific state:

state["sandbox_id"] = str           # Unique sandbox identifier
state["sandbox_state"] = {
    "ready": bool,                  # Whether sandbox is ready for commands
    "ready_wait_time": float,       # Time spent waiting for creation
    "command_execution_times": list # Duration of each command in seconds
}

Built-in Rubric

SandboxEnv includes SandboxMonitorRubric which tracks:

sandbox_ready_wait_time: Time spent waiting for sandbox creation
sandbox_command_execution_time: Average command execution time
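
For orientation, the two metrics can be derived from state["sandbox_state"] roughly as follows. This is a sketch under assumptions, not the SandboxMonitorRubric code: the function name sandbox_metrics is hypothetical, and reporting 0.0 when no commands ran is an assumed convention.

```python
from statistics import fmean
from typing import Any


def sandbox_metrics(state: dict[str, Any]) -> dict[str, float]:
    """Hypothetical helper: compute the two monitoring metrics from state."""
    sandbox_state = state["sandbox_state"]
    times = sandbox_state["command_execution_times"]
    return {
        "sandbox_ready_wait_time": float(sandbox_state["ready_wait_time"]),
        # Average over all commands; 0.0 if none ran (assumed convention).
        "sandbox_command_execution_time": fmean(times) if times else 0.0,
    }
```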

Example Usage

Basic File Operations

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {
                "task": "Create a file called hello.txt with 'Hello, World!'",
            },
        ]
    )
    
    def file_created(completion: vf.Messages) -> float:
        """Check if file creation was mentioned."""
        text = str(completion).lower()
        return 1.0 if "hello.txt" in text else 0.0
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(file_created),
        system_prompt="Use bash commands to complete the task.",
        docker_image="ubuntu:22.04",
        max_turns=5
    )

Custom Docker Image

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Install and run nginx"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Set up and configure the web server.",
        docker_image="ubuntu:22.04",
        cpu_cores=2,
        memory_gb=4,
        max_turns=10
    )

With Environment Variables

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Use the API_KEY environment variable to authenticate"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Access the API using the provided credentials.",
        environment_vars={
            "API_KEY": "test-key-123",
            "API_URL": "https://api.example.com"
        },
        max_turns=5
    )

GPU-Enabled Sandbox

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Run a CUDA program to verify GPU access"},
        ]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Verify GPU availability and run CUDA code.",
        docker_image="nvidia/cuda:12.0.0-base-ubuntu22.04",
        gpu_count=1,
        memory_gb=8,
        max_turns=10
    )

Custom Start Command

import verifiers as vf

def load_environment():
    # Start a web server on container startup
    start_cmd = """
    bash -c '
    apt-get update && apt-get install -y python3-pip
    pip3 install flask
    python3 -c "from flask import Flask; app = Flask(__name__); app.run(port=8000)" &
    tail -f /dev/null
    '
    """
    
    dataset = vf.Environment.make_dataset(
        [{"task": "Test the Flask server running on port 8000"}]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Interact with the running Flask server.",
        start_command=start_cmd,
        max_turns=10
    )

Per-Task Sandbox Configuration

import verifiers as vf
from prime_sandboxes import CreateSandboxRequest

class CustomSandboxEnv(vf.SandboxEnv):
    def get_sandbox_request(self, state: vf.State) -> CreateSandboxRequest:
        """Customize sandbox based on task requirements."""
        task = state["input"]
        
        # Use GPU for ML tasks
        if "machine learning" in task.get("description", "").lower():
            request = self.sandbox_request.model_copy()
            request.gpu_count = 1
            request.memory_gb = 8
            return request
        
        # Default config for other tasks
        return self.sandbox_request.model_copy()

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"description": "Train a machine learning model"},
        {"description": "Process text files"},
    ])
    
    return CustomSandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0)
    )

Caching Sandbox Results

import verifiers as vf

class ResultCachingEnv(vf.SandboxEnv):
    async def post_rollout(self, state: vf.State):
        """Cache results before sandbox is destroyed."""
        # Read files from sandbox and cache in state
        sandbox_id = state["sandbox_id"]
        sandbox_state = state["sandbox_state"]
        
        # Get file contents
        result = await self.bash(
            "cat /tmp/results.json",
            sandbox_id=sandbox_id,
            sandbox_state=sandbox_state,
        )
        
        # Cache in state for reward functions
        state["cached_results"] = result

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Generate results.json with analysis"}]
    )
    
    def check_results(state: vf.State) -> float:
        """Use cached results in reward function."""
        results = state.get("cached_results", "")
        return 1.0 if "analysis" in results else 0.0
    
    return ResultCachingEnv(
        dataset=dataset,
        rubric=vf.Rubric(check_results)
    )

With Retry Configuration

import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Flaky network operation"}]
    )
    
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        # Custom retry settings
        max_retries=10,
        base_delay=1.0,
        backoff_factor=3.0,
        max_backoff_seconds=60.0,
        max_turns=5
    )
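
To see what these settings imply, the sketch below computes a capped exponential backoff schedule. It assumes the delay before retry n is base_delay * backoff_factor ** n, capped at max_backoff_seconds, with one wait per retry; jitter is omitted so the numbers are deterministic. The function name retry_delays is hypothetical and the exact formula used by SandboxEnv may differ.

```python
def retry_delays(
    max_retries: int,
    base_delay: float,
    backoff_factor: float,
    max_backoff_seconds: float,
) -> list[float]:
    """Hypothetical helper: capped exponential backoff schedule, without jitter."""
    return [
        min(base_delay * backoff_factor ** attempt, max_backoff_seconds)
        for attempt in range(max_retries)
    ]


# Settings from the example above: delays grow 3x per retry until the 60 s cap.
custom = retry_delays(10, 1.0, 3.0, 60.0)
# Constructor defaults: delays double from 0.5 s and never reach the 30 s cap.
defaults = retry_delays(5, 0.5, 2.0, 30.0)
```

Under these assumptions the custom schedule reaches the cap on the fifth retry, so raising max_retries beyond that point adds 60 seconds of waiting per extra attempt.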

Error Handling

Error Types

SandboxCreationError: Failed to create sandbox
SandboxNotReadyError: Sandbox failed to become ready
vf.SandboxError: Base class for sandbox errors

SandboxCreationError and SandboxNotReadyError inherit from vf.SandboxError, which is included in stop_errors by default.
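
Because stop_errors holds exception types, listing the base class also covers its subclasses. The sketch below shows that matching with stand-in classes defined locally; they only mirror the names above and are not imported from verifiers, and should_stop is a hypothetical helper.

```python
class SandboxError(Exception):
    """Stand-in for vf.SandboxError."""


class SandboxCreationError(SandboxError):
    """Stand-in: sandbox could not be created."""


class SandboxNotReadyError(SandboxError):
    """Stand-in: sandbox never became ready."""


def should_stop(error: Exception, stop_errors: list[type[Exception]]) -> bool:
    """Hypothetical check: does this error match any configured stop type?"""
    # isinstance() accepts a tuple of types and honors inheritance.
    return isinstance(error, tuple(stop_errors))
```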

Command Timeouts

Commands that exceed timeout_per_command_seconds return an error message:

"Error: Command timed out after 30s"

The timeout is logged but does not raise an exception, allowing the model to retry or adjust.

Implementation Details

Lazy Initialization

setup_state() requests a sandbox but does not wait for it to become ready:

Container creation is queued asynchronously
First bash() call awaits container readiness
Subsequent calls execute immediately

This overlaps provisioning with other rollout setup.
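
The pattern can be sketched with plain asyncio. LazySandbox below is a hypothetical stand-in, not the SandboxEnv implementation: provisioning is simulated with a short sleep, and the sandbox ID is made up.

```python
import asyncio


class LazySandbox:
    """Hypothetical stand-in illustrating lazy readiness."""

    def __init__(self) -> None:
        self._ready: asyncio.Task | None = None
        self.log: list[str] = []

    async def _provision(self) -> str:
        await asyncio.sleep(0.01)  # simulates container start-up latency
        self.log.append("provisioned")
        return "sbx-demo"

    def setup(self) -> None:
        # Queue creation without waiting; must be called inside a running loop.
        self._ready = asyncio.create_task(self._provision())

    async def bash(self, command: str) -> str:
        assert self._ready is not None, "call setup() first"
        # The first call waits for provisioning; later calls return at once.
        sandbox_id = await self._ready
        self.log.append(command)
        return f"{sandbox_id}: {command}"


async def demo() -> list[str]:
    sandbox = LazySandbox()
    sandbox.setup()
    sandbox.log.append("other setup")  # runs while provisioning is pending
    await sandbox.bash("echo one")
    await sandbox.bash("echo two")
    return sandbox.log
```

Running asyncio.run(demo()) shows the other setup work being logged before provisioning finishes, which is the overlap described above.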

Cleanup Guarantees

Sandboxes are cleaned up via multiple mechanisms:

@vf.cleanup handler runs after each rollout
@vf.teardown handler runs on environment shutdown
Sandboxes auto-destroy after timeout_minutes

Bulk Operations

Use bulk_delete_sandboxes() to delete multiple sandboxes efficiently:

env = vf.SandboxEnv(...)
# ... create sandboxes ...
await env.bulk_delete_sandboxes(["sandbox-1", "sandbox-2", "sandbox-3"])

Batches of up to 100 sandboxes are deleted per API call.
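
The batching itself amounts to slicing the ID list into chunks of at most 100. A minimal sketch, with the hypothetical helper name batched_ids; how SandboxEnv issues the per-batch API calls is not shown here.

```python
def batched_ids(global_ids: list[str], batch_size: int = 100) -> list[list[str]]:
    """Hypothetical helper: split sandbox IDs into batches of at most batch_size."""
    return [
        global_ids[start:start + batch_size]
        for start in range(0, len(global_ids), batch_size)
    ]


# 250 IDs would need three delete calls under this scheme.
batches = batched_ids([f"sandbox-{i}" for i in range(250)])
```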

When to Use

Use SandboxEnv for:

Code execution in isolated environments
System administration tasks
File system operations
Multi-language environments
Security-sensitive operations

Use PythonEnv for:

Python-specific REPL workflows
Persistent Python state across executions
Scientific computing tasks

Use StatefulToolEnv directly for:

Non-sandbox stateful resources (databases, APIs)
Custom state injection patterns
