SandboxEnv
Environment for executing commands in isolated Docker containers using Prime Sandboxes.
Overview
SandboxEnv provides isolated container execution with:
- Container isolation: Each rollout gets its own Docker container
- Resource control: Configurable CPU, memory, disk, and GPU allocation
- Persistent containers: Containers persist across multiple commands within a rollout
- Automatic cleanup: Containers are destroyed after rollout completion
- Retry logic: Built-in exponential backoff for transient failures
Inheritance
Environment
└── MultiTurnEnv
    └── ToolEnv
        └── StatefulToolEnv
            └── SandboxEnv
                └── PythonEnv
Constructor
SandboxEnv(
    sandbox_name: str = "sandbox-env",
    docker_image: str = "python:3.11-slim",
    start_command: str = "tail -f /dev/null",
    cpu_cores: int = 1,
    memory_gb: int = 2,
    disk_size_gb: int = 5,
    gpu_count: int = 0,
    timeout_minutes: int = 60,
    timeout_per_command_seconds: int = 30,
    environment_vars: dict[str, str] | None = None,
    team_id: str | None = None,
    advanced_configs: AdvancedConfigs | None = None,
    labels: list[str] | None = None,
    max_retries: int = 5,
    base_delay: float = 0.5,
    backoff_factor: float = 2.0,
    max_backoff_seconds: float = 30.0,
    jitter: float = 1e-3,
    stop_errors: list[type[Exception]] | None = None,
    sandbox_client_max_workers: int = 50,
    sandbox_client_max_connections: int = 100,
    sandbox_client_max_keepalive_connections: int = 50,
    **kwargs
)
Parameters
sandbox_name
str
default:"sandbox-env"
Name prefix for created sandboxes.
docker_image
str
default:"python:3.11-slim"
Docker image to use for the sandbox container.
start_command
str
default:"tail -f /dev/null"
Command to run when the container starts. Use tail -f /dev/null to keep the container alive for interactive commands.
cpu_cores
int
default:1
Number of CPU cores to allocate.
memory_gb
int
default:2
Memory allocation in GB.
disk_size_gb
int
default:5
Disk space allocation in GB.
gpu_count
int
default:0
Number of GPUs to allocate.
timeout_minutes
int
default:60
Maximum lifetime of the sandbox container in minutes.
timeout_per_command_seconds
int
default:30
Timeout in seconds for individual command executions.
environment_vars
dict[str, str] | None
default:None
Environment variables to set in the container.
team_id
str | None
default:None
Prime Sandboxes team identifier.
advanced_configs
AdvancedConfigs | None
default:None
Advanced sandbox configuration options from prime-sandboxes.
labels
list[str] | None
default:None
Labels to attach to the sandbox for organization.
max_retries
int
default:5
Maximum number of retry attempts for sandbox operations.
base_delay
float
default:0.5
Initial delay in seconds for exponential backoff.
backoff_factor
float
default:2.0
Multiplier for exponential backoff delays.
max_backoff_seconds
float
default:30.0
Maximum delay in seconds between retries.
jitter
float
default:1e-3
Random jitter added to retry delays to prevent thundering herd.
stop_errors
list[type[Exception]] | None
default:"[vf.SandboxError]"
Exception types that should stop the rollout immediately.
sandbox_client_max_workers
int
default:50
Maximum number of worker threads for the sandbox client.
sandbox_client_max_connections
int
default:100
Maximum number of HTTP connections.
sandbox_client_max_keepalive_connections
int
default:50
Maximum number of keepalive HTTP connections.
All other parameters are inherited from StatefulToolEnv.
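The retry parameters above combine into an exponential backoff schedule. A minimal sketch of the delay computation, assuming the standard capped-and-jittered formula (the retry_delay helper below is illustrative, not part of the library's API):

```python
import random

def retry_delay(attempt: int, base_delay: float = 0.5, backoff_factor: float = 2.0,
                max_backoff_seconds: float = 30.0, jitter: float = 1e-3) -> float:
    """Delay before retry number `attempt` (0-indexed), capped and jittered."""
    # Grow geometrically, cap at the maximum, then add a small random jitter
    # so many clients retrying at once don't synchronize.
    delay = min(base_delay * backoff_factor ** attempt, max_backoff_seconds)
    return delay + random.uniform(0.0, jitter)
```

With the defaults, delays grow 0.5s, 1s, 2s, 4s, ... until capped at 30s.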
bash
async def bash(
    command: str,
    working_dir: str | None = None
) -> str
Execute a bash command in the sandbox container.
working_dir
str | None
default:None
Working directory for command execution. Defaults to the container's default directory.
Returns: str - Combined stdout and stderr output.
Output format:
- stdout content
- stderr content prefixed with "stderr:" (if any)
- "(no output)" if the command produced no output
- "Error: Command timed out after Ns" on timeout
The sandbox_id and sandbox_state parameters are hidden from the model and injected automatically via update_tool_args().
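The output contract above can be summarized with a small sketch (format_output is a hypothetical helper for illustration; the actual formatting lives inside SandboxEnv):

```python
def format_output(stdout: str, stderr: str) -> str:
    """Combine stdout and stderr into the documented output format."""
    # stdout first, stderr prefixed, and a "(no output)" placeholder
    # when both streams are empty.
    parts = []
    if stdout.strip():
        parts.append(stdout.rstrip())
    if stderr.strip():
        parts.append("stderr: " + stderr.rstrip())
    return "\n".join(parts) if parts else "(no output)"
```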
Core Methods
setup_state
async def setup_state(state: vf.State, **kwargs) -> vf.State
Create a sandbox container for this rollout. Override to customize initialization.
State keys added:
state["sandbox_id"]: Unique sandbox identifier
state["sandbox_state"]: Sandbox metadata (ready status, timing)
get_sandbox_request
def get_sandbox_request(state: vf.State) -> CreateSandboxRequest
Return the sandbox creation request for this rollout. Override to customize per-state configuration.
Returns: CreateSandboxRequest - Sandbox configuration.
post_rollout
async def post_rollout(state: vf.State)
Run custom logic after rollout completes but before sandbox destruction. Override to cache results from the sandbox into state.
state
vf.State
Final rollout state (can be modified).
update_tool_args
def update_tool_args(
    tool_name: str,
    tool_args: dict[str, Any],
    messages: vf.Messages,
    state: vf.State,
    **kwargs
) -> dict[str, Any]
Inject sandbox_id, sandbox_state, and working_dir into bash tool calls. Implemented by SandboxEnv.
Cleanup Methods
destroy_sandbox
@vf.cleanup
async def destroy_sandbox(state: vf.State)
Delete the sandbox after rollout completion. Runs automatically as a cleanup handler.
teardown_sandboxes
@vf.teardown
async def teardown_sandboxes()
Delete all remaining sandboxes during environment shutdown. Runs automatically on exit.
bulk_delete_sandboxes
async def bulk_delete_sandboxes(global_ids: list[str]) -> None
Delete multiple sandboxes by their global IDs in a single operation.
global_ids
list[str]
List of sandbox IDs to delete.
State Management
SandboxEnv adds sandbox-specific state:
state["sandbox_id"] = str  # Unique sandbox identifier
state["sandbox_state"] = {
    "ready": bool,                    # Whether sandbox is ready for commands
    "ready_wait_time": float,         # Time spent waiting for creation
    "command_execution_times": list,  # Duration of each command in seconds
}
Built-in Rubric
SandboxEnv includes SandboxMonitorRubric which tracks:
sandbox_ready_wait_time: Time spent waiting for sandbox creation
sandbox_command_execution_time: Average command execution time
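As a sketch of how these metrics relate to the state keys above (assumed computation; the rubric's exact implementation may differ):

```python
def sandbox_command_execution_time(state: dict) -> float:
    """Average command duration from the timings SandboxEnv records in state."""
    times = state["sandbox_state"]["command_execution_times"]
    return sum(times) / len(times) if times else 0.0

state = {"sandbox_state": {"command_execution_times": [0.2, 0.4, 0.6]}}
# Average of the three recorded durations: 0.4 seconds.
```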
Example Usage
Basic File Operations
import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {
                "task": "Create a file called hello.txt with 'Hello, World!'",
            },
        ]
    )

    def file_created(completion: vf.Messages) -> float:
        """Check if file creation was mentioned."""
        text = str(completion).lower()
        return 1.0 if "hello.txt" in text else 0.0

    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(file_created),
        system_prompt="Use bash commands to complete the task.",
        docker_image="ubuntu:22.04",
        max_turns=5,
    )
Custom Docker Image
import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Install and run nginx"},
        ]
    )
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Set up and configure the web server.",
        docker_image="ubuntu:22.04",
        cpu_cores=2,
        memory_gb=4,
        max_turns=10,
    )
With Environment Variables
import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Use the API_KEY environment variable to authenticate"},
        ]
    )
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Access the API using the provided credentials.",
        environment_vars={
            "API_KEY": "test-key-123",
            "API_URL": "https://api.example.com",
        },
        max_turns=5,
    )
GPU-Enabled Sandbox
import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [
            {"task": "Run a CUDA program to verify GPU access"},
        ]
    )
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Verify GPU availability and run CUDA code.",
        docker_image="nvidia/cuda:12.0-base-ubuntu22.04",
        gpu_count=1,
        memory_gb=8,
        max_turns=10,
    )
Custom Start Command
import verifiers as vf

def load_environment():
    # Start a web server on container startup
    start_cmd = """
    bash -c '
    apt-get update && apt-get install -y python3-pip
    pip3 install flask
    python3 -c "from flask import Flask; app = Flask(__name__); app.run(port=8000)" &
    tail -f /dev/null
    '
    """
    dataset = vf.Environment.make_dataset(
        [{"task": "Test the Flask server running on port 8000"}]
    )
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        system_prompt="Interact with the running Flask server.",
        start_command=start_cmd,
        max_startup_wait_seconds=60,
        max_turns=10,
    )
Per-Task Sandbox Configuration
import verifiers as vf
from prime_sandboxes import CreateSandboxRequest

class CustomSandboxEnv(vf.SandboxEnv):
    def get_sandbox_request(self, state: vf.State) -> CreateSandboxRequest:
        """Customize sandbox based on task requirements."""
        task = state["input"]
        # Use GPU for ML tasks
        if "machine learning" in task.get("description", "").lower():
            request = self.sandbox_request.model_copy()
            request.gpu_count = 1
            request.memory_gb = 8
            return request
        # Default config for other tasks
        return self.sandbox_request.model_copy()

def load_environment():
    dataset = vf.Environment.make_dataset([
        {"description": "Train a machine learning model"},
        {"description": "Process text files"},
    ])
    return CustomSandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
    )
Caching Sandbox Results
import verifiers as vf

class ResultCachingEnv(vf.SandboxEnv):
    async def post_rollout(self, state: vf.State):
        """Cache results before sandbox is destroyed."""
        # Read files from sandbox and cache in state
        sandbox_id = state["sandbox_id"]
        sandbox_state = state["sandbox_state"]
        # Get file contents
        result = await self.bash(
            "cat /tmp/results.json",
            sandbox_id,
            sandbox_state,
        )
        # Cache in state for reward functions
        state["cached_results"] = result

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Generate results.json with analysis"}]
    )

    def check_results(state: vf.State) -> float:
        """Use cached results in reward function."""
        results = state.get("cached_results", "")
        return 1.0 if "analysis" in results else 0.0

    return ResultCachingEnv(
        dataset=dataset,
        rubric=vf.Rubric(check_results),
    )
With Retry Configuration
import verifiers as vf

def load_environment():
    dataset = vf.Environment.make_dataset(
        [{"task": "Flaky network operation"}]
    )
    return vf.SandboxEnv(
        dataset=dataset,
        rubric=vf.Rubric(lambda c: 1.0),
        # Custom retry settings
        max_retries=10,
        base_delay=1.0,
        backoff_factor=3.0,
        max_backoff_seconds=60.0,
        max_turns=5,
    )
Error Handling
Error Types
vf.SandboxError: Base class for sandbox errors
SandboxCreationError: Failed to create the sandbox
SandboxNotReadyError: Sandbox failed to become ready
The subclasses inherit from vf.SandboxError, which is included in stop_errors by default.
Command Timeouts
Commands that exceed timeout_per_command_seconds return an error message:
"Error: Command timed out after 30s"
The timeout is logged but does not raise an exception, allowing the model to retry or adjust.
Implementation Details
Lazy Initialization
Sandboxes are created during setup_state() but initialization is lazy:
- Container creation is queued asynchronously
- The first bash() call awaits container readiness
- Subsequent calls execute immediately
This overlaps provisioning with other rollout setup.
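The pattern is the usual "start early, await late" idiom. A standalone asyncio sketch (a generic illustration, not SandboxEnv's actual code):

```python
import asyncio

class LazyResource:
    def __init__(self):
        # Queue provisioning immediately; don't block the caller.
        self._ready = asyncio.ensure_future(self._provision())

    async def _provision(self):
        await asyncio.sleep(0.01)  # stands in for container creation
        return "container-ready"

    async def run(self, command: str) -> str:
        # First call awaits readiness; later calls return instantly
        # because the future is already resolved.
        status = await self._ready
        return f"{status}: ran {command}"

async def main():
    res = LazyResource()  # provisioning starts in the background
    # ... other rollout setup overlaps with provisioning here ...
    return await res.run("echo hi")

print(asyncio.run(main()))  # prints "container-ready: ran echo hi"
```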
Cleanup Guarantees
Sandboxes are cleaned up via multiple mechanisms:
- The @vf.cleanup handler runs after each rollout
- The @vf.teardown handler runs on environment shutdown
- Sandboxes auto-destroy after timeout_minutes
Bulk Operations
Use bulk_delete_sandboxes() to delete multiple sandboxes efficiently:
env = vf.SandboxEnv(...)
# ... create sandboxes ...
await env.bulk_delete_sandboxes(["sandbox-1", "sandbox-2", "sandbox-3"])
Batches of up to 100 sandboxes are deleted per API call.
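The batching behavior can be sketched as simple list chunking (an illustrative helper, assuming the 100-item-per-call limit stated above):

```python
def chunk_ids(global_ids: list[str], batch_size: int = 100) -> list[list[str]]:
    """Split a flat list of sandbox IDs into per-API-call batches."""
    return [global_ids[i:i + batch_size]
            for i in range(0, len(global_ids), batch_size)]

# 250 IDs -> three API calls: batches of 100, 100, and 50.
```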
When to Use
Use SandboxEnv for:
- Code execution in isolated environments
- System administration tasks
- File system operations
- Multi-language environments
- Security-sensitive operations
Use PythonEnv for:
- Python-specific REPL workflows
- Persistent Python state across executions
- Scientific computing tasks
Use StatefulToolEnv directly for:
- Non-sandbox stateful resources (databases, APIs)
- Custom state injection patterns
See Also