CooperBench supports multiple execution backends for running agent tasks and evaluations. Each backend provides isolated sandbox environments.

Available backends

  • modal (default) - Serverless containers via Modal.com
  • docker - Local Docker containers
  • gcp - Google Cloud Platform Batch jobs

Using backends

Specify backend in run()

from cooperbench import run

# Use Modal (default)
run(
    run_name="modal_run",
    subset="lite",
    backend="modal",
)

# Use Docker
run(
    run_name="docker_run",
    subset="lite",
    backend="docker",
)

# Use GCP Batch
run(
    run_name="gcp_run",
    subset="lite",
    backend="gcp",
)

Specify backend in evaluate()

from cooperbench import evaluate

# Evaluate using Modal
evaluate(
    run_name="my_experiment",
    backend="modal",
)

# Evaluate using Docker (local)
evaluate(
    run_name="my_experiment",
    backend="docker",
    concurrency=5,  # Lower concurrency for local resources
)

# Evaluate using GCP Batch (high-scale)
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Backend comparison

| Feature     | Modal                     | Docker                   | GCP Batch              |
|-------------|---------------------------|--------------------------|------------------------|
| Setup       | Requires Modal account    | Local Docker only        | GCP project required   |
| Speed       | Fast startup              | Instant                  | Slower startup         |
| Concurrency | High (100+)               | Limited by local CPU     | Very high (1000+)      |
| Cost        | Pay per second            | Free (local)             | Pay per hour           |
| Best for    | Development, medium scale | Local testing, debugging | Large-scale evaluation |

Backend API

All backends implement the EvalBackend protocol:
class EvalBackend(Protocol):
    """Backend for creating evaluation sandboxes."""

    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        """Create a new sandbox for evaluation.

        Args:
            image: Docker image name
            timeout: Maximum runtime in seconds
            workdir: Working directory inside container

        Returns:
            Sandbox instance
        """
        ...

Sandbox interface

class Sandbox(Protocol):
    """Abstract sandbox for running commands."""

    def exec(self, *args: str) -> ExecResult:
        """Execute a command.

        Args:
            *args: Command and arguments (e.g., "bash", "-c", "echo hello")

        Returns:
            ExecResult with returncode and output
        """
        ...

    def terminate(self) -> None:
        """Clean up and terminate the sandbox."""
        ...
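The protocol exposes `terminate()` but no context-manager support, so callers must remember the try/finally pattern shown earlier. A small wrapper (hypothetical, not part of CooperBench) can enforce cleanup for any backend that satisfies the protocol:

```python
from contextlib import contextmanager

@contextmanager
def managed_sandbox(backend, image, timeout=600, workdir="/workspace"):
    # Create the sandbox up front, hand it to the caller, and guarantee
    # terminate() runs even if the body raises.
    sandbox = backend.create_sandbox(image=image, timeout=timeout, workdir=workdir)
    try:
        yield sandbox
    finally:
        sandbox.terminate()
```

With this helper, `with managed_sandbox(backend, "my-image") as sb: sb.exec(...)` replaces the explicit try/finally block.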

ExecResult structure

class ExecResult(Protocol):
    """Result of executing a command."""

    @property
    def returncode(self) -> int:
        """Exit code of the command."""
        ...

    def stdout_read(self) -> str:
        """Read stdout output."""
        ...

    def stderr_read(self) -> str:
        """Read stderr output."""
        ...
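A common consumption pattern is to fail loudly on a nonzero exit code while surfacing stderr. A convenience wrapper along these lines (illustrative, not a CooperBench API) works against anything implementing the `ExecResult` protocol:

```python
def check_result(result) -> str:
    # Raise with stderr attached when the command failed; otherwise
    # return stdout so callers can chain on success.
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr_read()}"
        )
    return result.stdout_read()
```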

Using backends programmatically

Get backend instance

from cooperbench.eval.backends import get_backend

# Get Modal backend
modal_backend = get_backend("modal")

# Get Docker backend
docker_backend = get_backend("docker")

# Get GCP backend
gcp_backend = get_backend("gcp")

Create and use sandbox

from cooperbench.eval.backends import get_backend

# Create a sandbox
backend = get_backend("modal")
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=600,
    workdir="/workspace",
)

try:
    # Run commands
    result = sandbox.exec("bash", "-c", "python --version")
    print(f"Exit code: {result.returncode}")
    print(f"Output: {result.stdout_read()}")

    # Apply a patch
    result = sandbox.exec("git", "apply", "agent.patch")

    # Run tests
    result = sandbox.exec("pytest", "tests/")
    print(f"Tests {'passed' if result.returncode == 0 else 'failed'}")
finally:
    sandbox.terminate()

Modal backend

Setup

  1. Install Modal:
pip install modal
  2. Authenticate:
modal token new
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="modal_test",
    subset="lite",
    backend="modal",
)

Features

  • Serverless execution (no infrastructure to manage)
  • Fast cold starts (typically under 10 seconds)
  • Auto-scaling based on concurrency
  • Pay-per-second billing

Configuration

Modal is configured via environment variables:
export MODAL_TOKEN_ID="your-token-id"
export MODAL_TOKEN_SECRET="your-token-secret"
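Since missing credentials only surface once a run starts, a preflight check can fail fast with a clear message. This is an illustrative sketch (not a CooperBench API), using the two variables above:

```python
import os

def check_modal_credentials() -> None:
    # Collect any Modal credential variables that are unset or empty.
    missing = [
        var for var in ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")
        if not os.environ.get(var)
    ]
    if missing:
        raise RuntimeError(f"Missing Modal credentials: {', '.join(missing)}")
```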

Docker backend

Setup

  1. Install Docker:
# See https://docs.docker.com/get-docker/
  2. Pull required images:
docker pull cooperbench/llama_index_task:task1
docker pull cooperbench/django_task:task5
# etc.
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="docker_test",
    subset="lite",
    backend="docker",
    concurrency=5,  # Adjust based on your CPU
)

Features

  • Runs locally (no internet required)
  • No additional costs
  • Full control over environment
  • Good for debugging

Configuration

from cooperbench.eval.backends.docker import DockerBackend

# Create Docker backend with custom settings
backend = DockerBackend()
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=300,
    workdir="/workspace",
)

GCP Batch backend

Setup

  1. Install GCP SDK:
pip install google-cloud-batch google-cloud-storage
  2. Authenticate:
gcloud auth application-default login
  3. Set project:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"
  4. Use in CooperBench:
from cooperbench import evaluate

# GCP is best for large-scale evaluation
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Features

  • Massive parallelism (1000+ concurrent tasks)
  • Batch job optimization (single VM startup for many tasks)
  • Cost-effective for large-scale runs
  • Auto-cleanup of resources

Batch evaluation

For GCP, evaluation uses batch mode by default:
from cooperbench import evaluate

# Submits all tasks as a single batch job
evaluate(
    run_name="my_experiment",
    backend="gcp",
    concurrency=200,  # Tasks run in parallel within the batch
)
Batch mode is more efficient because:
  • Single VM startup amortized across all tasks
  • Tasks run in parallel on the VM
  • Automatic cleanup after completion
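The amortization argument can be made concrete with a back-of-the-envelope calculation. The numbers below are assumptions for the sketch, not measured CooperBench figures:

```python
STARTUP_S = 30    # assumed per-VM startup time in seconds
TASKS = 200       # tasks in the evaluation run

per_task_overhead = STARTUP_S * TASKS  # if every task booted its own VM
batch_overhead = STARTUP_S             # one VM startup shared by all tasks

print(f"per-task: {per_task_overhead}s, batched: {batch_overhead}s")
```

At these assumed numbers, batching turns 100 minutes of cumulative startup overhead into a single 30-second boot.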

Configuration

from cooperbench.eval.backends.gcp import GCPBatchBackend

backend = GCPBatchBackend(
    project_id="your-project",
    region="us-central1",
    machine_type="n1-standard-4",
)
Environment variables:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"  # Optional, defaults to us-central1
export GCP_MACHINE_TYPE="n1-standard-4"  # Optional

Advanced usage

Custom backend implementation

You can implement custom backends:
from cooperbench.eval.backends.base import EvalBackend, Sandbox, ExecResult
from dataclasses import dataclass

@dataclass
class MyExecResult:
    returncode: int
    _stdout: str
    _stderr: str

    def stdout_read(self) -> str:
        return self._stdout

    def stderr_read(self) -> str:
        return self._stderr

class MySandbox:
    def __init__(self, image: str, timeout: int, workdir: str):
        self.image = image
        self.timeout = timeout
        self.workdir = workdir
        # Initialize your sandbox

    def exec(self, *args: str) -> ExecResult:
        # Execute command in your sandbox
        return MyExecResult(
            returncode=0,
            _stdout="Command output",
            _stderr="",
        )

    def terminate(self) -> None:
        # Clean up
        pass

class MyBackend:
    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        return MySandbox(image, timeout, workdir)

Use custom backend

# Monkey-patch the backend getter
from cooperbench.eval import backends

original_get_backend = backends.get_backend

def custom_get_backend(name: str):
    if name == "my_backend":
        return MyBackend()
    return original_get_backend(name)

backends.get_backend = custom_get_backend

# Now use it
from cooperbench import run

run(
    run_name="custom_backend_test",
    subset="lite",
    backend="my_backend",
)

Direct sandbox usage

from cooperbench.eval.backends import get_backend

backend = get_backend("modal")

# Create sandbox
sandbox = backend.create_sandbox(
    image="python:3.11-slim",
    timeout=300,
)

try:
    # Install dependencies
    sandbox.exec("pip", "install", "requests")

    # Run your code
    result = sandbox.exec(
        "python",
        "-c",
        "import requests; print(requests.__version__)",
    )

    print(result.stdout_read())
finally:
    sandbox.terminate()

Best practices

Choose the right backend

  • Development: Use Modal for fast iteration
  • Debugging: Use Docker for local control
  • Large-scale: Use GCP for cost-effective parallelism
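The guidance above can be encoded as a tiny dispatch helper if you select backends programmatically. This is a hypothetical convenience, not part of the CooperBench API:

```python
def choose_backend(use_case: str) -> str:
    # Map a use case to the backend string accepted by run()/evaluate().
    return {
        "development": "modal",
        "debugging": "docker",
        "large_scale": "gcp",
    }[use_case]
```

For example, `run(run_name="r", subset="lite", backend=choose_backend("debugging"))` picks the Docker backend.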

Optimize concurrency

# Modal: high concurrency works well
run(
    run_name="modal_run",
    subset="full",
    backend="modal",
    concurrency=100,
)

# Docker: adjust based on CPU cores
import multiprocessing
cpu_count = multiprocessing.cpu_count()

run(
    run_name="docker_run",
    subset="full",
    backend="docker",
    concurrency=cpu_count - 1,  # Leave one core free
)

# GCP: very high concurrency for evaluation
evaluate(
    run_name="gcp_eval",
    backend="gcp",
    concurrency=200,
)

Handle timeouts

# Increase timeout for complex tasks
run(
    run_name="complex_tasks",
    subset="lite",
    backend="modal",
    # Agent timeout configured in agent settings
)

# For evaluation, timeout is per-test
from cooperbench import run_patch_test

result = run_patch_test(
    repo_name="llama_index_task",
    task_id=1,
    feature_id=1,
    agent_patch="path/to/patch",
    timeout=1200,  # 20 minutes
    backend="modal",
)