CooperBench supports multiple execution backends for running agent tasks and evaluations. Each backend provides isolated sandbox environments.

Available backends

  • modal (default) - Serverless containers via Modal.com
  • docker - Local Docker containers
  • gcp - Google Cloud Platform Batch jobs

Using backends

Specify backend in run()

from cooperbench import run

# Use Modal (default)
run(
    run_name="modal_run",
    subset="lite",
    backend="modal",
)

# Use Docker
run(
    run_name="docker_run",
    subset="lite",
    backend="docker",
)

# Use GCP Batch
run(
    run_name="gcp_run",
    subset="lite",
    backend="gcp",
)

Specify backend in evaluate()

from cooperbench import evaluate

# Evaluate using Modal
evaluate(
    run_name="my_experiment",
    backend="modal",
)

# Evaluate using Docker (local)
evaluate(
    run_name="my_experiment",
    backend="docker",
    concurrency=5,  # Lower concurrency for local resources
)

# Evaluate using GCP Batch (high-scale)
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Backend comparison

| Feature     | Modal                     | Docker                   | GCP Batch              |
|-------------|---------------------------|--------------------------|------------------------|
| Setup       | Requires Modal account    | Local Docker only        | GCP project required   |
| Speed       | Fast startup              | Instant                  | Slower startup         |
| Concurrency | High (100+)               | Limited by local CPU     | Very high (1000+)      |
| Cost        | Pay per second            | Free (local)             | Pay per hour           |
| Best for    | Development, medium scale | Local testing, debugging | Large-scale evaluation |

Backend API

All backends implement the EvalBackend protocol:
class EvalBackend(Protocol):
    """Backend for creating evaluation sandboxes."""

    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        """Create a new sandbox for evaluation.

        Args:
            image: Docker image name
            timeout: Maximum runtime in seconds
            workdir: Working directory inside container

        Returns:
            Sandbox instance
        """
        ...

Sandbox interface

class Sandbox(Protocol):
    """Abstract sandbox for running commands."""

    def exec(self, *args: str) -> ExecResult:
        """Execute a command.

        Args:
            *args: Command and arguments (e.g., "bash", "-c", "echo hello")

        Returns:
            ExecResult with returncode and output
        """
        ...

    def terminate(self) -> None:
        """Clean up and terminate the sandbox."""
        ...
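The protocol exposes `terminate()` but no context-manager support, so callers must remember the try/finally pattern shown earlier. A small wrapper (hypothetical, not part of CooperBench) can enforce cleanup for any backend that satisfies the protocol:

```python
from contextlib import contextmanager

@contextmanager
def managed_sandbox(backend, image, timeout=600, workdir="/workspace"):
    # Create the sandbox up front, hand it to the caller, and guarantee
    # terminate() runs even if the body raises.
    sandbox = backend.create_sandbox(image=image, timeout=timeout, workdir=workdir)
    try:
        yield sandbox
    finally:
        sandbox.terminate()
```

With this helper, `with managed_sandbox(backend, "my-image") as sb: sb.exec(...)` replaces the explicit try/finally block.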

ExecResult structure

class ExecResult(Protocol):
    """Result of executing a command."""

    @property
    def returncode(self) -> int:
        """Exit code of the command."""
        ...

    def stdout_read(self) -> str:
        """Read stdout output."""
        ...

    def stderr_read(self) -> str:
        """Read stderr output."""
        ...
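A common consumption pattern is to fail loudly on a nonzero exit code while surfacing stderr. A convenience wrapper along these lines (illustrative, not a CooperBench API) works against anything implementing the `ExecResult` protocol:

```python
def check_result(result) -> str:
    # Raise with stderr attached when the command failed; otherwise
    # return stdout so callers can chain on success.
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr_read()}"
        )
    return result.stdout_read()
```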

Using backends programmatically

Get backend instance

from cooperbench.eval.backends import get_backend

# Get Modal backend
modal_backend = get_backend("modal")

# Get Docker backend
docker_backend = get_backend("docker")

# Get GCP backend
gcp_backend = get_backend("gcp")

Create and use sandbox

from cooperbench.eval.backends import get_backend

# Create a sandbox
backend = get_backend("modal")
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=600,
    workdir="/workspace",
)

try:
    # Run commands
    result = sandbox.exec("bash", "-c", "python --version")
    print(f"Exit code: {result.returncode}")
    print(f"Output: {result.stdout_read()}")

    # Apply a patch
    result = sandbox.exec("git", "apply", "agent.patch")

    # Run tests
    result = sandbox.exec("pytest", "tests/")
    print(f"Tests {'passed' if result.returncode == 0 else 'failed'}")
finally:
    sandbox.terminate()

Modal backend

Setup

  1. Install Modal:
pip install modal
  2. Authenticate:
modal token new
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="modal_test",
    subset="lite",
    backend="modal",
)

Features

  • Serverless execution (no infrastructure to manage)
  • Fast cold starts (typically under 10 seconds)
  • Auto-scaling based on concurrency
  • Pay-per-second billing

Configuration

Modal is configured via environment variables:
export MODAL_TOKEN_ID="your-token-id"
export MODAL_TOKEN_SECRET="your-token-secret"
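Since missing credentials only surface once a run starts, a preflight check can fail fast with a clear message. This is an illustrative sketch (not a CooperBench API), using the two variables above:

```python
import os

def check_modal_credentials() -> None:
    # Collect any Modal credential variables that are unset or empty.
    missing = [
        var for var in ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")
        if not os.environ.get(var)
    ]
    if missing:
        raise RuntimeError(f"Missing Modal credentials: {', '.join(missing)}")
```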

Docker backend

Setup

  1. Install Docker:
# See https://docs.docker.com/get-docker/
  2. Pull required images:
docker pull cooperbench/llama_index_task:task1
docker pull cooperbench/django_task:task5
# etc.
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="docker_test",
    subset="lite",
    backend="docker",
    concurrency=5,  # Adjust based on your CPU
)

Features

  • Runs locally (no internet required)
  • No additional costs
  • Full control over environment
  • Good for debugging

Configuration

from cooperbench.eval.backends.docker import DockerBackend

# Create Docker backend with custom settings
backend = DockerBackend()
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=300,
    workdir="/workspace",
)

GCP Batch backend

Setup

  1. Install GCP SDK:
pip install google-cloud-batch google-cloud-storage
  2. Authenticate:
gcloud auth application-default login
  3. Set project:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"
  4. Use in CooperBench:
from cooperbench import evaluate

# GCP is best for large-scale evaluation
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Features

  • Massive parallelism (1000+ concurrent tasks)
  • Batch job optimization (single VM startup for many tasks)
  • Cost-effective for large-scale runs
  • Auto-cleanup of resources

Batch evaluation

For GCP, evaluation uses batch mode by default:
from cooperbench import evaluate

# Submits all tasks as a single batch job
evaluate(
    run_name="my_experiment",
    backend="gcp",
    concurrency=200,  # Tasks run in parallel within the batch
)
Batch mode is more efficient because:
  • Single VM startup amortized across all tasks
  • Tasks run in parallel on the VM
  • Automatic cleanup after completion
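The amortization argument can be made concrete with a back-of-the-envelope calculation. The numbers below are assumptions for the sketch, not measured CooperBench figures:

```python
STARTUP_S = 30    # assumed per-VM startup time in seconds
TASKS = 200       # tasks in the evaluation run

per_task_overhead = STARTUP_S * TASKS  # if every task booted its own VM
batch_overhead = STARTUP_S             # one VM startup shared by all tasks

print(f"per-task: {per_task_overhead}s, batched: {batch_overhead}s")
```

At these assumed numbers, batching turns 100 minutes of cumulative startup overhead into a single 30-second boot.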

Configuration

from cooperbench.eval.backends.gcp import GCPBatchBackend

backend = GCPBatchBackend(
    project_id="your-project",
    region="us-central1",
    machine_type="n1-standard-4",
)
Environment variables:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"  # Optional, defaults to us-central1
export GCP_MACHINE_TYPE="n1-standard-4"  # Optional

Advanced usage

Custom backend implementation

You can implement custom backends:
from cooperbench.eval.backends.base import EvalBackend, Sandbox, ExecResult
from dataclasses import dataclass

@dataclass
class MyExecResult:
    returncode: int
    _stdout: str
    _stderr: str

    def stdout_read(self) -> str:
        return self._stdout

    def stderr_read(self) -> str:
        return self._stderr

class MySandbox:
    def __init__(self, image: str, timeout: int, workdir: str):
        self.image = image
        self.timeout = timeout
        self.workdir = workdir
        # Initialize your sandbox

    def exec(self, *args: str) -> ExecResult:
        # Execute command in your sandbox
        return MyExecResult(
            returncode=0,
            _stdout="Command output",
            _stderr="",
        )

    def terminate(self) -> None:
        # Clean up
        pass

class MyBackend:
    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        return MySandbox(image, timeout, workdir)

Use custom backend

# Monkey-patch the backend getter
from cooperbench.eval import backends

original_get_backend = backends.get_backend

def custom_get_backend(name: str):
    if name == "my_backend":
        return MyBackend()
    return original_get_backend(name)

backends.get_backend = custom_get_backend

# Now use it
from cooperbench import run

run(
    run_name="custom_backend_test",
    subset="lite",
    backend="my_backend",
)

Direct sandbox usage

from cooperbench.eval.backends import get_backend

backend = get_backend("modal")

# Create sandbox
sandbox = backend.create_sandbox(
    image="python:3.11-slim",
    timeout=300,
)

try:
    # Install dependencies
    sandbox.exec("pip", "install", "requests")

    # Run your code
    result = sandbox.exec(
        "python",
        "-c",
        "import requests; print(requests.__version__)",
    )

    print(result.stdout_read())
finally:
    sandbox.terminate()

Best practices

Choose the right backend

  • Development: Use Modal for fast iteration
  • Debugging: Use Docker for local control
  • Large-scale: Use GCP for cost-effective parallelism
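The guidance above can be encoded as a tiny dispatch helper if you select backends programmatically. This is a hypothetical convenience, not part of the CooperBench API:

```python
def choose_backend(use_case: str) -> str:
    # Map a use case to the backend string accepted by run()/evaluate().
    return {
        "development": "modal",
        "debugging": "docker",
        "large_scale": "gcp",
    }[use_case]
```

For example, `run(run_name="r", subset="lite", backend=choose_backend("debugging"))` picks the Docker backend.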

Optimize concurrency

# Modal: high concurrency works well
run(
    run_name="modal_run",
    subset="full",
    backend="modal",
    concurrency=100,
)

# Docker: adjust based on CPU cores
import multiprocessing
cpu_count = multiprocessing.cpu_count()

run(
    run_name="docker_run",
    subset="full",
    backend="docker",
    concurrency=cpu_count - 1,  # Leave one core free
)

# GCP: very high concurrency for evaluation
evaluate(
    run_name="gcp_eval",
    backend="gcp",
    concurrency=200,
)

Handle timeouts

# Increase timeout for complex tasks
run(
    run_name="complex_tasks",
    subset="lite",
    backend="modal",
    # Agent timeout configured in agent settings
)

# For evaluation, timeout is per-test
from cooperbench import run_patch_test

result = run_patch_test(
    repo_name="llama_index_task",
    task_id=1,
    feature_id=1,
    agent_patch="path/to/patch",
    timeout=1200,  # 20 minutes
    backend="modal",
)