CooperBench is designed as a modular system that can execute agent tasks across different backends while maintaining consistent evaluation standards.

High-level architecture

Core components

Task runner

Orchestrates task execution, manages concurrency, and tracks results

Execution backends

Provide isolated sandboxes for agent execution (Modal, GCP, Docker)

Communication layer

Redis-based messaging for inter-agent communication

Evaluation pipeline

Tests merged patches and computes success metrics

Execution backends

CooperBench supports three execution backends, each with different tradeoffs:

Backend comparison

| Feature | Modal | GCP | Docker |
| --- | --- | --- | --- |
| Setup complexity | Low | Medium | Low |
| Concurrency | High (100+) | High (100+) | Low (CPU-bound) |
| Cost | Usage-based | VM-based | Free (local) |
| Cold start | ~5-10s | ~30-60s | ~2-5s |
| Internet required | Yes | Yes | No |
| Best for | Development, medium scale | Production, large scale | Local dev, debugging |

Agent execution pipeline

When a task runs, CooperBench follows this execution flow:
1. Task discovery

# Discover tasks based on filters
tasks = discover_tasks(
    subset="lite",
    repo_filter="llama_index_task",
    task_filter=None,
    features_filter=None
)
# Returns: [{"repo": "...", "task_id": 123, "features": [1, 2]}]
2. Infrastructure setup

  • Redis: Start or connect to messaging server
  • Git server (if enabled): Create shared repository
  • Namespacing: Create unique run ID for isolation
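The namespacing step can be sketched as follows. This is a minimal illustration, not CooperBench's actual implementation; `make_run_id` is a hypothetical helper.

```python
import uuid

def make_run_id() -> str:
    # Short random hex keeps Redis keys compact while remaining
    # collision-resistant across concurrent runs
    return uuid.uuid4().hex[:8]

run_id = make_run_id()
namespace = f"run:{run_id}"            # every Redis key for this run shares the prefix
agent_channel = f"{namespace}:agent1"  # per-agent channel under the namespace
```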
3. Sandbox initialization

For each agent:
  • Pull task-specific Docker image
  • Mount dataset files
  • Configure environment variables
  • Set up git remote (if enabled)
  • Initialize Redis connection
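The per-agent configuration above could be assembled roughly like this. The variable names (`COMM_URL`, `AGENT_ID`, `GIT_SERVER_URL`) are assumptions for illustration; the real sandbox environment may use different keys.

```python
from typing import Optional

def build_sandbox_env(run_id: str, agent_id: str,
                      redis_url: str, git_url: Optional[str]) -> dict:
    # Hypothetical environment keys -- shown only to illustrate the setup flow
    env = {
        "COMM_URL": f"{redis_url}#run:{run_id}",  # messaging endpoint + namespace
        "AGENT_ID": agent_id,
    }
    if git_url:  # git remote is only configured when collaboration mode is enabled
        env["GIT_SERVER_URL"] = git_url
    return env

env = build_sandbox_env("abc123", "agent1",
                        "redis://localhost:6379", "git://git-server:9418")
```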
4. Agent execution

# Load agent framework
runner = get_runner("mini_swe_agent")

# Execute task
result = runner.run(
    task=feature_description,
    image="cooperbench-llama-index-123",
    agent_id="agent1",
    model_name="gpt-4o",
    comm_url="redis://localhost:6379#run:abc123",
    git_server_url="git://git-server:9418",
)
5. Patch extraction

# Extract changes from agent's workspace
git diff HEAD > agent1.patch
6. Result aggregation

  • Collect patches from all agents
  • Extract conversation messages
  • Compute cost and token metrics
  • Save trajectories and logs
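The cost and token aggregation can be sketched as a simple reduction over per-agent records. This is an illustrative helper, not CooperBench's internal code; the field names mirror the result JSON shown later on this page.

```python
def aggregate_results(agents: dict) -> dict:
    # Sum per-agent cost and step counts into run-level totals
    return {
        "total_cost": sum(a["cost"] for a in agents.values()),
        "total_steps": sum(a["steps"] for a in agents.values()),
        "agents": agents,
    }

summary = aggregate_results({
    "agent1": {"cost": 0.23, "steps": 12},
    "agent2": {"cost": 0.22, "steps": 11},
})
# summary["total_steps"] == 23
```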

Redis messaging system

CooperBench uses Redis for real-time agent communication:

Architecture

Message flow

  1. Namespacing: Each run gets unique namespace run:{run_id}
  2. Channels: Per-agent channels run:{run_id}:{agent_id}
  3. Publishing: Agent sends message via send_message command
  4. Subscription: Agents poll for new messages
  5. Delivery: Messages appear in agent’s context as user messages
Example:
# Agent 1 publishes
redis.publish(
    "run:abc123:agent2",
    json.dumps({"from": "agent1", "message": "Starting feature 1"})
)

# Agent 2 receives (polled every N steps)
messages = redis.lrange("run:abc123:agent2:inbox", 0, -1)
# Appears in context as:
# "[Message from agent1]: Starting feature 1"

Configuration

# Use local Redis
cooperbench run -n exp --redis redis://localhost:6379

# Use remote Redis
cooperbench run -n exp --redis redis://cloud.redis.com:6379

# Auto-start Redis via Docker
cooperbench run -n exp  # detects and starts if needed

# Disable messaging
cooperbench run -n exp --no-messaging

Git collaboration mode

Optional git-based code sharing for agents:

Architecture

How it works

1. Server creation

# Create git server (per task)
git_server = create_git_server(
    backend="modal",  # or "gcp", "docker"
    run_id="abc123"
)
# Returns: url="git://server:9418"
2. Agent setup

# Configure remote in agent sandbox
git remote add team git://server:9418
git checkout -b agent1
git push -u team agent1
3. Collaboration

Agents can use standard git commands:
# Push changes
git add .
git commit -m "Implement feature"
git push team agent1

# Fetch teammate's work
git fetch team
git branch -r  # see team/agent2

# Merge changes
git merge team/agent2
4. Cleanup

# Automatically cleaned up after task
git_server.cleanup()

Backend-specific implementation

Evaluation pipeline

After agents complete tasks, patches are evaluated:

Evaluation flow

Evaluation steps

1. Patch loading

# Load agent patches
patch1 = Path("logs/.../agent1.patch").read_text()
patch2 = Path("logs/.../agent2.patch").read_text()

# Load test patches
tests1 = Path("dataset/.../feature1/tests.patch").read_text()
tests2 = Path("dataset/.../feature2/tests.patch").read_text()
2. Sandbox creation

Create isolated test environment:
  • Pull task Docker image
  • Clone repository at correct commit
  • Run setup script
3. Patch application

# Apply agent patches
git apply agent1.patch
git apply agent2.patch  # may conflict

# Apply test patches
git apply tests1.patch
git apply tests2.patch
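One coarse way to anticipate the "may conflict" case is to compare which files the two diffs touch. This is a simplified heuristic for illustration only (a real conflict depends on the exact hunks, which `git apply` resolves); both helper functions are hypothetical.

```python
def touched_files(patch: str) -> set:
    # Files modified by a unified diff, read from its "+++ b/" headers
    return {line[6:] for line in patch.splitlines() if line.startswith("+++ b/")}

def may_conflict(patch_a: str, patch_b: str) -> bool:
    # Coarse heuristic: two patches touching the same file *can* conflict;
    # disjoint file sets cannot
    return bool(touched_files(patch_a) & touched_files(patch_b))

p1 = "--- a/src/cache.py\n+++ b/src/cache.py\n@@ -1 +1,2 @@\n"
p2 = "--- a/src/log.py\n+++ b/src/log.py\n@@ -1 +1,2 @@\n"
# may_conflict(p1, p2) is False -- the patches edit disjoint files
```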
4. Test execution

# Run complete test suite
bash run_tests.sh
5. Result analysis

result = {
    "both_passed": all_tests_passed,
    "feature1": {
        "passed": feature1_tests_passed,
        "test_output": "..."
    },
    "feature2": {
        "passed": feature2_tests_passed,
        "test_output": "..."
    },
    "merge_conflict": had_conflict,
}
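Across many tasks, these per-task result dicts can be rolled up into run-level statistics. A minimal sketch, assuming the `both_passed` / `merge_conflict` fields shown above; `success_counts` is a hypothetical helper.

```python
def success_counts(evals):
    # Tally both-pass outcomes and merge conflicts across a run
    # (Python sums booleans as 0/1)
    return {
        "both_passed": sum(e["both_passed"] for e in evals),
        "merge_conflicts": sum(e["merge_conflict"] for e in evals),
        "total": len(evals),
    }

stats = success_counts([
    {"both_passed": True,  "merge_conflict": False},
    {"both_passed": False, "merge_conflict": True},
])
# stats == {"both_passed": 1, "merge_conflicts": 1, "total": 2}
```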

Evaluation backends

Evaluation can run on different backends:
# Modal (default)
cooperbench eval -n exp --backend modal

# GCP Batch (efficient for large scale)
cooperbench eval -n exp --backend gcp

# Docker (local)
cooperbench eval -n exp --backend docker

Output structure

CooperBench generates comprehensive logs and metrics:
logs/{run_name}/
├── config.json                    # Run configuration
├── summary.json                   # Aggregate results
└── {setting}/                     # coop or solo
    └── {repo}/
        └── task{id}/
            └── f{i}_f{j}/         # Feature pair
                ├── result.json         # Task result
                ├── conversation.json   # Messages (coop only)
                ├── agent{i}.patch      # Agent patches
                ├── agent{i}_traj.json  # Trajectories
                └── eval.json           # Test results
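Because every leaf directory follows the same layout, results can be collected with a recursive glob. A sketch under that assumption, using a throwaway temp tree in place of a real `logs/` directory:

```python
import json
import tempfile
from pathlib import Path

def collect_evals(logs_dir):
    # Gather every eval.json anywhere under the run's log directory
    return [json.loads(p.read_text())
            for p in sorted(Path(logs_dir).rglob("eval.json"))]

# Build a miniature logs tree to demonstrate the walk
root = Path(tempfile.mkdtemp())
leaf = root / "coop" / "llama_index_task" / "task123" / "f1_f2"
leaf.mkdir(parents=True)
(leaf / "eval.json").write_text(json.dumps({"both_passed": True}))
evals = collect_evals(root)  # [{'both_passed': True}]
```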

Key output files

result.json:
{
  "repo": "llama_index_task",
  "task_id": 123,
  "features": [1, 2],
  "setting": "coop",
  "total_cost": 0.45,
  "total_steps": 23,
  "duration_seconds": 125.3,
  "agents": {
    "agent1": {
      "feature_id": 1,
      "status": "Submitted",
      "cost": 0.23,
      "steps": 12,
      "patch_lines": 45
    },
    "agent2": {...}
  }
}
eval.json:
{
  "both_passed": true,
  "feature1": {
    "passed": true,
    "test_output": "test_cache.py::test_basic PASSED\n..."
  },
  "feature2": {
    "passed": true,
    "test_output": "test_logging.py::test_info PASSED\n..."
  },
  "merge_conflict": false,
  "evaluated_at": "2026-03-04T10:30:00"
}
conversation.json:
[
  {
    "from": "agent1",
    "to": "agent2",
    "message": "I'm working on caching in src/cache.py",
    "timestamp": 1234567890,
    "feature_id": 1
  },
  {
    "from": "agent2",
    "to": "agent1",
    "message": "Got it, I'll handle logging separately",
    "timestamp": 1234567895,
    "feature_id": 2
  }
]

Concurrency and parallelization

CooperBench executes multiple tasks in parallel:
# Run with 30 parallel tasks
cooperbench run -n exp --concurrency 30

# Each task may spawn 2 agents (coop mode)
# Total: up to 60 concurrent sandboxes

Concurrency architecture

The execution backend handles spawning and managing agent sandboxes within the configured concurrency limit.
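The bounded-parallelism pattern above can be sketched with a standard worker pool; `run_task` is a stand-in for the real per-task execution, and `max_workers` plays the role of the `--concurrency` flag.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    # Stand-in for spawning sandboxes and running the agents for one task
    return {"task_id": task_id, "status": "done"}

# max_workers mirrors --concurrency: at most N tasks are in flight at once
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(run_task, range(5)))
```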

What’s next?

Quick start

Run your first benchmark using the architecture described above

Backend setup

Configure Modal, GCP, or Docker backends

Dataset structure

Understand how tasks are organized

CLI reference

Complete command-line options and parameters