High-level architecture
Core components
Task runner
Orchestrates task execution, manages concurrency, and tracks results
Execution backends
Provide isolated sandboxes for agent execution (Modal, GCP, Docker)
Communication layer
Redis-based messaging for inter-agent communication
Evaluation pipeline
Tests merged patches and computes success metrics
Execution backends
CooperBench supports three execution backends, each with different tradeoffs:- Modal (default)
- Google Cloud Platform
- Docker (local)
Cloud-based serverless executionModal provides managed containerized sandboxes that scale automatically.Architecture:Features:Pros:
- Automatic scaling to concurrency limits
- Fast cold starts (~5-10 seconds)
- Managed infrastructure
- GPU support available
- Pay-per-use pricing
- Zero infrastructure management
- Excellent for parallel execution
- Fast iteration cycles
- Good for small to medium experiments
- Requires internet connection
- Pay-per-use costs
- Less control over environment
Backend comparison
| Feature | Modal | GCP | Docker |
|---|---|---|---|
| Setup complexity | Low | Medium | Low |
| Concurrency | High (100+) | High (100+) | Low (CPU-bound) |
| Cost | Usage-based | VM-based | Free (local) |
| Cold start | ~5-10s | ~30-60s | ~2-5s |
| Internet required | Yes | Yes | No |
| Best for | Development, medium scale | Production, large scale | Local dev, debugging |
Agent execution pipeline
When a task runs, CooperBench follows this execution flow:Infrastructure setup
- Redis: Start or connect to messaging server
- Git server (if enabled): Create shared repository
- Namespacing: Create unique run ID for isolation
Sandbox initialization
For each agent:
- Pull task-specific Docker image
- Mount dataset files
- Configure environment variables
- Set up git remote (if enabled)
- Initialize Redis connection
Redis messaging system
CooperBench uses Redis for real-time agent communication:Architecture
Message flow
How messaging works
How messaging works
- Namespacing: Each run gets unique namespace
run:{run_id} - Channels: Per-agent channels
run:{run_id}:{agent_id} - Publishing: Agent sends message via
send_messagecommand - Subscription: Agents poll for new messages
- Delivery: Messages appear in agent’s context as user messages
Configuration
Git collaboration mode
Optional git-based code sharing for agents:Architecture
How it works
Backend-specific implementation
- Modal
- GCP
- Docker
- Serverless git daemon in Modal sandbox
- Network-accessible within Modal
- Automatic cleanup on completion
Evaluation pipeline
After agents complete tasks, patches are evaluated:Evaluation flow
Evaluation steps
Sandbox creation
Create isolated test environment:
- Pull task Docker image
- Clone repository at correct commit
- Run setup script
Evaluation backends
Evaluation can run on different backends:Output structure
CooperBench generates comprehensive logs and metrics:Key output files
result.json - Task execution results
result.json - Task execution results
eval.json - Test results
eval.json - Test results
conversation.json - Inter-agent messages
conversation.json - Inter-agent messages
Concurrency and parallelization
CooperBench executes multiple tasks in parallel:Concurrency architecture
Backend handles spawning and managing agent sandboxes based on concurrency limits.What’s next?
Quick start
Run your first benchmark with the architecture you learned
Backend setup
Configure Modal, GCP, or Docker backends
Dataset structure
Understand how tasks are organized
CLI reference
Complete command-line options and parameters