Directory structure
All experiment outputs are saved to the logs/ directory with the following structure:
Path components
Experiment name, specified with the -n or --name flag when running cooperbench run
Either "coop" (two agents) or "solo" (one agent). Automatically determined by the --setting flag.
Repository name with _task suffix (e.g., llama_index_task, pallets_jinja_task)
Task ID number from the dataset (e.g., 1621, 8394)
Feature pair directory with sorted feature IDs (e.g., f1_f3, f2_f5)
Cooperative mode output
When running with --setting coop (default), CooperBench generates per-agent outputs:
Agent patches
agent{i}.patch where {i} is the feature ID assigned to that agent.
Content: Standard unified diff format containing all changes made by the agent.
Note: Patch files are filtered to exclude test file changes before evaluation.
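The path components and patch naming described above can be sketched in Python. This is only an illustration: it assumes the components nest in the order they are listed under "Path components", and the helper names run_dir and agent_patch_path are hypothetical, not part of CooperBench.

```python
from pathlib import Path


def run_dir(name, setting, repo, task_id, feature_ids, root="logs"):
    """Build the output directory for one run.

    Assumption: the components nest in the order they are listed in the
    documentation (name / setting / repo / task ID / feature pair).
    """
    # Feature pair directory uses sorted feature IDs, e.g. f1_f3.
    pair = "_".join(f"f{i}" for i in sorted(feature_ids))
    return Path(root) / name / setting / repo / str(task_id) / pair


def agent_patch_path(directory, feature_id):
    """Patch file for the agent assigned to the given feature ID."""
    return directory / f"agent{feature_id}.patch"


directory = run_dir("demo", "coop", "llama_index_task", 1621, [3, 1])
patch = agent_patch_path(directory, 1)
```

Here "demo" stands in for whatever experiment name was passed with -n or --name.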
Agent trajectories
Trajectory schema
Repository name (e.g., "llama_index_task")
Task ID number
Feature ID assigned to this agent
Agent identifier (e.g., "agent1", "agent2")
LLM model used (e.g., "gpt-4o", "claude-sonnet-4.5")
Completion status: "Completed", "Error", or "Timeout"
Total cost in USD for this agent's execution
Number of agent steps (turns) taken
Complete conversation history including user prompts, assistant responses, and tool calls
Message format
Each message in the trajectory includes:
One of: "user", "assistant", "tool", or "system"
Message content (prompt, response, or tool output)
Unix timestamp when message was created
Cost for this specific message (LLM responses only)
Input tokens for this turn (LLM responses only)
Output tokens for this turn (LLM responses only)
Tool name (tool messages only)
Conversation log
Sending agent ID
Receiving agent ID (or "all" for broadcasts)
Message content
Unix timestamp when sent
Always false (only sent messages are logged; received messages are duplicates)
The conversation log is generated by extracting messages with role: "agent_message" from agent trajectories. Only sent messages are included to avoid duplication.
Solo mode output
When running with --setting solo, a single agent handles both features:
Solo patch
solo.patch — contains changes for both features in a single patch.
Solo trajectory
- features is an array (not a single feature_id)
- agent_id is always "solo"
- No conversation log (single agent)
- Higher step count (handles both features)
Result metadata
Every run generates a result.json with execution metadata:
Result schema
Repository name
Task ID
Array of feature IDs (sorted)
"coop" or "solo"
Unique 8-character identifier for this run
Experiment name
Agent implementation used (e.g., "mini_swe_agent")
LLM model identifier
ISO 8601 timestamp
ISO 8601 timestamp
Total execution time in seconds
Per-agent statistics (coop mode only). Keys are agent IDs.
Single agent statistics (solo mode only)
Sum of all agent costs in USD
Sum of all agent steps
Number of inter-agent messages (coop mode only)
Absolute path to output directory
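As a rough illustration of how the aggregate values relate to the per-agent statistics, the sketch below sums cost and steps over a mapping keyed by agent ID. The key names "cost" and "steps" and the function name aggregate_totals are assumptions made for this example; check them against an actual result.json before relying on them.

```python
def aggregate_totals(agent_stats):
    """Sum cost and steps across agents.

    agent_stats maps agent IDs to per-agent statistics. The inner key names
    ("cost", "steps") are hypothetical placeholders for this sketch.
    """
    total_cost = sum(stats["cost"] for stats in agent_stats.values())
    total_steps = sum(stats["steps"] for stats in agent_stats.values())
    return {"total_cost": total_cost, "total_steps": total_steps}


totals = aggregate_totals({
    "agent1": {"cost": 0.25, "steps": 12},
    "agent2": {"cost": 0.5, "steps": 20},
})
```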
Evaluation results
After running cooperbench eval, each task directory gets an eval.json file:
Evaluation schema
Repository name
Task ID
Feature IDs evaluated
"coop" or "solo"
Merge information (coop mode only, null for solo)
Test results for first feature
Test results for second feature (same schema as feature1)
true only if both features passed and merge succeeded (for coop)
Error message if evaluation failed (e.g., setup errors, timeouts)
ISO 8601 timestamp when evaluation completed
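The overall pass condition described above (both features pass, and in coop mode the merge must also succeed) can be written as a small predicate. This is a sketch of the stated rule only; the function and parameter names are hypothetical and do not reflect how CooperBench stores or computes the value.

```python
def both_passed(setting, feature1_passed, feature2_passed, merge_succeeded=None):
    """Return True only if both features passed and, for coop, the merge succeeded.

    merge_succeeded is ignored in solo mode, where merge information is null.
    """
    tests_ok = feature1_passed and feature2_passed
    if setting == "coop":
        # A missing (None) merge result is treated as a failed merge here;
        # that choice is an assumption of this sketch.
        return tests_ok and bool(merge_succeeded)
    return tests_ok
```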
Evaluation summary
After evaluating all runs, CooperBench generates a summary:
Experiment name
ISO 8601 timestamp
Total number of evaluated runs
Number of runs where both features passed
Number of runs where at least one feature failed
Number of runs that encountered evaluation errors
Number of runs skipped (already evaluated, unless --force)
Success rate: passed / (passed + failed)
Array of per-run results with run identifier and status
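The success rate formula counts only passed and failed runs, so runs that errored or were skipped do not affect it. A minimal sketch follows; the function name is hypothetical, and returning 0.0 when nothing was evaluated is an assumption, since the documentation does not say how that case is handled.

```python
def success_rate(passed, failed):
    """Compute passed / (passed + failed).

    Errored and skipped runs are not part of the denominator.
    """
    evaluated = passed + failed
    if evaluated == 0:
        # Assumption: report 0.0 rather than raising when no run counted.
        return 0.0
    return passed / evaluated


rate = success_rate(passed=3, failed=1)
```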