Function signature
Parameters
- Name of the run to evaluate. Corresponds to the run name used in run().
- Filter evaluation to a specific subset (e.g., "lite").
- Filter by repository name (e.g., "llama_index_task").
- Filter to a specific task ID.
- Specific feature pair to evaluate (e.g., [1, 2]).
- Number of parallel evaluations to run.
- If True, re-evaluates even if eval.json already exists.
- Evaluation backend: "modal", "docker", or "gcp".
Basic usage
Evaluate all runs
Evaluate a subset
Evaluate specific tasks
Advanced usage
Force re-evaluation
Use different backend
High-concurrency evaluation
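The usage patterns above can be sketched as follows. The function name `evaluate` and every parameter name (`run_name`, `subset`, `repo`, `task_id`, `features`, `concurrency`, `force`, `backend`) are assumptions inferred from the parameter descriptions, not the confirmed API:

```python
# Sketch only: the function name `evaluate` and all parameter names below are
# assumptions inferred from the parameter descriptions, not the real API.
def evaluate(run_name, subset=None, repo=None, task_id=None, features=None,
             concurrency=4, force=False, backend="modal"):
    """Stub mirroring the documented parameters; returns the resolved options."""
    return {"run_name": run_name, "subset": subset, "repo": repo,
            "task_id": task_id, "features": features,
            "concurrency": concurrency, "force": force, "backend": backend}

# Basic usage: evaluate all runs under a run name
evaluate("my_run")

# Evaluate a subset
evaluate("my_run", subset="lite")

# Evaluate specific tasks (repo/task names here are illustrative)
evaluate("my_run", repo="llama_index_task", task_id="task_001", features=[1, 2])

# Advanced: force re-evaluation, use a different backend, raise concurrency
evaluate("my_run", force=True, backend="docker", concurrency=32)
```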
How evaluation works
Cooperative mode
For cooperative runs, evaluation:
- Merges patches from both agents (agent1.patch + agent2.patch)
- Applies the merged patch to the repository
- Applies feature 1 tests and runs them
- Applies feature 2 tests and runs them
- Reports whether both feature tests pass
Solo mode
For solo runs, evaluation:
- Applies the single agent’s patch (solo.patch)
- Applies feature 1 tests and runs them
- Applies feature 2 tests and runs them
- Reports whether both feature tests pass
Evaluation results
Results are saved to logs/{run_name}/{setting}/{repo}/{task_id}/{features}/eval.json:
Result fields
- Repository name
- Task identifier
- Feature pair that was tested
- Execution mode ("coop" or "solo")
- Merge information (cooperative mode only):
  - status: "success", "conflict", or "failed"
  - strategy: Git merge strategy used
- Feature 1 test results:
  - passed: Whether all tests passed
  - tests_passed: Number of passing tests
  - tests_failed: Number of failing tests
  - test_output: Full test output
- Feature 2 test results (same structure as feature1)
- True if both feature tests passed
- Error message if evaluation failed
- ISO timestamp of evaluation
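A result matching the fields above might look like the following. Only status, strategy, passed, tests_passed, tests_failed, test_output, and feature1 appear on this page; every other key name and all values are assumptions:

```python
# Hypothetical eval.json contents. Key names other than those documented on
# this page (status, strategy, passed, tests_passed, tests_failed,
# test_output, feature1) are assumptions, as are all the values.
sample_result = {
    "repo": "llama_index_task",
    "task_id": "task_001",
    "features": [1, 2],
    "mode": "coop",
    "merge": {"status": "success", "strategy": "ours"},
    "feature1": {"passed": True, "tests_passed": 12, "tests_failed": 0,
                 "test_output": "..."},
    "feature2": {"passed": True, "tests_passed": 9, "tests_failed": 0,
                 "test_output": "..."},
    "success": True,
    "error": None,
    "timestamp": "2024-01-01T00:00:00+00:00",
}
```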
Summary output
A summary is also saved to logs/{run_name}/eval_summary.json:
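One way to picture the summary: aggregate the per-task results into counts and a pass rate. The summary file's actual fields are not specified on this page, so every name below is an assumption:

```python
# Hypothetical aggregation into eval_summary.json-style counts. The real
# summary fields are not documented on this page; all names are assumptions.
def summarize(results):
    """Count passing tasks among per-task result dicts."""
    passed = sum(1 for r in results if r.get("success"))
    return {"total": len(results), "passed": passed,
            "pass_rate": passed / len(results) if results else 0.0}

summary = summarize([{"success": True}, {"success": False}, {"success": True}])
```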
Related functions
- run() - Execute benchmark tasks
- discover_runs() - Query completed runs
- test_merged() - Low-level patch testing (cooperative)
- test_solo() - Low-level patch testing (solo)
- evaluate_merge() - Training-compatible wrapper for test_merged()
Low-level testing functions
test_merged()
Test merged patches from two agents (cooperative mode).
Parameters
- Repository name
- Task ID
- First feature ID
- Second feature ID
- First agent’s patch (as string or path to .patch file)
- Second agent’s patch (as string or path to .patch file)
- Maximum execution time in seconds
- Evaluation backend
test_solo()
Test a single patch against both feature tests (solo mode). Parameters are the same as for test_merged() except only one patch parameter is used.
run_patch_test()
Test a patch against a single feature’s tests.
Parameters
- Repository name
- Task ID
- Feature ID to test
- Agent’s patch (as string or path). If None, uses gold patch from dataset.
- Maximum execution time in seconds
- Evaluation backend
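The gold-patch fallback can be sketched as follows; parameter names are assumptions, and the only documented behavior shown is that passing None for the patch falls back to the dataset's gold patch:

```python
# Sketch of run_patch_test(); parameter names are assumptions. The one
# documented behavior modeled here: patch=None uses the gold patch.
def run_patch_test(repo, task_id, feature, patch=None, timeout=600,
                   backend="modal"):
    """Stub: report whether the gold patch would be used."""
    return {"feature": feature, "used_gold_patch": patch is None}

gold = run_patch_test("llama_index_task", "task_001", feature=1)
agent = run_patch_test("llama_index_task", "task_001", feature=1,
                       patch="solo.patch")
```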
evaluate_merge()
Wrapper for test_merged() that returns results in a format compatible with training code.
Parameters
Parameters are identical to test_merged().
Returns
Returns a dictionary with the following keys:
- Number of tests passed for feature 1
- Total number of tests for feature 1
- Number of tests passed for feature 2
- Total number of tests for feature 2
- Error message if evaluation failed, None otherwise
This function is primarily used for training and benchmarking workflows that require a specific result format.
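The return shape can be sketched as a conversion from test_merged()-style per-feature results into the flat dictionary described above. The actual key names are not given on this page, so the ones below are assumptions:

```python
# Hypothetical conversion from test_merged()-style per-feature results to the
# flat training-format dict described above; all key names are assumptions.
def to_training_format(result):
    f1, f2 = result["feature1"], result["feature2"]
    return {
        "f1_passed": f1["tests_passed"],
        "f1_total": f1["tests_passed"] + f1["tests_failed"],
        "f2_passed": f2["tests_passed"],
        "f2_total": f2["tests_passed"] + f2["tests_failed"],
        "error": result.get("error"),
    }

flat = to_training_format({
    "feature1": {"tests_passed": 12, "tests_failed": 0},
    "feature2": {"tests_passed": 7, "tests_failed": 2},
    "error": None,
})
```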