CodeExecution evaluates code generation by executing the model’s output against the test cases provided in the dataset example, using a subprocess with a configurable timeout. A score of 1.0 (pass@1) means the first generated solution passes all tests.
Constructor
Maximum time in seconds to wait for the subprocess to complete. If execution exceeds this limit, the attempt is counted as a failure.
score()
The unmodified example dict. Expected keys vary by dataset:
- HumanEval: `"context"` (function prompt), `"test"` (check function body), `"entry_point"` (function name to call).
- MBPP: `"test_list"` (list of assertion strings), optionally `"test_setup_code"` (imports/helpers).
The output dict returned by the system under test. Must contain a `"response"` key with the generated code string.
Return values
1.0 if the generated code passes all test cases (subprocess exit code 0), otherwise 0.0. Any exception during execution, including syntax errors, runtime errors, assertion failures, and timeouts, results in `pass_at_1: 0.0`.
Example
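The class's exact signatures are not shown on this page, so the following is a self-contained stand-in that reproduces the documented HumanEval-path behavior end to end. The constructor's `timeout` parameter name, its default value, and the `score(example, output)` argument order are assumptions; only the class name, the dict keys, and the pass/fail semantics come from this page.

```python
import subprocess
import sys

class CodeExecution:
    """Minimal stand-in mirroring the documented behavior; the real
    class's constructor and method signatures may differ."""

    def __init__(self, timeout: float = 10.0):  # parameter name/default assumed
        self.timeout = timeout

    def score(self, example: dict, output: dict) -> dict:
        # HumanEval-style assembly: context + response + test + harness call.
        script = "\n".join([
            example["context"],
            output["response"],
            example["test"],
            f"check({example['entry_point']})",
        ])
        try:
            proc = subprocess.run([sys.executable, "-c", script],
                                  capture_output=True, timeout=self.timeout)
            passed = proc.returncode == 0
        except Exception:  # timeouts and launch failures score as 0.0
            passed = False
        return {"pass_at_1": 1.0 if passed else 0.0}

example = {
    "context": "def add(a, b):\n",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}
output = {"response": "    return a + b\n"}

print(CodeExecution().score(example, output))  # {'pass_at_1': 1.0}
```

A wrong completion (say, `return a - b`) fails the assertion inside `check`, the subprocess exits non-zero, and the score comes back as `{"pass_at_1": 0.0}`.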
Auto-wired datasets
CodeExecution is automatically applied when any of the following datasets are selected:
| CLI name | Dataset |
|---|---|
| `humaneval` | HumanEval |
| `mbpp` | MBPP |
Implementation notes
The evaluator assembles the executable code differently depending on the dataset format.

HumanEval style (`test` field present):
- Concatenate `context` + `response` + `test` into a single script.
- Append `check(entry_point)` to invoke the test harness.
MBPP style (`test_list` field present):
- Start with `response`.
- Append the optional `test_setup_code` (imports, helpers).
- Append each assertion string from `test_list`.
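The MBPP-style assembly steps above can be sketched as follows; the ordering follows the list, but the function name and the newline separator are illustrative assumptions:

```python
def assemble_mbpp(example: dict, response: str) -> str:
    """Build the MBPP-style script: response first, then optional setup,
    then one line per assertion string."""
    parts = [response]
    if example.get("test_setup_code"):        # optional imports/helpers
        parts.append(example["test_setup_code"])
    parts.extend(example["test_list"])        # each entry is an assertion
    return "\n".join(parts)

script = assemble_mbpp(
    {"test_list": ["assert double(4) == 8", "assert double(0) == 0"]},
    "def double(x):\n    return 2 * x",
)
```

Running the resulting `script` raises `AssertionError` on the first failing assertion, which is what makes the subprocess exit non-zero.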
The assembled script is executed with `python3 -c` via `subprocess.run`. Exit code 0 means all assertions passed. `subprocess.TimeoutExpired` and any other exceptions are caught and scored as 0.0.
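The run-and-score step might look like this sketch; the helper name and the default timeout are assumptions, while the `python3 -c` invocation, the exit-code check, and the blanket exception handling follow the notes above:

```python
import subprocess
import sys

def run_script(script: str, timeout: float = 10.0) -> float:
    """Execute an assembled script in a subprocess; 1.0 iff exit code is 0.

    Any failure mode (non-zero exit, TimeoutExpired, launch errors)
    scores 0.0. The 10 s default timeout is an assumption.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],   # stands in for `python3 -c`
            capture_output=True,
            timeout=timeout,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except Exception:  # includes subprocess.TimeoutExpired
        return 0.0

print(run_script("assert 1 + 1 == 2"))  # passing script -> 1.0
print(run_script("assert 1 + 1 == 3"))  # failing assertion -> 0.0
```

Note that `capture_output=True` keeps the generated code's stdout/stderr from leaking into the evaluator's own output; only the exit code matters for scoring.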
