CodeExecution evaluates code generation by executing the model’s output against the test cases provided in the dataset example, using a subprocess with a configurable timeout. A score of 1.0 (pass@1) means the first generated solution passes all tests.
from context_bench.evaluators import CodeExecution

Constructor

CodeExecution(timeout: float = 10.0)
timeout
float
default:"10.0"
Maximum time in seconds to wait for the subprocess to complete. If execution exceeds this limit, the attempt is counted as a failure.
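The timeout behavior can be illustrated with a standalone sketch (this is not the library's internal code; it only mimics the documented semantics using the standard library):

```python
import subprocess
import sys

# A script that runs longer than the limit: the subprocess is killed,
# TimeoutExpired is raised, and the attempt is scored 0.0.
script = "import time; time.sleep(60)"
try:
    subprocess.run([sys.executable, "-c", script], timeout=1.0)
    pass_at_1 = 1.0
except subprocess.TimeoutExpired:
    pass_at_1 = 0.0

print(pass_at_1)  # 0.0
```

The evaluator applies the same rule: exceeding the timeout is indistinguishable from a failed test run.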

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original
dict
required
The unmodified example dict. Expected keys vary by dataset:
  • HumanEval: "context" (function prompt), "test" (check function body), "entry_point" (function name to call).
  • MBPP: "test_list" (list of assertion strings), optionally "test_setup_code" (imports/helpers).
processed
dict
required
The output dict returned by the system under test. Must contain a "response" key with the generated code string.

Return values

pass_at_1
float
required
1.0 if the generated code passes all test cases (subprocess exit code 0), otherwise 0.0.
Any exception during execution — including syntax errors, runtime errors, assertion failures, and timeouts — results in pass_at_1: 0.0.

Example

from context_bench.evaluators import CodeExecution
ev = CodeExecution(timeout=10.0)
ev.score(
    {"context": "def add(a, b):\n", "test": "def check(c):\n    assert c(1,2)==3\n", "entry_point": "add"},
    {"response": "    return a + b\n"},
)
# {'pass_at_1': 1.0}

Auto-wired datasets

CodeExecution is automatically applied when any of the following datasets are selected:
CLI name     Dataset
humaneval    HumanEval
mbpp         MBPP

Implementation notes

The evaluator assembles the executable code differently depending on the dataset format:

HumanEval style (test field present):
  1. Concatenate context + response + test into a single script.
  2. Append check(entry_point) to invoke the test harness.
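The HumanEval-style assembly can be sketched as follows (field values are illustrative; the exact internal implementation may differ):

```python
import subprocess
import sys

# Example HumanEval-style fields (same shape as the Example section above).
context = "def add(a, b):\n"
response = "    return a + b\n"
test = "def check(c):\n    assert c(1, 2) == 3\n"
entry_point = "add"

# Concatenate context + response + test, then invoke the test harness.
script = context + response + test + f"check({entry_point})\n"
result = subprocess.run([sys.executable, "-c", script])
pass_at_1 = 1.0 if result.returncode == 0 else 0.0
print(pass_at_1)  # 1.0
```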
MBPP style (test_list field present):
  1. Start with response.
  2. Append optional test_setup_code (imports, helpers).
  3. Append each assertion string from test_list.
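The MBPP-style assembly follows the same pattern, concatenating the response, any setup code, and the assertion strings (field values here are illustrative):

```python
import subprocess
import sys

# Example MBPP-style fields.
response = "def add(a, b):\n    return a + b\n"
test_setup_code = ""  # optional imports/helpers; empty here
test_list = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

# Start with response, append setup code, then each assertion.
script = response + test_setup_code + "\n".join(test_list)
result = subprocess.run([sys.executable, "-c", script])
pass_at_1 = 1.0 if result.returncode == 0 else 0.0
print(pass_at_1)  # 1.0
```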
The assembled script is passed to python3 -c via subprocess.run. Exit code 0 means all assertions passed. subprocess.TimeoutExpired and any other exceptions are caught and scored as 0.0.
Code execution runs untrusted model output in a subprocess on the host machine. Use a sandboxed environment for production benchmarking.
