MultipleChoiceAccuracy evaluates multiple-choice questions by extracting the letter selected by the model and comparing it to the ground-truth letter stored in the dataset example.
from context_bench.evaluators import MultipleChoiceAccuracy

Constructor

MultipleChoiceAccuracy takes no constructor parameters.
ev = MultipleChoiceAccuracy()

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original (dict, required)
The unmodified example dict. Must contain a "correct_letter" key (e.g. "A", "B") with the ground-truth answer letter.

processed (dict, required)
The output dict returned by the system under test. Must contain a "response" key with the model’s output string.

Return values

mc_accuracy (float)
1.0 if the extracted letter matches correct_letter, otherwise 0.0. Returns 0.0 if either correct_letter or response is missing or empty.
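
As a rough illustration of these rules, here is a standalone sketch of the scoring semantics. This is a hypothetical stand-in, not the library's actual code, and it substitutes a naive single-regex extractor for the real letter-extraction logic:

```python
import re
from typing import Any

def score_sketch(original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]:
    # Hypothetical approximation of MultipleChoiceAccuracy.score();
    # the real implementation in context_bench may differ in detail.
    correct = original.get("correct_letter") or ""
    response = processed.get("response") or ""
    if not correct or not response:
        return {"mc_accuracy": 0.0}  # missing or empty inputs never match
    # Naive extraction stand-in: first standalone A-J letter in the response.
    m = re.search(r"\b([A-J])\b", response)
    extracted = m.group(1) if m else ""
    return {"mc_accuracy": 1.0 if extracted == correct else 0.0}
```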

Example

from context_bench.evaluators import MultipleChoiceAccuracy
ev = MultipleChoiceAccuracy()
ev.score({"correct_letter": "B"}, {"response": "The answer is B."})
# {'mc_accuracy': 1.0}

Auto-wired datasets

MultipleChoiceAccuracy is automatically applied when any of the following datasets are selected:
CLI name        Dataset
mmlu            MMLU (4-choice, configurable per-subject)
arc-challenge   ARC-Challenge
gpqa            GPQA Diamond
hellaswag       HellaSwag
winogrande      WinoGrande
mmlu-pro        MMLU-Pro (10-choice)

Implementation notes

The evaluator uses a cascade of regex patterns to extract the letter (A–J) from free-form model output:
  1. Single letter — if the entire stripped response is one character in A–J, use it directly.
  2. Labeled pattern — matches "The answer is A", "answer: B", "choice is C", etc.
  3. Parenthesized — matches (A), (B), etc.
  4. Fallback — takes the first standalone word-boundary letter in A–J found anywhere in the response.
If none of the patterns match, an empty string is returned and the score is 0.0.
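
The cascade above can be approximated in a standalone sketch. The regex patterns below are illustrative assumptions; the exact expressions used by context_bench are not reproduced here:

```python
import re

def extract_letter(response: str) -> str:
    # Illustrative approximation of the four-step cascade described above.
    text = response.strip()
    # 1. Single letter: the entire stripped response is one character in A-J.
    if len(text) == 1 and text.upper() in "ABCDEFGHIJ":
        return text.upper()
    # 2. Labeled pattern: "The answer is A", "answer: B", "choice is C", ...
    m = re.search(r"(?:answer|choice)\s*(?:is|:)?\s*\(?([A-Ja-j])\b", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # 3. Parenthesized: (A), (B), ...
    m = re.search(r"\(([A-Ja-j])\)", text)
    if m:
        return m.group(1).upper()
    # 4. Fallback: first standalone word-boundary letter in A-J anywhere.
    m = re.search(r"\b([A-J])\b", text)
    if m:
        return m.group(1)
    return ""  # no pattern matched; the caller scores this as 0.0
```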
