`MultipleChoiceAccuracy` evaluates multiple-choice questions by extracting the letter selected by the model and comparing it to the ground-truth letter stored in the dataset example.
## Constructor

`MultipleChoiceAccuracy` takes no constructor parameters.
## score()

**Parameters**

- The unmodified example dict. Must contain a `"correct_letter"` key (e.g. `"A"`, `"B"`) with the ground-truth answer letter.
- The output dict returned by the system under test. Must contain a `"response"` key with the model's output string.

**Returns**

- `1.0` if the extracted letter matches `correct_letter`, otherwise `0.0`.
- `0.0` if either `correct_letter` or `response` is missing or empty.

## Example
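A minimal usage sketch. The import path for the real class is not given on this page, so a stand-in that mirrors the documented behavior is defined inline for illustration; in practice you would import `MultipleChoiceAccuracy` from its actual module.

```python
import re

# Stand-in mirroring the documented contract (hypothetical, for illustration):
# real implementations should be imported, not redefined.
class MultipleChoiceAccuracy:
    def score(self, example, output):
        correct = (example or {}).get("correct_letter", "")
        response = (output or {}).get("response", "")
        # Missing or empty correct_letter / response scores 0.0.
        if not correct or not response:
            return 0.0
        # Simplified extraction: first standalone A-J letter in the response.
        match = re.search(r"\b([A-J])\b", response.strip())
        return 1.0 if match and match.group(1) == correct else 0.0

evaluator = MultipleChoiceAccuracy()
print(evaluator.score({"correct_letter": "B"}, {"response": "The answer is B"}))  # 1.0
print(evaluator.score({}, {"response": "A"}))  # 0.0 (no correct_letter key)
```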
## Auto-wired datasets

`MultipleChoiceAccuracy` is automatically applied when any of the following datasets are selected:
| CLI name | Dataset |
|---|---|
| `mmlu` | MMLU (4-choice, configurable per-subject) |
| `arc-challenge` | ARC-Challenge |
| `gpqa` | GPQA Diamond |
| `hellaswag` | HellaSwag |
| `winogrande` | WinoGrande |
| `mmlu-pro` | MMLU-Pro (10-choice) |
## Implementation notes

The evaluator uses a cascade of regex patterns to extract the letter (A–J) from free-form model output:

1. **Single letter**: if the entire stripped response is one character in A–J, use it directly.
2. **Labeled pattern**: matches `"The answer is A"`, `"answer: B"`, `"choice is C"`, etc.
3. **Parenthesized**: matches `(A)`, `(B)`, etc.
4. **Fallback**: takes the first standalone word-boundary letter in A–J found anywhere in the response.

If no pattern matches, `score()` returns `0.0`.
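The cascade can be sketched as follows. The exact regexes in the real implementation are not shown on this page, so the patterns below are assumptions derived from the examples listed above:

```python
import re

# Hypothetical patterns reconstructed from the documented examples.
_LABELED = re.compile(r"(?:answer\s+is|answer:|choice\s+is)\s*\(?([A-J])\)?",
                      re.IGNORECASE)
_PAREN = re.compile(r"\(([A-J])\)")
_FALLBACK = re.compile(r"\b([A-J])\b")

def extract_letter(response: str):
    text = response.strip()
    # 1. Single letter: the whole stripped response is one character in A-J.
    if len(text) == 1 and text in "ABCDEFGHIJ":
        return text
    # 2. Labeled pattern: "The answer is A", "answer: B", "choice is C", ...
    m = _LABELED.search(text)
    if m:
        return m.group(1).upper()
    # 3. Parenthesized: "(A)", "(B)", ...
    m = _PAREN.search(text)
    if m:
        return m.group(1)
    # 4. Fallback: first standalone word-boundary A-J letter anywhere.
    m = _FALLBACK.search(text)
    return m.group(1) if m else None
```

Because the patterns are tried in order, a labeled answer like `"The answer is B"` wins even if another standalone letter appears earlier in the response; the bare word-boundary scan only runs as a last resort.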