MultipleChoiceAccuracy evaluates multiple-choice questions by extracting the letter selected by the model and comparing it to the ground-truth letter stored in the dataset example.
from context_bench.evaluators import MultipleChoiceAccuracy

Constructor

MultipleChoiceAccuracy takes no constructor parameters.
ev = MultipleChoiceAccuracy()

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original (dict, required)
The unmodified example dict. Must contain a "correct_letter" key (e.g. "A", "B") with the ground-truth answer letter.

processed (dict, required)
The output dict returned by the system under test. Must contain a "response" key with the model’s output string.

Return values

mc_accuracy (float)
1.0 if the extracted letter matches correct_letter, otherwise 0.0. Returns 0.0 if either correct_letter or response is missing or empty.
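
As a rough illustration of these rules, here is a standalone sketch of the scoring semantics. This is a hypothetical stand-in, not the library's actual code, and it substitutes a naive single-regex extractor for the real letter-extraction logic:

```python
import re
from typing import Any

def score_sketch(original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]:
    # Hypothetical approximation of MultipleChoiceAccuracy.score();
    # the real implementation in context_bench may differ in detail.
    correct = original.get("correct_letter") or ""
    response = processed.get("response") or ""
    if not correct or not response:
        return {"mc_accuracy": 0.0}  # missing or empty inputs never match
    # Naive extraction stand-in: first standalone A-J letter in the response.
    m = re.search(r"\b([A-J])\b", response)
    extracted = m.group(1) if m else ""
    return {"mc_accuracy": 1.0 if extracted == correct else 0.0}
```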

Example

from context_bench.evaluators import MultipleChoiceAccuracy
ev = MultipleChoiceAccuracy()
ev.score({"correct_letter": "B"}, {"response": "The answer is B."})
# {'mc_accuracy': 1.0}

Auto-wired datasets

MultipleChoiceAccuracy is automatically applied when any of the following datasets are selected:
CLI name        Dataset
mmlu            MMLU (4-choice, configurable per-subject)
arc-challenge   ARC-Challenge
gpqa            GPQA Diamond
hellaswag       HellaSwag
winogrande      WinoGrande
mmlu-pro        MMLU-Pro (10-choice)

Implementation notes

The evaluator uses a cascade of regex patterns to extract the letter (A–J) from free-form model output:
  1. Single letter — if the entire stripped response is one character in A–J, use it directly.
  2. Labeled pattern — matches "The answer is A", "answer: B", "choice is C", etc.
  3. Parenthesized — matches (A), (B), etc.
  4. Fallback — takes the first standalone word-boundary letter in A–J found anywhere in the response.
If none of the patterns match, an empty string is returned and the score is 0.0.
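
The cascade above can be approximated in a standalone sketch. The regex patterns below are illustrative assumptions; the exact expressions used by context_bench are not reproduced here:

```python
import re

def extract_letter(response: str) -> str:
    # Illustrative approximation of the four-step cascade described above.
    text = response.strip()
    # 1. Single letter: the entire stripped response is one character in A-J.
    if len(text) == 1 and text.upper() in "ABCDEFGHIJ":
        return text.upper()
    # 2. Labeled pattern: "The answer is A", "answer: B", "choice is C", ...
    m = re.search(r"(?:answer|choice)\s*(?:is|:)?\s*\(?([A-Ja-j])\b", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # 3. Parenthesized: (A), (B), ...
    m = re.search(r"\(([A-Ja-j])\)", text)
    if m:
        return m.group(1).upper()
    # 4. Fallback: first standalone word-boundary letter in A-J anywhere.
    m = re.search(r"\b([A-J])\b", text)
    if m:
        return m.group(1)
    return ""  # no pattern matched; the caller scores this as 0.0
```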
