An evaluator compares the original example (before your system ran) with the processed example (after) and returns a dict[str, float] of named scores.

The Evaluator protocol

from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Evaluator(Protocol):
    """Scores a (input, output) pair. Returns dict of metric_name -> float."""

    @property
    def name(self) -> str: ...

    def score(
        self, original: dict[str, Any], processed: dict[str, Any]
    ) -> dict[str, float]: ...
original is the raw dataset example. processed is the dict returned by System.process(). Evaluators read fields from both — for example, original["answer"] (the ground truth) and processed["response"] (the system’s output).
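As a minimal sketch of this contract (the field names follow the examples later on this page; `contains_score` is a hypothetical function, not part of the library), a scoring call looks like:

```python
original = {"question": "What is the capital of France?", "answer": "Paris"}
processed = {"response": "The capital of France is Paris."}

# An evaluator reads the ground truth from `original` and the system
# output from `processed`, then returns named float scores.
def contains_score(original, processed):
    answer = original["answer"].lower()
    response = processed["response"].lower()
    return {"contains": 1.0 if answer in response else 0.0}

contains_score(original, processed)  # {'contains': 1.0}
```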

Auto-wiring

When you run the CLI or call evaluate(), evaluators are selected automatically based on which datasets you load, so you do not need to configure them manually. For example, loading humaneval auto-wires CodeExecution, and loading mmlu auto-wires MultipleChoiceAccuracy. With the Python API, you can also pass evaluators explicitly:
from context_bench.evaluators import AnswerQuality, MathEquivalence

result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality(), MathEquivalence()],
    ...
)

Built-in evaluators

AnswerQuality

Token-level F1 and exact match using SQuAD-style text normalization. Applied to every dataset by default.
| Score field | Description |
| --- | --- |
| f1 | Token-overlap F1 between answer and response |
| exact_match | 1.0 if normalized strings match exactly |
| recall | Fraction of answer tokens present in response |
| contains | 1.0 if answer appears as a substring in response |
from context_bench.evaluators import AnswerQuality

ev = AnswerQuality()
ev.score({"answer": "Paris"}, {"response": "The capital is Paris."})
# {'f1': 0.5, 'exact_match': 0.0, 'recall': 1.0, 'contains': 1.0}
Auto-wired for: all datasets.
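The normalization and F1 computation can be sketched roughly as follows — a simplified re-implementation for illustration, not the library's code:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style: lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(answer: str, response: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    gold = normalize(answer).split()
    pred = normalize(response).split()
    common = Counter(gold) & Counter(pred)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

token_f1("Paris", "The capital is Paris.")  # 0.5
```

This reproduces the f1 value in the example above: "Paris" is one of three normalized response tokens, so precision is 1/3, recall is 1, and F1 is 0.5.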

SummarizationQuality

ROUGE-L precision, recall, and F1. Auto-wired for summarization datasets.
| Score field | Description |
| --- | --- |
| rouge_l_precision | ROUGE-L precision |
| rouge_l_recall | ROUGE-L recall |
| rouge_l_f1 | ROUGE-L F1 |
Auto-wired for: multi-news, dialogsum, qmsum, summscreenfd, meetingbank, govreport.
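ROUGE-L is based on the longest common subsequence (LCS) of tokens between the reference and the generated summary. A simplified sketch, not the library's implementation:

```python
def rouge_l(reference: str, candidate: str) -> dict[str, float]:
    """ROUGE-L via LCS length over whitespace tokens (illustrative sketch)."""
    ref, cand = reference.split(), candidate.split()
    # Standard dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"rouge_l_precision": precision, "rouge_l_recall": recall, "rouge_l_f1": f1}
```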

MultipleChoiceAccuracy

Extracts the chosen letter (A–J) from the response and compares it to the correct answer.
| Score field | Description |
| --- | --- |
| mc_accuracy | 1.0 if extracted letter matches correct letter |
from context_bench.evaluators import MultipleChoiceAccuracy

ev = MultipleChoiceAccuracy()
ev.score({"correct_letter": "B"}, {"response": "The answer is B."})
# {'mc_accuracy': 1.0}
Auto-wired for: mmlu, mmlu-pro, arc-challenge, gpqa, hellaswag, winogrande.

CodeExecution

Runs generated code against test cases in a subprocess and reports pass@1.
| Score field | Description |
| --- | --- |
| pass_at_1 | 1.0 if the generated code passes all test cases |
from context_bench.evaluators import CodeExecution

ev = CodeExecution(timeout=10.0)
ev.score(
    {
        "context": "def add(a, b):\n",
        "test": "def check(c):\n    assert c(1,2)==3\n",
        "entry_point": "add",
    },
    {"response": "    return a + b\n"},
)
# {'pass_at_1': 1.0}
Auto-wired for: humaneval, mbpp.
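The execution step can be sketched as below: assemble prompt, completion, and tests into one program, run it in a fresh interpreter, and treat a zero exit code as a pass. This is an illustration of the mechanism only — the library's actual harness may add sandboxing and resource limits that this sketch omits:

```python
import subprocess
import sys

def run_candidate(context, completion, test, entry_point, timeout=10.0):
    """Assemble prompt + completion + test harness, run it in a subprocess (sketch)."""
    program = context + completion + "\n" + test + f"\ncheck({entry_point})\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
        # Exit code 0 means every assertion in the test harness passed.
        return {"pass_at_1": 1.0 if proc.returncode == 0 else 0.0}
    except subprocess.TimeoutExpired:
        return {"pass_at_1": 0.0}
```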

MathEquivalence

LaTeX-aware numeric comparison. Handles fractions, percentages, and boxed answers.
| Score field | Description |
| --- | --- |
| math_equiv | 1.0 if the response is mathematically equivalent to the answer |
from context_bench.evaluators import MathEquivalence

ev = MathEquivalence()
ev.score({"answer": r"\frac{1}{2}"}, {"response": "0.5"})
# {'math_equiv': 1.0}

ev.score({"answer": "42"}, {"response": r"The answer is $\boxed{42}$."})
# {'math_equiv': 1.0}
Auto-wired for: math, gsm8k, mgsm.
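A rough sketch of the kind of normalization involved — strip a \boxed{...} wrapper, then parse fractions, percentages, or plain numerals into exact rationals before comparing. The real parser handles far more LaTeX than this illustration:

```python
import re
from fractions import Fraction

def parse_number(text: str):
    # Prefer the contents of a \boxed{...} wrapper if one is present.
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    if m:
        text = m.group(1)
    # \frac{a}{b} -> exact rational.
    m = re.search(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", text)
    if m:
        return Fraction(int(m.group(1)), int(m.group(2)))
    # "50%" -> 1/2.
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*%", text)
    if m:
        return Fraction(m.group(1)) / 100
    # Plain integer or decimal.
    m = re.search(r"-?\d+(?:\.\d+)?", text)
    return Fraction(m.group(0)) if m else None

def math_equiv(answer: str, response: str) -> float:
    a, b = parse_number(answer), parse_number(response)
    return 1.0 if a is not None and a == b else 0.0
```

Comparing as `Fraction` values is what makes `\frac{1}{2}` and `0.5` compare equal.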

NLILabelMatch

Extracts classification labels from responses with alias mapping (e.g., “yes” → “entailment”).
| Score field | Description |
| --- | --- |
| nli_accuracy | 1.0 if the extracted label matches the ground-truth label |
from context_bench.evaluators import NLILabelMatch

ev = NLILabelMatch()
ev.score({"answer": "Entailment"}, {"response": "Yes, this is true."})
# {'nli_accuracy': 1.0}  ("yes" maps to "entailment")
Auto-wired for: contract-nli, scifact.
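The alias mapping can be sketched as a lookup table scanned word by word; the alias table below is illustrative, and the library's actual mapping may be larger:

```python
import re

# Map informal answers to canonical NLI labels (illustrative alias table).
ALIASES = {
    "yes": "entailment",
    "true": "entailment",
    "no": "contradiction",
    "false": "contradiction",
    "maybe": "neutral",
}
LABELS = {"entailment", "contradiction", "neutral"}

def extract_label(response: str):
    """Return the first canonical label, or mapped alias, found in the response."""
    for word in re.findall(r"[a-z]+", response.lower()):
        if word in LABELS:
            return word
        if word in ALIASES:
            return ALIASES[word]
    return None

def nli_accuracy(answer: str, response: str) -> float:
    return 1.0 if extract_label(response) == answer.lower() else 0.0
```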

IFEvalChecker

19 programmatic constraint checks covering keywords, length, format, case, and more.
| Score field | Description |
| --- | --- |
| ifeval_strict | All constraints satisfied (strict mode) |
| ifeval_loose | Majority of constraints satisfied (loose mode) |
from context_bench.evaluators import IFEvalChecker

ev = IFEvalChecker()
ev.score(
    {
        "instruction_id_list": ["punctuation:no_comma", "keywords:existence"],
        "kwargs": [{}, {"keywords": ["hello"]}],
    },
    {"response": "hello world"},
)
# {'ifeval_strict': 1.0, 'ifeval_loose': 1.0}
Auto-wired for: ifeval.
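To make the mechanism concrete, here is an illustrative re-implementation of the two checks named in the example above (the library ships 19 such checks, and its loose mode may differ from the simple majority vote sketched here):

```python
def check_no_comma(response, **kwargs):
    return "," not in response

def check_keywords_exist(response, keywords=(), **kwargs):
    return all(k.lower() in response.lower() for k in keywords)

# Constraint registry keyed by instruction id, as in the example call.
CHECKS = {
    "punctuation:no_comma": check_no_comma,
    "keywords:existence": check_keywords_exist,
}

def ifeval_scores(instruction_ids, kwargs_list, response):
    results = [CHECKS[i](response, **kw) for i, kw in zip(instruction_ids, kwargs_list)]
    strict = 1.0 if all(results) else 0.0
    # "Loose" is sketched here as a simple majority, per the score table above.
    loose = 1.0 if sum(results) > len(results) / 2 else 0.0
    return {"ifeval_strict": strict, "ifeval_loose": loose}
```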

LLMJudge

Uses an external LLM to rate responses on a 1–5 scale, normalized to 0–1.
| Score field | Description |
| --- | --- |
| judge_score | Rating from 1–5 normalized to 0–1 |
from context_bench.evaluators import LLMJudge

judge = LLMJudge(base_url="http://localhost:9090", model="gpt-4")
judge.score(
    {"question": "What is 2+2?", "answer": "4"},
    {"response": "The answer is 4."},
)
# {'judge_score': 0.75}  (rating 4/5 → normalized to 0–1)
Auto-wired for: any dataset when --judge-url is provided.
Enable LLMJudge from the CLI by passing --judge-url http://localhost:9090 --judge-model gpt-4.
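The normalization step can be sketched as a linear rescaling. Note the assumptions: the formula (rating - 1) / 4 is inferred from the 4/5 → 0.75 example above, and the rating extraction here is hypothetical — the library's actual judge prompt and reply parsing may differ:

```python
import re

def normalize_rating(judge_reply: str):
    """Extract a 1-5 rating from the judge's reply and map it linearly to 0-1.

    (rating - 1) / 4 is an assumption consistent with the 4/5 -> 0.75
    example in the docs; the real parser may be more robust than this.
    """
    m = re.search(r"[1-5]", judge_reply)
    if m is None:
        return None
    return (int(m.group(0)) - 1) / 4

normalize_rating("Rating: 4")  # 0.75
```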

Implementing a custom evaluator

Any class with name and score() satisfies the protocol:
class WordCountDiff:
    name = "word-count-diff"

    def score(self, original, processed):
        original_words = len(original.get("context", "").split())
        output_words = len(processed.get("response", "").split())
        ratio = output_words / original_words if original_words else 0.0
        return {"word_count_ratio": ratio}
Pass it alongside built-in evaluators:
result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality(), WordCountDiff()],
    ...
)
All score fields from all evaluators are merged into the same EvalRow.scores dict, so they are all available to metrics.
Evaluator is a typing.Protocol. Implement the methods on any class — no imports from context-bench are required to define a custom evaluator.
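Because the protocol is decorated with runtime_checkable, conformance can also be verified at runtime with isinstance. A self-contained sketch, inlining the protocol definition from the top of this page:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Evaluator(Protocol):
    @property
    def name(self) -> str: ...
    def score(
        self, original: dict[str, Any], processed: dict[str, Any]
    ) -> dict[str, float]: ...

class WordCountDiff:
    name = "word-count-diff"

    def score(self, original, processed):
        original_words = len(original.get("context", "").split())
        output_words = len(processed.get("response", "").split())
        ratio = output_words / original_words if original_words else 0.0
        return {"word_count_ratio": ratio}

# isinstance only checks that `name` and `score` exist, not their signatures.
isinstance(WordCountDiff(), Evaluator)  # True
```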
