AnswerQuality measures how well a model response matches a ground-truth answer using SQuAD-style token overlap metrics. It is the default evaluator and runs on every dataset automatically.
Constructor
AnswerQuality takes no constructor parameters.
score()
Takes two arguments:
- The unmodified example dict. Must contain an "answer" key with the ground-truth string.
- The output dict returned by the system under test. Must contain a "response" key with the model's output string.
Return values
- f1 — token-level F1 between the response and the reference answer, computed as the harmonic mean of token precision and recall after whitespace tokenization and lowercasing.
- exact_match — 1.0 if the response (after normalization) exactly equals the reference answer, otherwise 0.0.
- recall — token-level recall: the fraction of reference tokens that appear in the response.
- contains — 1.0 if the reference answer string appears as a substring of the response (case-insensitive), otherwise 0.0.
If answer is empty, all four scores return 1.0. If response is empty, all four return 0.0.
Example
A response that includes the full reference answer plus extra words gets full marks for containment and recall (contains: 1.0, recall: 1.0), but the extra words lower F1, and the strings are not identical (exact_match: 0.0).
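The behavior above can be sketched end to end. This is a minimal illustration written from the documented behavior, assuming whitespace tokenization and lowercasing; it is not the actual context_bench implementation, and the score keys (f1, exact_match, recall, contains) follow the names used in this page.

```python
def score(example, output):
    """Sketch of an AnswerQuality-style scorer (illustrative only)."""
    reference = example["answer"]
    response = output["response"]
    # Documented edge cases: empty reference -> all 1.0, empty response -> all 0.0.
    if not reference:
        return {"f1": 1.0, "exact_match": 1.0, "recall": 1.0, "contains": 1.0}
    if not response:
        return {"f1": 0.0, "exact_match": 0.0, "recall": 0.0, "contains": 0.0}
    # Whitespace tokenization after lowercasing, as described above.
    resp = response.lower().split()
    ref = reference.lower().split()
    common = sum(min(resp.count(t), ref.count(t)) for t in set(ref))
    precision = common / len(resp)
    recall = common / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return {
        "f1": f1,
        "exact_match": float(response.lower().strip() == reference.lower().strip()),
        "recall": recall,
        "contains": float(reference.lower() in response.lower()),
    }

scores = score({"answer": "Paris"},
               {"response": "The capital of France is Paris"})
# contains: 1.0, recall: 1.0, exact_match: 0.0, f1 lowered (2/7, about 0.29)
```

Note how a response that fully contains the one-token reference keeps recall and contains at 1.0 while the five extra tokens drag precision, and hence F1, down.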
Auto-wired datasets
AnswerQuality is applied to all datasets by default — no configuration required.
Implementation notes
Scoring is delegated to three utility functions from context_bench.metrics.quality:
- f1_score(response, reference) — tokenizes both strings, computes the number of common tokens, then returns 2 * precision * recall / (precision + recall).
- exact_match(response, reference) — normalizes both strings (lowercase, strip punctuation, collapse whitespace) and returns 1.0 on equality.
- recall_score(response, reference) — returns common_tokens / len(reference_tokens).
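The three utility functions could be sketched as follows. This is written purely from the behavior documented above; the real context_bench.metrics.quality module may differ in details such as its tokenizer or normalization rules, and the helper names _tokens, _common, and _normalize are hypothetical.

```python
import re
from collections import Counter


def _tokens(text):
    # Whitespace tokenization after lowercasing (assumed).
    return text.lower().split()


def _common(response, reference):
    # Multiset intersection of tokens, as in SQuAD-style F1.
    overlap = Counter(_tokens(response)) & Counter(_tokens(reference))
    return sum(overlap.values())


def f1_score(response, reference):
    common = _common(response, reference)
    if common == 0:
        return 0.0
    precision = common / len(_tokens(response))
    recall = common / len(_tokens(reference))
    return 2 * precision * recall / (precision + recall)


def _normalize(text):
    # Lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(response, reference):
    return 1.0 if _normalize(response) == _normalize(reference) else 0.0


def recall_score(response, reference):
    return _common(response, reference) / len(_tokens(reference))
```

Using a Counter intersection rather than set membership means repeated reference tokens must each be matched, which is what makes the F1 computation SQuAD-style rather than a plain set overlap.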
