PerQATypeMetric slices scores by the qa_type key in each EvalRow’s metadata and computes the mean score per type. It is designed for memory system evaluations where examples have QA types such as single_hop, multi_hop, temporal, open_domain, and adversarial.
Constructor parameters
Name of the score key to aggregate per QA type. Must match a key emitted by an evaluator.
Return value
compute() returns a dict[str, float] with:
Mean score for each QA type present in the rows. For example, with
score_field="f1" and QA types temporal and multi_hop, the returned dict contains f1_temporal and f1_multi_hop.Overall mean score across all QA types.
Usage
When it is enabled
PerQATypeMetric is automatically used by the context-bench memory CLI subcommand. It reads row.metadata["qa_type"] which is populated by evaluate_memory() from the dataset’s QA pair type field.
Rows where
qa_type is absent in metadata are grouped under the key "unknown".