PerDatasetBreakdown slices the per-example scores by the dataset tag on each
EvalRow and reports mean score per dataset. This is useful when you combine
multiple datasets in a single run and want to see how a system performs on each
one independently.
Constructor parameters

score_field: Name of the score key to average within each dataset bucket.
Must match a key emitted by an evaluator, for example "f1", "mc_accuracy",
or "math_equiv".

Return value
compute() returns a dict[str, float] where each key has the form
"dataset:<name>" and the value is the mean of score_field for all examples
tagged with that dataset name. Keys are sorted alphabetically. Examples
without a dataset tag appear under "dataset:unknown". For example, a run
combining hotpotqa and gsm8k produces the keys "dataset:gsm8k" and
"dataset:hotpotqa".
Usage
When it is enabled
The CLI automatically adds PerDatasetBreakdown when two or more --dataset
flags are provided. You can also add it to the metrics list manually.
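The enablement rule can be sketched as a small check. The function name and
argument are illustrative stand-ins, not the CLI's actual code; only the
"two or more --dataset flags" rule comes from this section.

```python
def should_add_breakdown(dataset_flags):
    """Mirror the documented rule: PerDatasetBreakdown is added
    automatically only when two or more --dataset flags were passed."""
    return len(dataset_flags) >= 2

single = should_add_breakdown(["gsm8k"])             # not added automatically
multi = should_add_breakdown(["gsm8k", "hotpotqa"])  # added automatically
```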
Each example must carry a
"dataset" key for the breakdown to be meaningful.
The CLI tags examples automatically. If you load data manually, set
example["dataset"] = "my-dataset" before passing to evaluate().