PerDatasetBreakdown slices the per-example scores by the dataset tag on each
EvalRow and reports mean score per dataset. This is useful when you combine
multiple datasets in a single run and want to see how a system performs on each
one independently.
Constructor parameters

score_field: Name of the score key to average within each dataset bucket.
Must match a key emitted by an evaluator, for example "f1", "mc_accuracy",
or "math_equiv".

Return value
compute() returns a dict[str, float] where each key has the form
"dataset:<name>" and the value is the mean of score_field for all examples
tagged with that dataset name. Keys are sorted alphabetically. Examples
without a dataset tag appear under "dataset:unknown". For example, a run
combining hotpotqa and gsm8k produces the keys "dataset:gsm8k" and
"dataset:hotpotqa".
Usage
When it is enabled
The CLI automatically adds PerDatasetBreakdown when two or more --dataset
flags are provided. You can also add it to the metrics list manually.
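The enablement rule can be sketched as a small check. The function name and
argument are illustrative stand-ins, not the CLI's actual code; only the
"two or more --dataset flags" rule comes from this section.

```python
def should_add_breakdown(dataset_flags):
    """Mirror the documented rule: PerDatasetBreakdown is added
    automatically only when two or more --dataset flags were passed."""
    return len(dataset_flags) >= 2

single = should_add_breakdown(["gsm8k"])             # not added automatically
multi = should_add_breakdown(["gsm8k", "hotpotqa"])  # added automatically
```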
Each example must carry a
"dataset" key for the breakdown to be meaningful.
The CLI tags examples automatically. If you load data manually, set
example["dataset"] = "my-dataset" before passing to evaluate().