CostOfPass measures efficiency by combining quality and token usage into a
single number: how many output tokens does it take, on average, to produce one
passing response?
This metric is described in arXiv:2504.13359.
Constructor parameters
Pass threshold: minimum score for an example to be counted as a pass. Matches the semantics of PassRate.threshold.

Score field: name of the score key used to determine pass/fail. Must match a key emitted by an evaluator, for example "f1" or "mc_accuracy".

Formula
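cost_of_pass = total output tokens / number of passing examples

When no example passes, the denominator is zero and cost_of_pass is inf.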
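
As a rough illustration of the ratio (total output tokens divided by the number of passing examples, with inf when nothing passes), here is a minimal standalone sketch. It is not the library's implementation. The function name, the record layout (`output_tokens`, `scores`), and the default threshold of 0.5 are assumptions made for the example. A score counts as a pass when it is at least the threshold, following the "minimum score" wording above.

```python
import math


def cost_of_pass(examples, threshold=0.5, score_field="f1"):
    """Illustrative only: total output tokens / number of passing examples.

    `examples` is assumed to be a list of dicts with an "output_tokens" int
    and a "scores" dict; this layout is hypothetical, not the library's.
    """
    # Tokens from every example count, whether it passed or not.
    total_tokens = sum(ex["output_tokens"] for ex in examples)
    # An example passes when its score meets or exceeds the threshold.
    num_passed = sum(
        1 for ex in examples if ex["scores"][score_field] >= threshold
    )
    if num_passed == 0:
        return math.inf
    return total_tokens / num_passed


examples = [
    {"output_tokens": 120, "scores": {"f1": 0.9}},
    {"output_tokens": 80, "scores": {"f1": 0.2}},
    {"output_tokens": 100, "scores": {"f1": 0.7}},
]

print(cost_of_pass(examples, threshold=0.5))   # 300 tokens / 2 passes = 150.0
print(cost_of_pass(examples, threshold=0.95))  # no passes: inf
```

Note that failing responses still add to the numerator, so a model that spends many tokens on wrong answers is penalized even if its passing answers are short.
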
Return value
compute() returns a dict[str, float] with the following keys:
cost_of_pass: total output tokens divided by the number of passing examples. Returns inf when zero examples pass.

Pass count: number of examples that met the pass threshold. Stored as float for consistency with the summary dict type.

Usage
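
The sketch below uses a hypothetical stand-in class to show the shape of the summary that compute() is described as returning: a dict[str, float] with the cost and the pass count stored as a float. It does not reproduce the real CostOfPass class. The class name `CostOfPassSketch`, the `update` method, the constructor argument names, the default threshold, and the `num_passed` key are all assumptions; only the `cost_of_pass` key and the inf behavior come from this page.

```python
import math


class CostOfPassSketch:
    """Hypothetical stand-in, not the library's CostOfPass class."""

    def __init__(self, threshold=0.5, score_field="f1"):
        self.threshold = threshold      # minimum score counted as a pass
        self.score_field = score_field  # which evaluator score key to read
        self.total_tokens = 0
        self.num_passed = 0

    def update(self, scores, output_tokens):
        # Assumed accumulation step: one call per evaluated example.
        self.total_tokens += output_tokens
        if scores[self.score_field] >= self.threshold:
            self.num_passed += 1

    def compute(self):
        # inf when nothing passed, otherwise tokens per passing example.
        if self.num_passed == 0:
            cost = math.inf
        else:
            cost = self.total_tokens / self.num_passed
        # The pass count is cast to float so every value has the same type.
        return {"cost_of_pass": cost, "num_passed": float(self.num_passed)}


metric = CostOfPassSketch(threshold=0.5, score_field="mc_accuracy")
metric.update({"mc_accuracy": 1.0}, output_tokens=200)
metric.update({"mc_accuracy": 0.0}, output_tokens=300)
summary = metric.compute()
print(summary)  # {'cost_of_pass': 500.0, 'num_passed': 1.0}
```
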
When it is enabled
CostOfPass is included in every CLI run by default. The --threshold flag
controls the pass threshold and --score-field controls which field is read.
