MeanScore computes the arithmetic mean of a named score field over every EvalRow in a run. It is the primary quality signal for most benchmarks.
from context_bench.metrics import MeanScore

Constructor parameters

score_field
string
default: "score"
Name of the score key to average. Must match a key emitted by an evaluator’s score() method — for example "f1", "exact_match", or "mc_accuracy".

Return value

compute() returns a dict[str, float] with the following key:
mean_score
float
Mean value of score_field across all rows in the run. For standard evaluators the value lies in the range [0.0, 1.0]. If the row list is empty, compute() returns 0.0.
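The semantics above can be sketched in plain Python. This is an illustrative stand-in for compute(), not the library's actual implementation; the rows argument models a run's EvalRow score dicts as emitted by an evaluator's score() method.

```python
# Hedged sketch of MeanScore.compute() semantics; names are illustrative.
def mean_score(rows: list[dict], score_field: str = "score") -> dict[str, float]:
    if not rows:  # empty run -> 0.0, as documented above
        return {"mean_score": 0.0}
    # A row missing score_field contributes 0.0 to the mean
    total = sum(row.get(score_field, 0.0) for row in rows)
    return {"mean_score": total / len(rows)}
```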

Usage

from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore

result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1")],
)
print(result.summary["my-system"]["mean_score"])  # e.g. 0.742

When it is enabled

MeanScore is included in every CLI run by default. The CLI flag --score-field controls which field it reads (default: f1).
score_field must match a key produced by an evaluator that is also registered for the run. If the field is absent on a row, that row contributes 0.0 to the mean, silently dragging the average down.
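The caveat above is easy to trip over: a mismatched or misspelled score_field does not raise an error, it just zeroes every row's contribution. A minimal plain-Python illustration (not the library's internals):

```python
# Two rows whose evaluator emitted only an "f1" key.
rows = [{"f1": 0.9}, {"f1": 0.8}]

def mean_of(field: str) -> float:
    # Missing fields contribute 0.0, mirroring MeanScore's documented behavior
    return sum(r.get(field, 0.0) for r in rows) / len(rows)

print(round(mean_of("f1"), 2))   # 0.85 - field matches the evaluator's output
print(mean_of("exact_match"))    # 0.0 - mismatched field silently collapses the mean
```

When a run's mean_score is unexpectedly 0.0, checking that score_field matches an emitted key is a good first step.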
