SummarizationQuality measures the quality of generated summaries against reference summaries using ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation — Longest Common Subsequence). It is implemented in pure Python with no external dependencies.
from context_bench.evaluators import SummarizationQuality

Constructor

SummarizationQuality takes no constructor parameters.
ev = SummarizationQuality()

score()

def score(self, original: dict[str, Any], processed: dict[str, Any]) -> dict[str, float]
original (dict, required)
    The unmodified example dict. Must contain an "answer" key with the reference summary string.
processed (dict, required)
    The output dict returned by the system under test. Must contain a "response" key with the generated summary string.

Return values

rouge_l_precision (float)
    Fraction of response tokens that appear in the LCS with the reference: LCS_length / len(response_tokens).
rouge_l_recall (float)
    Fraction of reference tokens covered by the LCS: LCS_length / len(reference_tokens).
rouge_l_f1 (float)
    Harmonic mean of ROUGE-L precision and recall.
If the answer string is empty, all three scores are 1.0. If the response string is empty, all three are 0.0.

Example

from context_bench.evaluators import SummarizationQuality
ev = SummarizationQuality()
ev.score(
    {"answer": "The committee approved the budget."},
    {"response": "The budget was approved by the committee."},
)
# {'rouge_l_precision': 0.429, 'rouge_l_recall': 0.6, 'rouge_l_f1': 0.5}

Auto-wired datasets

SummarizationQuality is automatically applied when any of the following datasets are selected:
CLI name       Dataset
multi-news     Multi-News
dialogsum      DialogSum
qmsum          QMSum
summscreenfd   SummScreenFD
meetingbank    MeetingBank
govreport      GovReport

Implementation notes

The algorithm uses a space-optimized dynamic programming implementation of the longest common subsequence (LCS), keeping only two rows in memory at a time to reduce memory usage on long documents:
  1. Both strings are lowercased and split on whitespace to produce token lists.
  2. The LCS length between the two token lists is computed with the DP algorithm.
  3. Precision, recall, and F1 are derived from the LCS length and the lengths of the two token lists.
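The three steps above can be sketched in pure Python. This is an illustrative reimplementation, not the library's actual source; the function name is an assumption, but the tokenization, two-row DP, and score derivation follow the documented algorithm:

```python
def rouge_l_sketch(prediction: str, reference: str) -> dict[str, float]:
    # Step 1: lowercase and split on whitespace to produce token lists.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Documented edge cases: empty reference -> 1.0, empty prediction -> 0.0.
    if not ref:
        return {"rouge_l_precision": 1.0, "rouge_l_recall": 1.0, "rouge_l_f1": 1.0}
    if not pred:
        return {"rouge_l_precision": 0.0, "rouge_l_recall": 0.0, "rouge_l_f1": 0.0}
    # Step 2: LCS length via dynamic programming, keeping only two rows.
    prev = [0] * (len(ref) + 1)
    for p_tok in pred:
        curr = [0] * (len(ref) + 1)
        for j, r_tok in enumerate(ref, start=1):
            if p_tok == r_tok:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    lcs = prev[-1]
    # Step 3: derive precision, recall, and F1 from the LCS length.
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"rouge_l_precision": precision,
            "rouge_l_recall": recall,
            "rouge_l_f1": f1}
```

Replacing the full len(pred) x len(ref) table with two rows keeps memory linear in the reference length, which matters for long documents such as government reports or meeting transcripts.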
The standalone rouge_l(prediction, reference) function is also importable and returns the same three keys:
from context_bench.evaluators.rouge import rouge_l
rouge_l("the cat sat", "the cat sat on the mat")
# {'rouge_l_precision': 1.0, 'rouge_l_recall': 0.5, 'rouge_l_f1': 0.667}
