SummarizationQuality measures the quality of generated summaries against reference summaries using ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence variant). It is implemented in pure Python with no external dependencies.
Constructor
SummarizationQuality takes no constructor parameters.
score()
- The unmodified example dict. Must contain an "answer" key with the reference summary string.
- The output dict returned by the system under test. Must contain a "response" key with the generated summary string.

Return values
- ROUGE-L precision: fraction of response tokens that appear in the LCS with the reference, computed as LCS_length / len(response_tokens).
- ROUGE-L recall: fraction of reference tokens covered by the LCS, computed as LCS_length / len(reference_tokens).
- ROUGE-L F1: harmonic mean of ROUGE-L precision and recall.

If answer is empty, all three scores are 1.0. If response is empty, all three are 0.0.

Example
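As a rough illustration of the formulas and edge cases described above, the sketch below derives the three scores from an LCS length and the two token counts. The helper name scores_from_lcs and the key names "precision", "recall", and "f1" are hypothetical, chosen for readability; they are not necessarily the names the metric itself returns. The sketch also assumes that the empty-reference case is checked before the empty-response case, and that F1 is 0.0 when there is no overlap at all.

```python
from typing import Dict


def scores_from_lcs(lcs_len: int, n_response: int, n_reference: int) -> Dict[str, float]:
    """Hypothetical helper: derive ROUGE-L scores from an LCS length and token counts."""
    # Empty reference: all three scores are 1.0 (checked first; the order is an assumption).
    if n_reference == 0:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    # Empty response: all three scores are 0.0.
    if n_response == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = lcs_len / n_response   # LCS_length / len(response_tokens)
    recall = lcs_len / n_reference     # LCS_length / len(reference_tokens)

    # Assumption: with no overlap, F1 is defined as 0.0 to avoid dividing by zero.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}


# A 7-token response sharing a 6-token LCS with a 9-token reference.
scores = scores_from_lcs(lcs_len=6, n_response=7, n_reference=9)
print(scores)
```

For these counts, precision is 6/7 (about 0.857), recall is 6/9 (about 0.667), and F1 is approximately 0.75.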
Auto-wired datasets
SummarizationQuality is automatically applied when any of the following datasets are selected:
| CLI name | Dataset |
|---|---|
| multi-news | Multi-News |
| dialogsum | DialogSum |
| qmsum | QMSum |
| summscreenfd | SummScreenFD |
| meetingbank | MeetingBank |
| govreport | GovReport |
Implementation notes
The algorithm uses a space-optimized dynamic programming implementation of the longest common subsequence (LCS), keeping only two rows in memory at a time to reduce memory usage on long documents:
- Both strings are lowercased and split on whitespace to produce token lists.
- The LCS length between the two token lists is computed with the DP algorithm.
- Precision, recall, and F1 are derived from the LCS length and the lengths of the two token lists.
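The first two steps above can be sketched as follows. This is a minimal illustration of a two-row LCS computation, not the library's actual source; the function names tokenize and lcs_length are hypothetical. Precision, recall, and F1 would then follow by dividing the returned length by the token counts of the response and the reference.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, keeping only two DP rows."""
    # prev holds the DP row for the previous token of `a`; curr is the row being filled.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Matching tokens extend the LCS of the two shorter prefixes.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise carry over the better result from dropping one token.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


reference = tokenize("The quick brown fox jumps over the lazy dog")
response = tokenize("the brown fox jumps over a dog")
print(lcs_length(response, reference))  # 6
```

Because each row depends only on the one before it, memory grows with the length of one token list rather than with the product of both lengths.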
The rouge_l(prediction, reference) function is also importable and returns the same three keys.
