compute_metrics
Mathematical definitions

The metrics are computed using standard formulas from benchmark papers.

Accuracy (for classification tasks like SST-2):

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$$

Exact Match (for QA tasks like GSM8K):

$$\text{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$$

where $\hat{y}_i$ are predictions and $y_i$ are references.

Parameters
task_name
Name of the task to evaluate. Supported values:
- "sst-2" or "classification": computes accuracy
- "gsm8k" or "math": computes exact match
- "samsum" or "summarization": computes ROUGE scores

predictions
Model predictions. Must have the same length as references.

references
Ground truth labels or answers. Must have the same length as predictions.
Returns
Container with metric scores keyed by metric name. The specific metrics depend on the task:
- Classification: {"accuracy": float}
- QA/Math: {"exact_match": float}
- Summarization: {"rouge1": float, "rouge2": float, "rougeL": float}
Raises
- ValueError: If predictions and references have different lengths
- NotImplementedError: If task_name is not supported
- ImportError: If evaluating summarization tasks without the evaluate package
Complexity
- Time: O(N) where N is the number of samples
- Space: O(1) for accuracy/exact match, O(N) for ROUGE
Usage
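The actual import path for compute_metrics is not shown in this reference. As a minimal sketch, assuming the signature described above, the classification and exact-match branches could be implemented like this (the real function additionally computes ROUGE for summarization via the evaluate package):

```python
from typing import Dict, List


def compute_metrics(task_name: str, predictions: List, references: List) -> Dict[str, float]:
    # Sketch of the documented behavior; summarization (ROUGE) is omitted here.
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if task_name in ("sst-2", "classification"):
        # Accuracy: fraction of predictions equal to their references.
        correct = sum(p == r for p, r in zip(predictions, references))
        return {"accuracy": correct / len(predictions)}
    if task_name in ("gsm8k", "math"):
        # Exact match: string equality after stripping surrounding whitespace.
        matches = sum(str(p).strip() == str(r).strip() for p, r in zip(predictions, references))
        return {"exact_match": matches / len(predictions)}
    raise NotImplementedError(f"Unsupported task: {task_name}")


# Hypothetical usage:
scores = compute_metrics("sst-2", [1, 0, 1, 1], [1, 0, 0, 1])
print(scores)  # → {'accuracy': 0.75}
```

Note that exact-match normalization varies between benchmarks; whitespace stripping here is an assumption, not necessarily what the library does.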
compute_f1
Mathematical definitions
For each class $c$:

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \quad F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

Macro-averaged F1:

$$\text{Macro-}F1 = \frac{1}{C} \sum_{c=1}^{C} F1_c$$

where $C$ is the number of classes, and $TP_c$, $FP_c$, $FN_c$ are true positives, false positives, and false negatives for class $c$, respectively.

Parameters
predictions
Predicted class labels. Must be integers in range [0, num_classes).

references
Ground truth class labels. Must be integers in range [0, num_classes).

num_classes
Total number of classes in the classification task.
Returns
Dictionary containing:
- macro_f1 (float): Macro-averaged F1 score across all classes
- per_class_f1 (List[float]): F1 score for each class
Raises
- ValueError: If predictions and references have different lengths
Usage
EvaluationResult
Attributes
Name of the evaluated task (e.g., "sst-2", "gsm8k", "samsum").
Dictionary mapping metric names to their computed values. The keys depend on the task:
- Classification tasks: "accuracy"
- QA/Math tasks: "exact_match"
- Summarization tasks: "rouge1", "rouge2", "rougeL"