## QAExactMatch

Measures whether the predicted answer exactly matches any of the gold answers after normalization.

### Usage

### Parameters

- Global configuration object (optional)

### Methods

#### calculate_metric_scores

Calculates the Exact Match (EM) score.

Signature:

- List of lists containing ground truth answers. Each inner list contains multiple acceptable answers for that example.
- List of predicted answers, one per example.
- Function to aggregate scores across multiple gold answers. Defaults to taking the maximum score.

Returns:

- Dict[str, float]: Pooled results with the averaged EM score across all examples
- List[Dict[str, float]]: Per-example results with EM scores
### Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Perfect Score: 1.0 means all predicted answers exactly match at least one gold answer
- Use Case: Best for tasks where exact answer matching is required (e.g., factoid QA)
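The scoring and pooling logic described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation; the function name `exact_match` is made up here, and plain lowercasing stands in for the full normalization described under Answer Normalization:

```python
def exact_match(gold_answers: list[list[str]], predictions: list[str]) -> float:
    """Score 1.0 per example if the prediction matches any gold answer; pool by averaging."""
    normalize = str.lower  # stand-in for the full normalization (articles, punctuation, whitespace)
    per_example = [
        max(float(normalize(pred) == normalize(gold)) for gold in golds)
        for golds, pred in zip(gold_answers, predictions)
    ]
    return sum(per_example) / len(per_example)
```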
## QAF1Score

Measures token-level overlap between predicted and gold answers using the F1 score (the harmonic mean of precision and recall).

### Usage

### Parameters

- Global configuration object (optional)

### Methods

#### calculate_metric_scores

Calculates the F1 score based on token overlap.

Signature:

- List of lists containing ground truth answers.
- List of predicted answers.
- Function to aggregate scores across multiple gold answers.

Returns:

- Dict[str, float]: Pooled results with the averaged F1 score
- List[Dict[str, float]]: Per-example F1 scores
### Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Calculation: F1 = 2 * (precision * recall) / (precision + recall)
- Precision: Fraction of predicted tokens that appear in gold answer
- Recall: Fraction of gold answer tokens that appear in prediction
- Use Case: Better for partial credit when answers are similar but not exact matches
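The precision/recall calculation above can be sketched from scratch for a single prediction/gold pair (an illustrative helper assuming already-normalized strings, not the library's API):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Multiset intersection: a shared token counts once per occurrence in both strings.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # fraction of predicted tokens found in gold
    recall = num_same / len(gold_tokens)     # fraction of gold tokens found in prediction
    return 2 * precision * recall / (precision + recall)
```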
## QABleu1Score

Evaluates answer quality using the BLEU-1 (unigram precision) score.

### Usage

### Parameters

- Global configuration object (optional)

### Methods

#### calculate_metric_scores

Calculates the BLEU-1 score between predicted and gold answers.

Signature:

- List of lists containing ground truth answers.
- List of predicted answers.
- Function to aggregate scores across multiple gold answers.

Returns:

- Dict[str, float]: Pooled results with the averaged BLEU-1 score
- List[Dict[str, float]]: Per-example BLEU-1 scores
### Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Measures: Unigram precision (how many individual words match)
- Use Case: Good for measuring word-level overlap with brevity consideration
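The "brevity consideration" mentioned above refers to BLEU's brevity penalty. A from-scratch sketch of sentence-level BLEU-1 for one prediction/reference pair (illustrative only; the metric class delegates the real computation to the `evaluate` library):

```python
import math
from collections import Counter

def bleu1(prediction: str, reference: str) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens:
        return 0.0
    # Clipped counts: each predicted token is credited at most as often as it appears in the reference.
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    precision = sum(overlap.values()) / len(pred_tokens)
    # Brevity penalty discourages overly short predictions.
    bp = 1.0 if len(pred_tokens) > len(ref_tokens) else math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return bp * precision
```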
Requires the `evaluate` library: `pip install evaluate`

## QABleu4Score

Evaluates answer quality using the BLEU-4 (up to 4-gram precision) score.

### Usage

### Parameters

- Global configuration object (optional)

### Methods

#### calculate_metric_scores

Calculates the BLEU-4 score between predicted and gold answers.

Signature:

- List of lists containing ground truth answers.
- List of predicted answers.
- Function to aggregate scores across multiple gold answers.

Returns:

- Dict[str, float]: Pooled results with the averaged BLEU-4 score
- List[Dict[str, float]]: Per-example BLEU-4 scores
#### calculate_corpus_bleu

Calculates a corpus-level BLEU score (an alternative evaluation method).

Signature:

### Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Measures: N-gram precision up to 4-grams (captures phrase-level similarity)
- Use Case: Better for longer answers where phrase structure matters
- Note: More strict than BLEU-1; requires longer matching sequences
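The strictness noted above follows from BLEU-4's geometric mean: if any n-gram order up to 4 has zero matches, the whole score is zero. A from-scratch sentence-level sketch (illustrative; the metric class uses the `evaluate` library):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(prediction: str, reference: str) -> float:
    """Sentence-level BLEU-4: geometric mean of clipped 1- to 4-gram precisions, times a brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    precisions = []
    for n in range(1, 5):
        pred_counts, ref_counts = Counter(ngrams(pred, n)), Counter(ngrams(ref, n))
        total = sum(pred_counts.values())
        if total == 0:
            return 0.0  # prediction shorter than n tokens: no n-gram to match
        overlap = sum((pred_counts & ref_counts).values())
        if overlap == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        precisions.append(overlap / total)
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```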
Requires the `evaluate` library: `pip install evaluate`

## Common Patterns

### Multiple Gold Answers

All metrics support multiple acceptable answers per example:

### Custom Aggregation
By default, metrics use `np.max` to take the best score across gold answers. You can customize this:
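The effect of the aggregation choice can be illustrated with plain NumPy (a standalone sketch; the keyword used to pass the aggregation function into the metric classes is not shown in this excerpt):

```python
import numpy as np

# Hypothetical per-gold scores of one prediction against three acceptable gold answers.
per_gold_scores = [0.2, 0.9, 0.5]

best = np.max(per_gold_scores)      # default: credit the best-matching gold answer
average = np.mean(per_gold_scores)  # stricter: reward agreement with all gold answers
```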
### Answer Normalization

EM and F1 metrics automatically normalize answers by:

- Converting to lowercase
- Removing articles (a, an, the)
- Removing punctuation
- Removing extra whitespace
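The four steps above match the SQuAD-style answer normalization commonly used for EM and F1. A minimal sketch (the function name is illustrative, not necessarily the library's):

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation, drop articles, and collapse extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())
```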