QA evaluation metrics measure the quality of generated answers against gold-standard answers. These metrics are commonly used to evaluate question-answering systems.

QAExactMatch

Measures whether the predicted answer exactly matches any of the gold answers after normalization.

Usage

from remem.evaluation.qa_eval import QAExactMatch

metric = QAExactMatch()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["Paris", "paris"], ["42"]],
    predicted_answers=["Paris", "forty-two"]
)
print(pooled_results)  # {"ExactMatch": 0.5}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the Exact Match (EM) score. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers. Each inner list contains all acceptable answers for that example.
  • predicted_answers (List[str], required): List of predicted answers, one per example.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers. Defaults to taking the maximum score.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with the averaged EM score across all examples
  • List[Dict[str, float]]: Per-example results with EM scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Perfect Score: 1.0 means all predicted answers exactly match at least one gold answer
  • Use Case: Best for tasks where exact answer matching is required (e.g., factoid QA)
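Under the hood, the scoring loop can be sketched as follows. This is an illustrative sketch rather than remem's actual implementation, and the full answer normalization is simplified to lowercasing:

```python
import numpy as np

def exact_match_scores(gold_answers, predicted_answers, aggregation_fn=np.max):
    """Sketch of EM scoring; lowercasing stands in for full normalization."""
    per_example = []
    for golds, pred in zip(gold_answers, predicted_answers):
        # 1.0 if the prediction matches any acceptable gold answer, else 0.0
        scores = [float(pred.lower() == gold.lower()) for gold in golds]
        per_example.append({"ExactMatch": float(aggregation_fn(scores))})
    # Pooled score is the mean over all examples
    pooled = {"ExactMatch": float(np.mean([r["ExactMatch"] for r in per_example]))}
    return pooled, per_example

pooled, per_example = exact_match_scores(
    [["Paris", "paris"], ["42"]], ["Paris", "forty-two"]
)
print(pooled)  # {'ExactMatch': 0.5}
```

The `aggregation_fn` hook is what lets a prediction count as correct when it matches any one of several acceptable gold answers.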

QAF1Score

Measures token-level overlap between predicted and gold answers using F1 score (harmonic mean of precision and recall).

Usage

from remem.evaluation.qa_eval import QAF1Score

metric = QAF1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The capital of France is Paris"]],
    predicted_answers=["Paris is the capital"]
)
print(pooled_results)  # {"F1": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the F1 score based on token overlap. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged F1 score
  • List[Dict[str, float]]: Per-example F1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Calculation: F1 = 2 * (precision * recall) / (precision + recall)
    • Precision: Fraction of predicted tokens that appear in gold answer
    • Recall: Fraction of gold answer tokens that appear in prediction
  • Use Case: Better for partial credit when answers are similar but not exact matches
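The per-pair F1 computation can be sketched as follows (normalization is omitted for brevity; the real metric normalizes both strings before tokenizing):

```python
from collections import Counter

def token_f1(gold: str, pred: str) -> float:
    """Token-level F1 between one gold answer and one prediction."""
    gold_tokens = gold.split()
    pred_tokens = pred.split()
    # Multiset intersection: a token counts once per shared occurrence
    common = Counter(gold_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("barack obama", "obama"), 3))  # 0.667
```

Here the single predicted token is fully contained in the gold answer (precision 1.0) but covers only half of it (recall 0.5), giving F1 = 2/3.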

QABleu1Score

Evaluates answer quality using BLEU-1 (unigram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu1Score

metric = QABleu1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is in Paris"]],
    predicted_answers=["The tower is in Paris"]
)
print(pooled_results)  # {"BLEU-1": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the BLEU-1 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-1 score
  • List[Dict[str, float]]: Per-example BLEU-1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Measures: Unigram precision (how many individual words match)
  • Use Case: Good for measuring word-level overlap; BLEU's brevity penalty additionally penalizes predictions shorter than the reference
Requires the evaluate library: pip install evaluate

QABleu4Score

Evaluates answer quality using BLEU-4 (up to 4-gram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu4Score

metric = QABleu4Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is located in Paris, France"]],
    predicted_answers=["The Eiffel Tower is in Paris"]
)
print(pooled_results)  # {"BLEU-4": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the BLEU-4 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-4 score
  • List[Dict[str, float]]: Per-example BLEU-4 scores

calculate_corpus_bleu

Calculates the corpus-level BLEU score (an alternative evaluation method). Signature:
def calculate_corpus_bleu(
    gold_answers: List[List[str]],
    predicted_answers: List[str]
) -> Dict[str, float]
Returns: Dictionary containing the corpus BLEU score computed over the entire corpus rather than averaging individual sentence-level scores.
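The distinction between the two methods can be illustrated with unigram precision alone (brevity penalty and higher-order n-grams omitted; this is an illustrative sketch, not the library's implementation):

```python
from collections import Counter

def clipped_matches(ref_tokens, pred_tokens):
    """Unigram matches in the prediction, clipped by reference counts."""
    common = Counter(ref_tokens) & Counter(pred_tokens)
    return sum(common.values())

refs = [["the", "cat", "sat"], ["a", "dog", "ran", "fast"]]
preds = [["the", "cat"], ["a", "dog", "ran", "far"]]

# Sentence-level: score each pair independently, then average
sentence_scores = [clipped_matches(r, p) / len(p) for r, p in zip(refs, preds)]
avg_sentence = sum(sentence_scores) / len(sentence_scores)  # (1.0 + 0.75) / 2 = 0.875

# Corpus-level: pool match and token counts over the whole corpus first
total_matches = sum(clipped_matches(r, p) for r, p in zip(refs, preds))
corpus_precision = total_matches / sum(len(p) for p in preds)  # 5 / 6 ≈ 0.833
```

Corpus-level BLEU weights each prediction by its length, so long and short answers contribute proportionally rather than equally, whereas per-example averaging gives every example the same weight.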

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Measures: N-gram precision up to 4-grams (captures phrase-level similarity)
  • Use Case: Better for longer answers where phrase structure matters
  • Note: Stricter than BLEU-1; requires longer matching sequences
Requires the evaluate library: pip install evaluate

Common Patterns

Multiple Gold Answers

All metrics support multiple acceptable answers per example:
gold_answers = [
    ["Paris", "paris", "PARIS"],  # Multiple acceptable forms
    ["42", "forty-two", "forty two"]  # Different representations
]
predicted_answers = ["paris", "42"]

em_metric = QAExactMatch()
results, _ = em_metric.calculate_metric_scores(gold_answers, predicted_answers)

Custom Aggregation

By default, metrics use np.max to take the best score across gold answers. You can customize this:
import numpy as np

# Use mean instead of max
metric = QAF1Score()
results, _ = metric.calculate_metric_scores(
    gold_answers=[["answer1", "answer2"]],
    predicted_answers=["answer1"],
    aggregation_fn=np.mean  # Average across gold answers
)

Answer Normalization

EM and F1 metrics automatically normalize answers by:
  • Converting to lowercase
  • Removing articles (a, an, the)
  • Removing punctuation
  • Removing extra whitespace
This ensures fair comparison across formatting variations.
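The steps above can be sketched as a SQuAD-style normalization function (an illustrative sketch; remem's exact implementation may differ):

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()                                                # lowercase
    text = "".join(ch for ch in text if ch not in string.punctuation)  # punctuation
    text = re.sub(r"\b(a|an|the)\b", " ", text)                        # articles
    return " ".join(text.split())                                      # whitespace

print(normalize_answer("The Eiffel Tower, in Paris!"))  # eiffel tower in paris
```

With this normalization, "The Eiffel Tower." and "eiffel tower" compare as equal, which is exactly the forgiveness EM and F1 rely on.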
