QA evaluation metrics measure the quality of generated answers against gold-standard answers. These metrics are commonly used to evaluate question-answering systems.

QAExactMatch

Measures whether the predicted answer exactly matches any of the gold answers after normalization.

Usage

from remem.evaluation.qa_eval import QAExactMatch

metric = QAExactMatch()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["Paris", "paris"], ["42"]],
    predicted_answers=["Paris", "forty-two"]
)
print(pooled_results)  # {"ExactMatch": 0.5}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the Exact Match (EM) score. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers. Each inner list contains all acceptable answers for that example.
  • predicted_answers (List[str], required): List of predicted answers, one per example.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers. Defaults to taking the maximum score.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with the averaged EM score across all examples
  • List[Dict[str, float]]: Per-example results with EM scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Perfect Score: 1.0 means all predicted answers exactly match at least one gold answer
  • Use Case: Best for tasks where exact answer matching is required (e.g., factoid QA)
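Under the hood, the scoring loop can be sketched as follows. This is an illustrative sketch rather than remem's actual implementation, and the full answer normalization is simplified to lowercasing:

```python
import numpy as np

def exact_match_scores(gold_answers, predicted_answers, aggregation_fn=np.max):
    """Sketch of EM scoring; lowercasing stands in for full normalization."""
    per_example = []
    for golds, pred in zip(gold_answers, predicted_answers):
        # 1.0 if the prediction matches any acceptable gold answer, else 0.0
        scores = [float(pred.lower() == gold.lower()) for gold in golds]
        per_example.append({"ExactMatch": float(aggregation_fn(scores))})
    # Pooled score is the mean over all examples
    pooled = {"ExactMatch": float(np.mean([r["ExactMatch"] for r in per_example]))}
    return pooled, per_example

pooled, per_example = exact_match_scores(
    [["Paris", "paris"], ["42"]], ["Paris", "forty-two"]
)
print(pooled)  # {'ExactMatch': 0.5}
```

The `aggregation_fn` hook is what lets a prediction count as correct when it matches any one of several acceptable gold answers.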

QAF1Score

Measures token-level overlap between predicted and gold answers using F1 score (harmonic mean of precision and recall).

Usage

from remem.evaluation.qa_eval import QAF1Score

metric = QAF1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The capital of France is Paris"]],
    predicted_answers=["Paris is the capital"]
)
print(pooled_results)  # {"F1": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the F1 score based on token overlap. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged F1 score
  • List[Dict[str, float]]: Per-example F1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Calculation: F1 = 2 * (precision * recall) / (precision + recall)
    • Precision: Fraction of predicted tokens that appear in gold answer
    • Recall: Fraction of gold answer tokens that appear in prediction
  • Use Case: Better for partial credit when answers are similar but not exact matches
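The per-pair F1 computation can be sketched as follows (normalization is omitted for brevity; the real metric normalizes both strings before tokenizing):

```python
from collections import Counter

def token_f1(gold: str, pred: str) -> float:
    """Token-level F1 between one gold answer and one prediction."""
    gold_tokens = gold.split()
    pred_tokens = pred.split()
    # Multiset intersection: a token counts once per shared occurrence
    common = Counter(gold_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("barack obama", "obama"), 3))  # 0.667
```

Here the single predicted token is fully contained in the gold answer (precision 1.0) but covers only half of it (recall 0.5), giving F1 = 2/3.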

QABleu1Score

Evaluates answer quality using BLEU-1 (unigram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu1Score

metric = QABleu1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is in Paris"]],
    predicted_answers=["The tower is in Paris"]
)
print(pooled_results)  # {"BLEU-1": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the BLEU-1 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-1 score
  • List[Dict[str, float]]: Per-example BLEU-1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Measures: Unigram precision (how many individual words match)
  • Use Case: Good for measuring word-level overlap; BLEU's brevity penalty additionally penalizes predictions shorter than the reference
Requires the evaluate library: pip install evaluate

QABleu4Score

Evaluates answer quality using BLEU-4 (up to 4-gram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu4Score

metric = QABleu4Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is located in Paris, France"]],
    predicted_answers=["The Eiffel Tower is in Paris"]
)
print(pooled_results)  # {"BLEU-4": ...}

Parameters

  • global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates the BLEU-4 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
  • gold_answers (List[List[str]], required): List of lists containing ground-truth answers.
  • predicted_answers (List[str], required): List of predicted answers.
  • aggregation_fn (Callable, default: np.max): Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-4 score
  • List[Dict[str, float]]: Per-example BLEU-4 scores

calculate_corpus_bleu

Calculates the corpus-level BLEU score (an alternative evaluation method). Signature:
def calculate_corpus_bleu(
    gold_answers: List[List[str]],
    predicted_answers: List[str]
) -> Dict[str, float]
Returns: Dictionary containing the corpus BLEU score computed over the entire corpus rather than averaging individual sentence-level scores.
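The distinction between the two methods can be illustrated with unigram precision alone (brevity penalty and higher-order n-grams omitted; this is an illustrative sketch, not the library's implementation):

```python
from collections import Counter

def clipped_matches(ref_tokens, pred_tokens):
    """Unigram matches in the prediction, clipped by reference counts."""
    common = Counter(ref_tokens) & Counter(pred_tokens)
    return sum(common.values())

refs = [["the", "cat", "sat"], ["a", "dog", "ran", "fast"]]
preds = [["the", "cat"], ["a", "dog", "ran", "far"]]

# Sentence-level: score each pair independently, then average
sentence_scores = [clipped_matches(r, p) / len(p) for r, p in zip(refs, preds)]
avg_sentence = sum(sentence_scores) / len(sentence_scores)  # (1.0 + 0.75) / 2 = 0.875

# Corpus-level: pool match and token counts over the whole corpus first
total_matches = sum(clipped_matches(r, p) for r, p in zip(refs, preds))
corpus_precision = total_matches / sum(len(p) for p in preds)  # 5 / 6 ≈ 0.833
```

Corpus-level BLEU weights each prediction by its length, so long and short answers contribute proportionally rather than equally, whereas per-example averaging gives every example the same weight.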

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Measures: N-gram precision up to 4-grams (captures phrase-level similarity)
  • Use Case: Better for longer answers where phrase structure matters
  • Note: Stricter than BLEU-1; requires longer matching sequences
Requires the evaluate library: pip install evaluate

Common Patterns

Multiple Gold Answers

All metrics support multiple acceptable answers per example:
gold_answers = [
    ["Paris", "paris", "PARIS"],  # Multiple acceptable forms
    ["42", "forty-two", "forty two"]  # Different representations
]
predicted_answers = ["paris", "42"]

em_metric = QAExactMatch()
results, _ = em_metric.calculate_metric_scores(gold_answers, predicted_answers)

Custom Aggregation

By default, metrics use np.max to take the best score across gold answers. You can customize this:
import numpy as np

# Use mean instead of max
metric = QAF1Score()
results, _ = metric.calculate_metric_scores(
    gold_answers=[["answer1", "answer2"]],
    predicted_answers=["answer1"],
    aggregation_fn=np.mean  # Average across gold answers
)

Answer Normalization

EM and F1 metrics automatically normalize answers by:
  • Converting to lowercase
  • Removing articles (a, an, the)
  • Removing punctuation
  • Removing extra whitespace
This ensures fair comparison across formatting variations.
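The steps above can be sketched as a SQuAD-style normalization function (an illustrative sketch; remem's exact implementation may differ):

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()                                                # lowercase
    text = "".join(ch for ch in text if ch not in string.punctuation)  # punctuation
    text = re.sub(r"\b(a|an|the)\b", " ", text)                        # articles
    return " ".join(text.split())                                      # whitespace

print(normalize_answer("The Eiffel Tower, in Paris!"))  # eiffel tower in paris
```

With this normalization, "The Eiffel Tower." and "eiffel tower" compare as equal, which is exactly the forgiveness EM and F1 rely on.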
