Retrieval evaluation metrics measure how well a retrieval system ranks and returns relevant documents. These metrics are essential for evaluating RAG (Retrieval-Augmented Generation) systems.

RetrievalRecall

Measures the proportion of relevant documents that appear in the top-k retrieved results.

Usage

from remem.evaluation.retrieval_eval import RetrievalRecall

metric = RetrievalRecall()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc3", "doc2"], ["doc4", "doc3"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"Recall@1": 0.5, "Recall@5": 1.0, "Recall@10": 1.0}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates Recall@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant documents) for each query. Each inner list contains the document IDs or content that should be retrieved.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved documents for each query, in ranked order (most relevant first).
k_list (List[int], default [1, 5, 10, 20]): List of k values to calculate Recall@k for. Results are computed for each k value.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged Recall@k across all examples
  • List[Dict[str, float]]: Per-example Recall@k scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Formula: Recall@k = (# relevant docs in top-k) / (# total relevant docs)
  • Use Case: Measures the system’s ability to retrieve all relevant documents within the top-k results
  • Note: A score of 1.0 means all relevant documents were found in the top-k
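The formula above can be sketched in a few lines. This is a minimal reimplementation for illustration only; it uses exact string matching, whereas the library resolves chunk-to-document matches with fuzzier logic (see Document Matching below):

```python
def recall_at_k(gold_docs, retrieved, k):
    """Recall@k = (# relevant docs in top-k) / (# total relevant docs)."""
    if not gold_docs:
        return 0.0
    top_k = retrieved[:k]
    hits = sum(1 for doc in gold_docs if doc in top_k)
    return hits / len(gold_docs)

# First query from the usage example above:
print(recall_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=1))  # 0.5
print(recall_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=5))  # 1.0
```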

Example Output

pooled_results = {
    "Recall@1": 0.3333,
    "Recall@5": 0.6667,
    "Recall@10": 0.8333
}

example_results = [
    {"Recall@1": 0.5, "Recall@5": 1.0, "Recall@10": 1.0},
    {"Recall@1": 0.0, "Recall@5": 0.5, "Recall@10": 1.0}
]

RetrievalRecallAll

Measures whether ALL relevant documents are retrieved in the top-k results (strict binary metric).

Usage

from remem.evaluation.retrieval_eval import RetrievalRecallAll

metric = RetrievalRecallAll()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc3", "doc2"], ["doc3", "doc4"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"Recall_all@1": 0.0, "Recall_all@5": 1.0, "Recall_all@10": 1.0}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates Recall_all@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant documents) for each query.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved document chunks for each query, in ranked order.
k_list (List[int], default [1, 5, 10, 20]): List of k values to calculate Recall_all@k for.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged Recall_all@k across all examples
  • List[Dict[str, float]]: Per-example Recall_all@k scores

Interpretation

  • Score Range: 0.0 or 1.0 per example (binary)
  • Higher is Better: Yes
  • Formula: Recall_all@k = 1.0 if all relevant docs are in top-k, else 0.0
  • Use Case: Strict metric for tasks requiring complete retrieval of all relevant documents
  • Difference from RetrievalRecall: This metric gives 0.0 if even one relevant document is missing from top-k
  • Chunk Matching: Uses fuzzy matching to determine if a retrieved chunk originates from a gold document
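The all-or-nothing rule reduces to a single condition. A minimal sketch with exact string matching standing in for the library's fuzzy chunk matching:

```python
def recall_all_at_k(gold_docs, retrieved, k):
    """1.0 only if every relevant doc appears in the top-k, else 0.0."""
    top_k = set(retrieved[:k])
    return 1.0 if all(doc in top_k for doc in gold_docs) else 0.0

# First query from the usage example above: doc2 sits at rank 3,
# so the strict metric fails at k=1 but passes at k=5.
print(recall_all_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=1))  # 0.0
print(recall_all_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=5))  # 1.0
```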

Example Output

# Example 1: Has 2 relevant docs, both found in top-5 → 1.0
# Example 2: Has 2 relevant docs, only 1 found in top-5 → 0.0
pooled_results = {
    "Recall_all@1": 0.0,
    "Recall_all@5": 0.5,  # 50% of examples have all docs in top-5
    "Recall_all@10": 1.0   # All examples have all docs in top-10
}

RetrievalNDCGAny

Measures ranking quality using Normalized Discounted Cumulative Gain (NDCG), considering the position of relevant documents.

Usage

from remem.evaluation.retrieval_eval import RetrievalNDCGAny

metric = RetrievalNDCGAny()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc2", "doc4"], ["doc3", "doc5"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"NDCG_any@1": 0.75, "NDCG_any@5": 0.85, "NDCG_any@10": 0.87}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates NDCG@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant document identifiers) for each query.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved document identifiers for each query, in ranked order.
k_list (List[int], default [1, 5, 10, 20]): List of cutoff ranks (k values) to calculate NDCG@k for.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged NDCG@k across all examples
  • List[Dict[str, float]]: Per-example NDCG@k scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Formula: NDCG@k = DCG@k / IDCG@k
    • DCG (Discounted Cumulative Gain): Sum of relevances weighted by logarithmic position discount
    • IDCG (Ideal DCG): DCG of the perfect ranking
  • Use Case: Measures both retrieval quality AND ranking quality
  • Key Insight: Rewards systems that rank relevant documents higher in the results
  • Position Matters: Finding a relevant document at position 1 scores higher than at position 10

How NDCG Works

  1. Relevance Assignment: Each retrieved document gets a relevance score (1 if matches any gold doc, 0 otherwise)
  2. DCG Calculation: DCG = rel[0] + Σ(rel[i] / log2(i+1)) for i=1 to k-1 (0-indexed; this is the classic formulation, so the first two positions carry equal, undiscounted weight)
  3. Ideal DCG: Sort all relevances in descending order and compute DCG
  4. Normalization: NDCG = DCG / IDCG
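The four steps above can be sketched as follows. This is an illustrative reimplementation of the stated formula, with exact string matching standing in for the library's fuzzy chunk matching:

```python
import math

def ndcg_at_k(gold_docs, retrieved, k):
    # 1. Relevance: 1 if the retrieved doc matches any gold doc, else 0
    rels = [1 if doc in gold_docs else 0 for doc in retrieved[:k]]

    # 2. DCG with the classic discount: rel[0] plus rel[i]/log2(i+1) for i >= 1
    def dcg(scores):
        return sum(r if i == 0 else r / math.log2(i + 1)
                   for i, r in enumerate(scores))

    # 3. Ideal DCG: the same relevances sorted best-first
    ideal = sorted(rels, reverse=True)

    # 4. Normalize (0.0 when there is nothing relevant to rank)
    idcg = dcg(ideal)
    return dcg(rels) / idcg if idcg > 0 else 0.0

# Relevant docs ranked first -> perfect score
print(ndcg_at_k(["doc1", "doc2"], ["doc1", "doc2", "doc4"], 5))      # 1.0
# Only relevant doc at rank 3 -> discounted by 1/log2(3)
print(round(ndcg_at_k(["doc1"], ["doc4", "doc5", "doc1"], 3), 4))    # 0.6309
```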

Example Output

# Example with perfect ranking (relevant docs at top)
example_results = [
    {"NDCG_any@1": 1.0, "NDCG_any@5": 1.0, "NDCG_any@10": 1.0},
    {"NDCG_any@1": 0.0, "NDCG_any@5": 0.43, "NDCG_any@10": 0.56}
]

pooled_results = {
    "NDCG_any@1": 0.5,
    "NDCG_any@5": 0.715,
    "NDCG_any@10": 0.78
}

Comparison of Metrics

| Metric | What It Measures | Best For | Considers Ranking |
| --- | --- | --- | --- |
| RetrievalRecall | % of relevant docs in top-k | Coverage of relevant documents | No |
| RetrievalRecallAll | Whether ALL relevant docs are in top-k | Tasks requiring complete retrieval | No |
| RetrievalNDCGAny | Ranking quality of relevant docs | Overall retrieval quality with ranking | Yes |

When to Use Each Metric

  • Use RetrievalRecall when you want to know what fraction of relevant documents are being found
  • Use RetrievalRecallAll when you need ALL relevant documents (e.g., critical information retrieval)
  • Use RetrievalNDCGAny when ranking matters and you want to reward systems that place relevant documents higher
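To see how the three metrics diverge, it helps to evaluate one query by hand from the formulas above (this computes the values directly rather than calling the library):

```python
import math

gold = ["doc1", "doc2"]
retrieved = ["doc3", "doc4", "doc1", "doc2"]  # relevant docs at ranks 3 and 4
k = 3
top_k = retrieved[:k]

# RetrievalRecall: fraction of gold docs found in the top-k
recall = sum(d in top_k for d in gold) / len(gold)

# RetrievalRecallAll: 1.0 only if every gold doc is in the top-k
recall_all = 1.0 if all(d in top_k for d in gold) else 0.0

# RetrievalNDCGAny: position-aware, using the classic log discount
rels = [1 if d in gold else 0 for d in top_k]            # [0, 0, 1]
dcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(rels))
ideal = sorted(rels, reverse=True)                       # [1, 0, 0]
idcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(ideal))
ndcg = dcg / idcg if idcg else 0.0

print(recall, recall_all, round(ndcg, 3))  # 0.5 0.0 0.631
```

The same retrieval run scores 0.5 on coverage, fails the strict all-docs check, and earns partial NDCG credit because the one found doc sits at rank 3 rather than rank 1.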

Common Patterns

Evaluating at Multiple K Values

All metrics support evaluating at multiple cutoff values simultaneously:
metric = RetrievalRecall()
results, _ = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"]],
    retrieved_chunks=[["doc1", "doc5", "doc2", "doc3"]],
    k_list=[1, 2, 5, 10, 20]  # Evaluate at multiple k values
)

print(results)
# {
#   "Recall@1": 0.5,  # Found 1 of 2 docs
#   "Recall@2": 0.5,  # Still 1 of 2
#   "Recall@5": 1.0,  # Found both docs
#   "Recall@10": 1.0,
#   "Recall@20": 1.0
# }

Document Matching

The metrics use intelligent chunk matching via is_chunk_from_original() which:
  • Handles date prefixes (strips “Date: ” lines)
  • Handles conversation format (“user: ” / “assistant: ” prefixes)
  • Performs exact string matching or JSON-escaped matching
  • Allows fuzzy matching between chunks and original documents
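The exact logic lives in is_chunk_from_original(); the following is only a simplified sketch of the prefix-stripping idea (the function names here are illustrative, not the library's API):

```python
def normalize(text: str) -> str:
    """Strip date lines and conversation-role prefixes before comparing (simplified)."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Date:"):
            continue  # drop "Date: ..." prefix lines entirely
        for prefix in ("user: ", "assistant: "):
            if stripped.startswith(prefix):
                stripped = stripped[len(prefix):]
        lines.append(stripped)
    return "\n".join(lines)

def chunk_matches(chunk: str, original: str) -> bool:
    """True if the normalized chunk appears within the normalized original."""
    return normalize(chunk) in normalize(original)

print(chunk_matches("user: hello world",
                    "Date: 2024-01-01\nuser: hello world"))  # True
```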

Per-Example vs Pooled Results

All metrics return both aggregated and per-example results:
pooled_results, example_results = metric.calculate_metric_scores(...)

# Pooled: averaged across all examples
print(pooled_results)
# {"Recall@5": 0.75}

# Per-example: scores for each individual query
print(example_results)
# [
#   {"Recall@5": 1.0},
#   {"Recall@5": 0.5}
# ]
This allows you to:
  • Report overall system performance (pooled)
  • Identify problematic queries (per-example)
  • Analyze performance distribution
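For instance, the per-example list makes it straightforward to surface the worst-performing queries (scores below are made up for illustration):

```python
example_results = [
    {"Recall@5": 1.0},
    {"Recall@5": 0.2},
    {"Recall@5": 0.5},
]

# Rank queries by Recall@5, worst first, keeping their original indices
worst = sorted(enumerate(example_results), key=lambda p: p[1]["Recall@5"])[:2]
for idx, scores in worst:
    print(f"query {idx}: Recall@5 = {scores['Recall@5']}")
# query 1: Recall@5 = 0.2
# query 2: Recall@5 = 0.5
```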
