Retrieval evaluation metrics measure how well a retrieval system ranks and returns relevant documents. These metrics are essential for evaluating RAG (Retrieval-Augmented Generation) systems.

RetrievalRecall

Measures the proportion of relevant documents that appear in the top-k retrieved results.

Usage

from remem.evaluation.retrieval_eval import RetrievalRecall

metric = RetrievalRecall()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc3", "doc2"], ["doc4", "doc3"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"Recall@1": 0.5, "Recall@5": 1.0, "Recall@10": 1.0}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates Recall@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant documents) for each query. Each inner list contains the document IDs or content that should be retrieved.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved documents for each query, in ranked order (most relevant first).
k_list (List[int], default [1, 5, 10, 20]): List of k values to calculate Recall@k for. Results are computed for each k value.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged Recall@k across all examples
  • List[Dict[str, float]]: Per-example Recall@k scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Formula: Recall@k = (# relevant docs in top-k) / (# total relevant docs)
  • Use Case: Measures the system’s ability to retrieve all relevant documents within the top-k results
  • Note: A score of 1.0 means all relevant documents were found in the top-k
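The formula above can be sketched in a few lines. This is a minimal reimplementation for illustration only; it uses exact string matching, whereas the library resolves chunk-to-document matches with fuzzier logic (see Document Matching below):

```python
def recall_at_k(gold_docs, retrieved, k):
    """Recall@k = (# relevant docs in top-k) / (# total relevant docs)."""
    if not gold_docs:
        return 0.0
    top_k = retrieved[:k]
    hits = sum(1 for doc in gold_docs if doc in top_k)
    return hits / len(gold_docs)

# First query from the usage example above:
print(recall_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=1))  # 0.5
print(recall_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=5))  # 1.0
```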

Example Output

pooled_results = {
    "Recall@1": 0.3333,
    "Recall@5": 0.6667,
    "Recall@10": 0.8333
}

example_results = [
    {"Recall@1": 0.5, "Recall@5": 1.0, "Recall@10": 1.0},
    {"Recall@1": 0.0, "Recall@5": 0.5, "Recall@10": 1.0}
]

RetrievalRecallAll

Measures whether ALL relevant documents are retrieved in the top-k results (strict binary metric).

Usage

from remem.evaluation.retrieval_eval import RetrievalRecallAll

metric = RetrievalRecallAll()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc3", "doc2"], ["doc3", "doc4"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"Recall_all@1": 0.0, "Recall_all@5": 1.0, "Recall_all@10": 1.0}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates Recall_all@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant documents) for each query.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved document chunks for each query, in ranked order.
k_list (List[int], default [1, 5, 10, 20]): List of k values to calculate Recall_all@k for.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged Recall_all@k across all examples
  • List[Dict[str, float]]: Per-example Recall_all@k scores

Interpretation

  • Score Range: 0.0 or 1.0 per example (binary)
  • Higher is Better: Yes
  • Formula: Recall_all@k = 1.0 if all relevant docs are in top-k, else 0.0
  • Use Case: Strict metric for tasks requiring complete retrieval of all relevant documents
  • Difference from RetrievalRecall: This metric gives 0.0 if even one relevant document is missing from top-k
  • Chunk Matching: Uses fuzzy matching to determine if a retrieved chunk originates from a gold document
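The all-or-nothing rule reduces to a single condition. A minimal sketch with exact string matching standing in for the library's fuzzy chunk matching:

```python
def recall_all_at_k(gold_docs, retrieved, k):
    """1.0 only if every relevant doc appears in the top-k, else 0.0."""
    top_k = set(retrieved[:k])
    return 1.0 if all(doc in top_k for doc in gold_docs) else 0.0

# First query from the usage example above: doc2 sits at rank 3,
# so the strict metric fails at k=1 but passes at k=5.
print(recall_all_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=1))  # 0.0
print(recall_all_at_k(["doc1", "doc2"], ["doc1", "doc3", "doc2"], k=5))  # 1.0
```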

Example Output

# Example 1: Has 2 relevant docs, both found in top-5 → 1.0
# Example 2: Has 2 relevant docs, only 1 found in top-5 → 0.0
pooled_results = {
    "Recall_all@1": 0.0,
    "Recall_all@5": 0.5,  # 50% of examples have all docs in top-5
    "Recall_all@10": 1.0   # All examples have all docs in top-10
}

RetrievalNDCGAny

Measures ranking quality using Normalized Discounted Cumulative Gain (NDCG), considering the position of relevant documents.

Usage

from remem.evaluation.retrieval_eval import RetrievalNDCGAny

metric = RetrievalNDCGAny()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"], ["doc3"]],
    retrieved_chunks=[["doc1", "doc2", "doc4"], ["doc3", "doc5"]],
    k_list=[1, 5, 10]
)
print(pooled_results)
# {"NDCG_any@1": 0.75, "NDCG_any@5": 0.85, "NDCG_any@10": 0.87}

Parameters

global_config (Optional[BaseConfig], optional): Global configuration object.

Methods

calculate_metric_scores

Calculates NDCG@k for each example and pools results across all queries. Signature:
def calculate_metric_scores(
    gold_docs: List[List[str]],
    retrieved_chunks: List[List[str]],
    k_list: List[int] = [1, 5, 10, 20]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_docs (List[List[str]], required): List of lists containing the ground truth (relevant document identifiers) for each query.
retrieved_chunks (List[List[str]], required): List of lists containing the retrieved document identifiers for each query, in ranked order.
k_list (List[int], default [1, 5, 10, 20]): List of cutoff ranks (k values) to calculate NDCG@k for.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged NDCG@k across all examples
  • List[Dict[str, float]]: Per-example NDCG@k scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Formula: NDCG@k = DCG@k / IDCG@k
    • DCG (Discounted Cumulative Gain): Sum of relevances weighted by logarithmic position discount
    • IDCG (Ideal DCG): DCG of the perfect ranking
  • Use Case: Measures both retrieval quality AND ranking quality
  • Key Insight: Rewards systems that rank relevant documents higher in the results
  • Position Matters: Finding a relevant document at position 1 scores higher than at position 10

How NDCG Works

  1. Relevance Assignment: Each retrieved document gets a relevance score (1 if matches any gold doc, 0 otherwise)
  2. DCG Calculation: DCG = rel[0] + Σ(rel[i] / log2(i+1)) for i=1 to k-1 (0-indexed; this is the classic formulation, so the first two positions carry equal, undiscounted weight)
  3. Ideal DCG: Sort all relevances in descending order and compute DCG
  4. Normalization: NDCG = DCG / IDCG
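The four steps above can be sketched as follows. This is an illustrative reimplementation of the stated formula, with exact string matching standing in for the library's fuzzy chunk matching:

```python
import math

def ndcg_at_k(gold_docs, retrieved, k):
    # 1. Relevance: 1 if the retrieved doc matches any gold doc, else 0
    rels = [1 if doc in gold_docs else 0 for doc in retrieved[:k]]

    # 2. DCG with the classic discount: rel[0] plus rel[i]/log2(i+1) for i >= 1
    def dcg(scores):
        return sum(r if i == 0 else r / math.log2(i + 1)
                   for i, r in enumerate(scores))

    # 3. Ideal DCG: the same relevances sorted best-first
    ideal = sorted(rels, reverse=True)

    # 4. Normalize (0.0 when there is nothing relevant to rank)
    idcg = dcg(ideal)
    return dcg(rels) / idcg if idcg > 0 else 0.0

# Relevant docs ranked first -> perfect score
print(ndcg_at_k(["doc1", "doc2"], ["doc1", "doc2", "doc4"], 5))      # 1.0
# Only relevant doc at rank 3 -> discounted by 1/log2(3)
print(round(ndcg_at_k(["doc1"], ["doc4", "doc5", "doc1"], 3), 4))    # 0.6309
```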

Example Output

# Example with perfect ranking (relevant docs at top)
example_results = [
    {"NDCG_any@1": 1.0, "NDCG_any@5": 1.0, "NDCG_any@10": 1.0},
    {"NDCG_any@1": 0.0, "NDCG_any@5": 0.43, "NDCG_any@10": 0.56}
]

pooled_results = {
    "NDCG_any@1": 0.5,
    "NDCG_any@5": 0.715,
    "NDCG_any@10": 0.78
}

Comparison of Metrics

| Metric | What It Measures | Best For | Considers Ranking |
| --- | --- | --- | --- |
| RetrievalRecall | % of relevant docs in top-k | Coverage of relevant documents | No |
| RetrievalRecallAll | Whether ALL relevant docs are in top-k | Tasks requiring complete retrieval | No |
| RetrievalNDCGAny | Ranking quality of relevant docs | Overall retrieval quality with ranking | Yes |

When to Use Each Metric

  • Use RetrievalRecall when you want to know what fraction of relevant documents are being found
  • Use RetrievalRecallAll when you need ALL relevant documents (e.g., critical information retrieval)
  • Use RetrievalNDCGAny when ranking matters and you want to reward systems that place relevant documents higher
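To see how the three metrics diverge, it helps to evaluate one query by hand from the formulas above (this computes the values directly rather than calling the library):

```python
import math

gold = ["doc1", "doc2"]
retrieved = ["doc3", "doc4", "doc1", "doc2"]  # relevant docs at ranks 3 and 4
k = 3
top_k = retrieved[:k]

# RetrievalRecall: fraction of gold docs found in the top-k
recall = sum(d in top_k for d in gold) / len(gold)

# RetrievalRecallAll: 1.0 only if every gold doc is in the top-k
recall_all = 1.0 if all(d in top_k for d in gold) else 0.0

# RetrievalNDCGAny: position-aware, using the classic log discount
rels = [1 if d in gold else 0 for d in top_k]            # [0, 0, 1]
dcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(rels))
ideal = sorted(rels, reverse=True)                       # [1, 0, 0]
idcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(ideal))
ndcg = dcg / idcg if idcg else 0.0

print(recall, recall_all, round(ndcg, 3))  # 0.5 0.0 0.631
```

The same retrieval run scores 0.5 on coverage, fails the strict all-docs check, and earns partial NDCG credit because the one found doc sits at rank 3 rather than rank 1.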

Common Patterns

Evaluating at Multiple K Values

All metrics support evaluating at multiple cutoff values simultaneously:
metric = RetrievalRecall()
results, _ = metric.calculate_metric_scores(
    gold_docs=[["doc1", "doc2"]],
    retrieved_chunks=[["doc1", "doc5", "doc2", "doc3"]],
    k_list=[1, 2, 5, 10, 20]  # Evaluate at multiple k values
)

print(results)
# {
#   "Recall@1": 0.5,  # Found 1 of 2 docs
#   "Recall@2": 0.5,  # Still 1 of 2
#   "Recall@5": 1.0,  # Found both docs
#   "Recall@10": 1.0,
#   "Recall@20": 1.0
# }

Document Matching

The metrics use intelligent chunk matching via is_chunk_from_original() which:
  • Handles date prefixes (strips “Date: ” lines)
  • Handles conversation format (“user: ” / “assistant: ” prefixes)
  • Performs exact string matching or JSON-escaped matching
  • Allows fuzzy matching between chunks and original documents
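The exact logic lives in is_chunk_from_original(); the following is only a simplified sketch of the prefix-stripping idea (the function names here are illustrative, not the library's API):

```python
def normalize(text: str) -> str:
    """Strip date lines and conversation-role prefixes before comparing (simplified)."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Date:"):
            continue  # drop "Date: ..." prefix lines entirely
        for prefix in ("user: ", "assistant: "):
            if stripped.startswith(prefix):
                stripped = stripped[len(prefix):]
        lines.append(stripped)
    return "\n".join(lines)

def chunk_matches(chunk: str, original: str) -> bool:
    """True if the normalized chunk appears within the normalized original."""
    return normalize(chunk) in normalize(original)

print(chunk_matches("user: hello world",
                    "Date: 2024-01-01\nuser: hello world"))  # True
```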

Per-Example vs Pooled Results

All metrics return both aggregated and per-example results:
pooled_results, example_results = metric.calculate_metric_scores(...)

# Pooled: averaged across all examples
print(pooled_results)
# {"Recall@5": 0.75}

# Per-example: scores for each individual query
print(example_results)
# [
#   {"Recall@5": 1.0},
#   {"Recall@5": 0.5}
# ]
This allows you to:
  • Report overall system performance (pooled)
  • Identify problematic queries (per-example)
  • Analyze performance distribution
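For instance, the per-example list makes it straightforward to surface the worst-performing queries (scores below are made up for illustration):

```python
example_results = [
    {"Recall@5": 1.0},
    {"Recall@5": 0.2},
    {"Recall@5": 0.5},
]

# Rank queries by Recall@5, worst first, keeping their original indices
worst = sorted(enumerate(example_results), key=lambda p: p[1]["Recall@5"])[:2]
for idx, scores in worst:
    print(f"query {idx}: Recall@5 = {scores['Recall@5']}")
# query 1: Recall@5 = 0.2
# query 2: Recall@5 = 0.5
```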
