RetrievalRecall
Measures the proportion of relevant documents that appear in the top-k retrieved results.
Usage
Parameters
Global configuration object (optional)
Methods
calculate_metric_scores
Calculates Recall@k for each example and pools results across all queries.
Signature:
List of lists containing ground truth (relevant documents) for each query. Each inner list contains the document IDs or content that should be retrieved.
List of lists containing retrieved documents for each query, in ranked order (most relevant first).
List of k values to calculate Recall@k for. Results are computed for each k value.
Dict[str, float]: Pooled results with averaged Recall@k across all examples
List[Dict[str, float]]: Per-example Recall@k scores
Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Formula: Recall@k = (# relevant docs in top-k) / (# total relevant docs)
- Use Case: Measures the system’s ability to retrieve all relevant documents within the top-k results
- Note: A score of 1.0 means all relevant documents were found in the top-k
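The formula above can be sketched in a few lines. This is a minimal illustration under the formula's definition, not the library's actual implementation; the function name and the pooling loop are assumptions.

```python
from typing import List

def recall_at_k(gold: List[str], retrieved: List[str], k: int) -> float:
    """Recall@k = (# relevant docs in top-k) / (# total relevant docs)."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    hits = sum(1 for doc in gold if doc in top_k)
    return hits / len(gold)

# Two toy queries, evaluated at multiple k values and pooled by averaging.
gold = [["d1", "d2"], ["d3"]]
retrieved = [["d1", "d5", "d2"], ["d4", "d3"]]
for k in (1, 3):
    pooled = sum(recall_at_k(g, r, k) for g, r in zip(gold, retrieved)) / len(gold)
    print(f"recall@{k} = {pooled}")  # recall@1 = 0.25, recall@3 = 1.0
```

The loop also shows why evaluating at several k values at once is cheap: each cutoff only re-slices the same ranked list.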
Example Output
RetrievalRecallAll
Measures whether ALL relevant documents are retrieved in the top-k results (strict binary metric).
Usage
Parameters
Global configuration object (optional)
Methods
calculate_metric_scores
Calculates Recall_all@k for each example and pools results across all queries.
Signature:
List of lists containing ground truth (relevant documents) for each query.
List of lists containing retrieved document chunks for each query, in ranked order.
List of k values to calculate Recall_all@k for.
Dict[str, float]: Pooled results with averaged Recall_all@k across all examples
List[Dict[str, float]]: Per-example Recall_all@k scores
Interpretation
- Score Range: 0.0 or 1.0 per example (binary)
- Higher is Better: Yes
- Formula: Recall_all@k = 1.0 if all relevant docs are in top-k, else 0.0
- Use Case: Strict metric for tasks requiring complete retrieval of all relevant documents
- Difference from RetrievalRecall: This metric gives 0.0 if even one relevant document is missing from top-k
- Chunk Matching: Uses fuzzy matching to determine if a retrieved chunk originates from a gold document
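The all-or-nothing behavior can be sketched as follows. Note this sketch uses exact set membership as a stand-in for the library's fuzzy chunk matching, and the function name is an assumption.

```python
from typing import List

def recall_all_at_k(gold: List[str], retrieved: List[str], k: int) -> float:
    """1.0 only if every relevant doc appears in the top-k, else 0.0."""
    top_k = set(retrieved[:k])
    return 1.0 if all(doc in top_k for doc in gold) else 0.0

gold = ["d1", "d2"]
retrieved = ["d1", "d5", "d2", "d7"]
print(recall_all_at_k(gold, retrieved, 2))  # "d2" missing from top-2 -> 0.0
print(recall_all_at_k(gold, retrieved, 3))  # both gold docs in top-3 -> 1.0
```

Contrast with RetrievalRecall on the same example: at k=2 it would score 0.5 (one of two gold docs found), while this metric scores 0.0.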
Example Output
RetrievalNDCGAny
Measures ranking quality using Normalized Discounted Cumulative Gain (NDCG), considering the position of relevant documents.
Usage
Parameters
Global configuration object (optional)
Methods
calculate_metric_scores
Calculates NDCG@k for each example and pools results across all queries.
Signature:
List of lists containing ground truth (relevant document identifiers) for each query.
List of lists containing retrieved document identifiers for each query, in ranked order.
List of cutoff ranks (k values) to calculate NDCG@k for.
Dict[str, float]: Pooled results with averaged NDCG@k across all examples
List[Dict[str, float]]: Per-example NDCG@k scores
Interpretation
- Score Range: 0.0 to 1.0
- Higher is Better: Yes
- Formula: NDCG@k = DCG@k / IDCG@k
- DCG (Discounted Cumulative Gain): Sum of relevances weighted by logarithmic position discount
- IDCG (Ideal DCG): DCG of the perfect ranking
- Use Case: Measures both retrieval quality AND ranking quality
- Key Insight: Rewards systems that rank relevant documents higher in the results
- Position Matters: Finding a relevant document at position 1 scores higher than at position 10
How NDCG Works
- Relevance Assignment: Each retrieved document gets a relevance score (1 if matches any gold doc, 0 otherwise)
- DCG Calculation: DCG = rel[0] + Σ(rel[i] / log2(i+1)) for i=1 to k-1
- Ideal DCG: Sort all relevances in descending order and compute DCG
- Normalization: NDCG = DCG / IDCG
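The four steps above map directly onto code. This is a hedged sketch of the stated formulas (binary relevance, the DCG discount as written, IDCG from the sorted relevances), not the library's implementation; the function name is an assumption.

```python
import math
from typing import List

def ndcg_at_k(gold: List[str], retrieved: List[str], k: int) -> float:
    gold_set = set(gold)
    # 1. Relevance assignment: 1 if the retrieved doc matches any gold doc, else 0.
    rels = [1.0 if doc in gold_set else 0.0 for doc in retrieved[:k]]
    # 2. DCG: rel[0] undiscounted, then rel[i] / log2(i+1) for i >= 1.
    dcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(rels))
    # 3. Ideal DCG: same sum over the relevances sorted in descending order.
    ideal = sorted(rels, reverse=True)
    idcg = sum(r if i == 0 else r / math.log2(i + 1) for i, r in enumerate(ideal))
    # 4. Normalization.
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["d1"], ["d1", "d2", "d3"], 3))            # relevant doc at rank 1 -> 1.0
print(round(ndcg_at_k(["d1"], ["d2", "d3", "d1"], 3), 2))  # relevant doc at rank 3 -> 0.63
```

The two printed scores illustrate the "position matters" point: the same document set scores lower when the relevant document is ranked last.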
Example Output
Comparison of Metrics
| Metric | What It Measures | Best For | Considers Ranking |
|---|---|---|---|
| RetrievalRecall | % of relevant docs in top-k | Coverage of relevant documents | No |
| RetrievalRecallAll | Whether ALL relevant docs are in top-k | Tasks requiring complete retrieval | No |
| RetrievalNDCGAny | Ranking quality of relevant docs | Overall retrieval quality with ranking | Yes |
When to Use Each Metric
- Use RetrievalRecall when you want to know what fraction of relevant documents are being found
- Use RetrievalRecallAll when you need ALL relevant documents (e.g., critical information retrieval)
- Use RetrievalNDCGAny when ranking matters and you want to reward systems that place relevant documents higher
Common Patterns
Evaluating at Multiple K Values
All metrics support evaluating at multiple cutoff values simultaneously.
Document Matching
The metrics use intelligent chunk matching via is_chunk_from_original(), which:
- Handles date prefixes (strips “Date: ” lines)
- Handles conversation format (“user: ” / “assistant: ” prefixes)
- Performs exact string matching or JSON-escaped matching
- Allows fuzzy matching between chunks and original documents
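The normalization steps above can be approximated as follows. This is a simplified stand-in for illustration only, not the actual is_chunk_from_original() implementation; the function name, prefix handling, and omission of the fuzzy-matching fallback are all assumptions.

```python
import json

def chunk_matches_original(chunk: str, original: str) -> bool:
    """Sketch: normalize a chunk, then test containment in the original document."""
    lines = chunk.splitlines()
    # Strip a leading "Date: " line if present.
    if lines and lines[0].startswith("Date: "):
        lines = lines[1:]
    # Strip conversation-format role prefixes.
    cleaned = []
    for line in lines:
        for prefix in ("user: ", "assistant: "):
            if line.startswith(prefix):
                line = line[len(prefix):]
        cleaned.append(line)
    text = "\n".join(cleaned).strip()
    # Exact containment, or containment of the JSON-escaped form.
    return text in original or json.dumps(text)[1:-1] in original

print(chunk_matches_original("Date: 2024-01-01\nuser: hello world",
                             "say hello world now"))  # True
```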
Per-Example vs Pooled Results
All metrics return both aggregated and per-example results, so you can:
- Report overall system performance (pooled)
- Identify problematic queries (per-example)
- Analyze performance distribution
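Pooling is a plain average of the per-example scores, which is why both views are cheap to report together. A minimal sketch (the key name "recall@5" and the pooling function are illustrative assumptions):

```python
from typing import Dict, List

def pool(per_example: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric key across all examples."""
    keys = per_example[0].keys()
    return {k: sum(ex[k] for ex in per_example) / len(per_example) for k in keys}

per_example = [
    {"recall@5": 1.0},  # query 0: all relevant docs found
    {"recall@5": 0.5},  # query 1: half found
    {"recall@5": 0.0},  # query 2: nothing found -- a query worth inspecting
]
print(pool(per_example))  # {'recall@5': 0.5}
```

A healthy pooled score can hide queries like query 2; scanning the per-example list surfaces them.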