The evaluation module provides standard information retrieval metrics for measuring the quality of document retrieval in RAG pipelines.

Overview

VectorDB implements five core retrieval metrics used in academic and industry benchmarking:
  • Recall@k - Fraction of relevant documents retrieved in top-k results
  • Precision@k - Fraction of top-k results that are relevant
  • MRR (Mean Reciprocal Rank) - Average of reciprocal ranks of first relevant document
  • NDCG@k (Normalized DCG) - Rank-aware metric normalized by ideal ranking
  • Hit Rate - Binary indicator if any relevant document appears in top-k
All metrics assume binary relevance (a document is either relevant or not). Ranks are 1-indexed: the top result is rank 1.

Metric formulas

Recall@k

Measures the proportion of relevant documents that were successfully retrieved:
Recall@k = |relevant ∩ retrieved_top_k| / |relevant|
Range: 0.0 to 1.0 (higher is better)
Example: If there are 10 relevant documents and the top-5 results contain 3 of them, Recall@5 = 3/10 = 0.3

Precision@k

Measures the proportion of retrieved documents that are actually relevant:
Precision@k = |relevant ∩ retrieved_top_k| / k
Range: 0.0 to 1.0 (higher is better)
Example: If the top-5 results contain 3 relevant documents, Precision@5 = 3/5 = 0.6
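Both formulas reduce to counting the intersection between the top-k retrieved IDs and the ground-truth set. A minimal sketch of the arithmetic (see Computing metrics below for the library's compute_recall_at_k and compute_precision_at_k helpers):
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc7"}
k = 5

hits = len(set(retrieved[:k]) & relevant)  # relevant docs found in the top-k
recall = hits / len(relevant)              # 2/3 ≈ 0.667
precision = hits / k                       # 2/5 = 0.4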

MRR (Mean Reciprocal Rank)

Measures how quickly the first relevant document appears:
MRR = mean(1 / rank_of_first_relevant)
Range: 0.0 to 1.0 (higher is better)
Example: If the first relevant document appears at position 3, the reciprocal rank for that query is 1/3 ≈ 0.333; MRR is the mean of these reciprocal ranks over all queries
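A minimal sketch of the calculation with illustrative data (queries that retrieve no relevant document contribute 0, a common convention):
def reciprocal_rank(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):  # ranks are 1-indexed
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document retrieved

queries = [(["a", "b", "c"], {"c"}), (["x", "y"], {"x"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
# (1/3 + 1/1) / 2 ≈ 0.667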

NDCG@k (Normalized Discounted Cumulative Gain)

Rank-aware metric that gives more weight to relevant documents appearing earlier:
DCG@k = Σ(rel_i / log2(i + 2)) for i in range(k)
IDCG@k = DCG@k with all relevant docs ranked first (ideal)
NDCG@k = DCG@k / IDCG@k
Range: 0.0 to 1.0 (higher is better)
Note: With the 0-indexed loop variable i, log2(i + 2) is equivalent to log2(rank + 1) for 1-indexed ranks; the offset avoids dividing by log2(1) = 0 at the top rank
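A minimal sketch of binary-relevance NDCG@k following the formulas above (not the library implementation):
import math

def ndcg_at_k(retrieved, relevant, k):
    # DCG: each relevant doc contributes 1 / log2(rank + 1), ranks 1-indexed
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    # IDCG: the ideal ranking places all relevant docs at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ndcg_at_k(["doc1", "doc2", "doc3"], {"doc1", "doc3"}, k=3)  # ≈ 0.920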

Hit Rate

Binary success metric indicating whether any relevant document was retrieved:
Hit Rate = 1 if any relevant in top-k, else 0
Range: 0.0 to 1.0 (higher is better)
Aggregation: Computed as the proportion of queries with at least one hit
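Aggregation is simply the mean of the per-query binary values, sketched here with hypothetical results:
per_query_hits = [1.0, 0.0, 1.0, 1.0]  # 1.0 if the query had a hit in top-k
hit_rate = sum(per_query_hits) / len(per_query_hits)  # 0.75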

Data structures

RetrievalMetrics

Container for aggregated evaluation metrics:
@dataclass
class RetrievalMetrics:
    recall_at_k: float = 0.0      # Proportion of relevant docs retrieved
    precision_at_k: float = 0.0    # Proportion of retrieved docs that are relevant
    mrr: float = 0.0               # Mean reciprocal rank
    ndcg_at_k: float = 0.0         # Normalized discounted cumulative gain
    hit_rate: float = 0.0          # Proportion of queries with at least one hit
    num_queries: int = 0           # Number of queries evaluated
    k: int = 5                     # Cutoff value for top-k metrics
Usage:
from vectordb.utils.evaluation import RetrievalMetrics

metrics = RetrievalMetrics(
    recall_at_k=0.65,
    precision_at_k=0.80,
    mrr=0.72,
    ndcg_at_k=0.78,
    hit_rate=0.90,
    num_queries=100,
    k=5
)

# Convert to dictionary for JSON serialization
metrics_dict = metrics.to_dict()
# {"recall@5": 0.65, "precision@5": 0.80, "mrr": 0.72, ...}

QueryResult

Result for a single query evaluation:
@dataclass
class QueryResult:
    query: str                          # Query string
    retrieved_ids: list[str]            # Retrieved document IDs (ranked)
    retrieved_contents: list[str]       # Retrieved document contents
    relevant_ids: set[str]              # Ground truth relevant IDs
    scores: list[float]                 # Retrieval scores
Usage:
from vectordb.utils.evaluation import QueryResult

result = QueryResult(
    query="What is machine learning?",
    retrieved_ids=["doc1", "doc2", "doc3"],
    retrieved_contents=["ML is...", "AI involves...", "Deep learning..."],
    relevant_ids={"doc1", "doc4"},
    scores=[0.95, 0.87, 0.82]
)

EvaluationResult

Complete evaluation result for a retrieval pipeline:
@dataclass
class EvaluationResult:
    metrics: RetrievalMetrics            # Aggregated metrics
    query_results: list[QueryResult]     # Per-query results
    pipeline_name: str                   # Pipeline identifier
    dataset_name: str                    # Dataset identifier
    config: dict[str, Any]               # Configuration used
Usage:
from vectordb.utils.evaluation import EvaluationResult

eval_result = EvaluationResult(
    metrics=metrics,
    query_results=query_results,
    pipeline_name="semantic_search_pinecone",
    dataset_name="triviaqa",
    config={"top_k": 5, "backend": "pinecone"}
)

# Convert to dictionary
result_dict = eval_result.to_dict()

Computing metrics

Single-query metrics

Compute metrics for individual queries:
from vectordb.utils.evaluation import (
    compute_recall_at_k,
    compute_precision_at_k,
    compute_mrr,
    compute_ndcg_at_k,
    compute_hit_rate
)

retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc7"}
k = 5

recall = compute_recall_at_k(retrieved, relevant, k)
# 2/3 ≈ 0.667 (found 2 of the 3 relevant docs)

precision = compute_precision_at_k(retrieved, relevant, k)
# 2/5 = 0.4 (2 out of 5 retrieved docs are relevant)

mrr = compute_mrr(retrieved, relevant)
# 1/1 = 1.0 (first relevant doc at position 1)

ndcg = compute_ndcg_at_k(retrieved, relevant, k)
# Rank-aware score considering positions of relevant docs

hit = compute_hit_rate(retrieved, relevant, k)
# 1.0 (at least one relevant doc in top-5)

Aggregated metrics

Compute metrics across multiple queries:
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

query_results = [
    QueryResult(
        query="query1",
        retrieved_ids=["doc1", "doc2", "doc3"],
        retrieved_contents=[],
        relevant_ids={"doc1", "doc4"},
        scores=[0.95, 0.87, 0.82]
    ),
    QueryResult(
        query="query2",
        retrieved_ids=["doc5", "doc6", "doc7"],
        retrieved_contents=[],
        relevant_ids={"doc7", "doc8"},
        scores=[0.91, 0.88, 0.85]
    ),
    # ... more queries
]

# Compute aggregated metrics
metrics = evaluate_retrieval(query_results, k=5)

print(f"Recall@5: {metrics.recall_at_k:.3f}")
print(f"Precision@5: {metrics.precision_at_k:.3f}")
print(f"MRR: {metrics.mrr:.3f}")
print(f"NDCG@5: {metrics.ndcg_at_k:.3f}")
print(f"Hit Rate: {metrics.hit_rate:.3f}")
print(f"Queries: {metrics.num_queries}")

Evaluation workflow

Typical evaluation pipeline:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

# 1. Load dataset and evaluation queries
loader = DataloaderCatalog.create("triviaqa", split="test", limit=100)
dataset = loader.load()
eval_queries = extract_evaluation_queries(dataset)  # Custom function: returns objects with .query and .relevant_doc_ids

# 2. Index documents in vector database
# (assumes `db`, `embed`, and `documents` are set up elsewhere)
db.upsert(documents)

# 3. Execute evaluation queries
query_results = []
for eval_query in eval_queries:
    # Get query embedding
    query_vector = embed(eval_query.query)

    # Retrieve documents
    retrieved_docs = db.query(vector=query_vector, top_k=10)

    # Create QueryResult
    query_results.append(
        QueryResult(
            query=eval_query.query,
            retrieved_ids=[doc.id for doc in retrieved_docs],
            retrieved_contents=[doc.content for doc in retrieved_docs],
            relevant_ids=set(eval_query.relevant_doc_ids),
            scores=[doc.score for doc in retrieved_docs]
        )
    )

# 4. Compute metrics
metrics = evaluate_retrieval(query_results, k=10)

# 5. Display results
print("\nEvaluation Results:")
for key, value in metrics.to_dict().items():
    print(f"{key}: {value}")

Choosing k values

The cutoff k determines how many top results are considered:

k=5

Use case: Strict precision requirements
Scenario: Chat interfaces with limited context window

k=10

Use case: Balanced evaluation
Scenario: Standard RAG pipelines with reranking

k=20

Use case: Recall-focused evaluation
Scenario: Multi-stage retrieval (retrieve many, rerank to few)

k=100

Use case: First-stage retrieval quality
Scenario: Evaluating retriever before compression/filtering
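Because each metric only needs the ranked list up to the cutoff, one set of query results retrieved with a large top_k can be scored at several k values. A sketch, assuming query_results holds at least 100 ranked IDs per query:
from vectordb.utils.evaluation import evaluate_retrieval

for k in (5, 10, 20, 100):
    metrics = evaluate_retrieval(query_results, k=k)
    print(f"k={k}: recall={metrics.recall_at_k:.3f}, ndcg={metrics.ndcg_at_k:.3f}")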

Binary vs graded relevance

VectorDB metrics assume binary relevance (document is relevant or not):
# Binary relevance
relevant_ids = {"doc1", "doc3", "doc7"}  # Either relevant or not
For graded relevance (documents have relevance scores 0-3), you would need custom implementations. The current NDCG implementation treats all relevant documents as having relevance score 1.0.
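As a hypothetical sketch of what a graded variant could look like (grades maps document IDs to 0-3 relevance grades; nothing here is part of the VectorDB API):
import math

def graded_ndcg_at_k(retrieved, grades, k):
    # DCG with graded gains instead of binary 0/1 relevance
    dcg = sum(grades.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]))
    # IDCG: the k highest grades in descending order
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

graded_ndcg_at_k(["doc1", "doc2"], {"doc1": 3, "doc2": 1, "doc3": 2}, k=2)  # ≈ 0.852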

Comparing pipelines

Evaluate multiple retrieval strategies:
from vectordb.utils.evaluation import evaluate_retrieval

pipelines = [
    {"name": "semantic_search", "results": semantic_results},
    {"name": "hybrid_search", "results": hybrid_results},
    {"name": "with_reranking", "results": reranked_results}
]

for pipeline in pipelines:
    metrics = evaluate_retrieval(pipeline["results"], k=10)
    print(f"\n{pipeline['name']}:")
    print(f"  Recall@10: {metrics.recall_at_k:.3f}")
    print(f"  NDCG@10: {metrics.ndcg_at_k:.3f}")
    print(f"  MRR: {metrics.mrr:.3f}")
Example output:
semantic_search:
  Recall@10: 0.652
  NDCG@10: 0.701
  MRR: 0.745

hybrid_search:
  Recall@10: 0.712
  NDCG@10: 0.758
  MRR: 0.803

with_reranking:
  Recall@10: 0.718
  NDCG@10: 0.821
  MRR: 0.867

Interpreting metrics

Recall@k

  • High recall (greater than 0.8): System finds most relevant documents
  • Medium recall (0.5-0.8): System misses some relevant documents
  • Low recall (less than 0.5): System misses many relevant documents
Impact: Low recall means users may not see important information.

Precision@k

  • High precision (greater than 0.8): Most retrieved documents are relevant
  • Medium precision (0.5-0.8): Some irrelevant documents retrieved
  • Low precision (less than 0.5): Many irrelevant documents retrieved
Impact: Low precision means users see too much noise.

MRR

  • High MRR (greater than 0.8): First relevant result appears very early (positions 1-2)
  • Medium MRR (0.5-0.8): First relevant result around positions 2-4
  • Low MRR (less than 0.5): First relevant result appears late (position 5+)
Impact: Low MRR means users must scroll to find relevant content.

NDCG@k

  • High NDCG (greater than 0.8): Relevant documents ranked highly
  • Medium NDCG (0.5-0.8): Relevant documents have mixed rankings
  • Low NDCG (less than 0.5): Relevant documents ranked poorly
Impact: Low NDCG means ranking quality is poor even if recall is high.

Hit Rate

  • High hit rate (greater than 0.9): Almost all queries retrieve at least one relevant document
  • Medium hit rate (0.7-0.9): Most queries successful
  • Low hit rate (less than 0.7): Many queries fail to find any relevant document
Impact: Low hit rate means many queries return zero useful results.

Statistical significance

For robust comparisons, consider:
  • Sample size: Evaluate on at least 100 queries for stable metrics
  • Multiple runs: Run evaluations multiple times if randomness is involved
  • Variance: Report standard deviation or confidence intervals (see the sketch after this list)
  • Domain coverage: Ensure evaluation queries cover all use cases
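One way to report variance is to score each query with the single-query helpers and bootstrap a confidence interval. A sketch, assuming query_results from the earlier examples:
import random
import statistics

from vectordb.utils.evaluation import compute_recall_at_k

# Per-query recall values at k=10
per_query = [
    compute_recall_at_k(qr.retrieved_ids, qr.relevant_ids, 10)
    for qr in query_results
]
print(f"mean={statistics.mean(per_query):.3f}, stdev={statistics.stdev(per_query):.3f}")

# Simple bootstrap 95% confidence interval over query-level recall
means = sorted(
    statistics.mean(random.choices(per_query, k=len(per_query)))
    for _ in range(1000)
)
print(f"95% CI: [{means[25]:.3f}, {means[974]:.3f}]")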

Best practices

Use multiple metrics

Don’t rely on a single metric. Precision and recall trade off, so evaluate both.

Choose appropriate k

Match k to your application’s context window (e.g., k=5 for chat, k=20 for reranking).

Segment by difficulty

Analyze performance on easy vs. hard queries separately to see where the retriever struggles.

Track over time

Monitor metrics across model/config changes to detect regressions.

Integration with pipelines

Evaluation metrics integrate with feature modules:
# haystack/semantic_search/search/pinecone.py
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

query_results = []
for query in eval_queries:
    results = pipeline.run(query)
    query_results.append(
        QueryResult(
            query=query.query,
            retrieved_ids=[doc.id for doc in results["documents"]],
            retrieved_contents=[doc.content for doc in results["documents"]],
            relevant_ids=set(query.relevant_doc_ids),
            scores=[doc.score for doc in results["documents"]]
        )
    )

metrics = evaluate_retrieval(query_results, k=10)
print(f"Pipeline evaluation: {metrics.to_dict()}")
This consistent evaluation approach allows fair comparison across different backends, frameworks, and retrieval strategies.
