Overview
VectorDB implements five core retrieval metrics used in academic and industry benchmarking:
- Recall@k - Fraction of relevant documents retrieved in the top-k results
- Precision@k - Fraction of top-k results that are relevant
- MRR (Mean Reciprocal Rank) - Average reciprocal rank of the first relevant document across queries
- NDCG@k (Normalized DCG) - Rank-aware metric normalized by ideal ranking
- Hit Rate - Binary indicator of whether any relevant document appears in the top-k
Metric formulas
Recall@k
Measures the proportion of relevant documents that were successfully retrieved:
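In standard form, with relevant denoting the ground-truth set for a query and retrieved_k the top-k results:

$$\mathrm{Recall@}k = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}_k|}{|\mathrm{relevant}|}$$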
Precision@k
Measures the proportion of retrieved documents that are actually relevant:
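$$\mathrm{Precision@}k = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}_k|}{k}$$

Under this standard definition the denominator is k itself, so returning fewer than k documents can only lower precision.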
MRR (Mean Reciprocal Rank)
Measures how quickly the first relevant document appears:
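Over a query set Q:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$

where rank_q is the 1-indexed position of the first relevant document for query q; by convention the term is 0 when no relevant document is retrieved.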
NDCG@k (Normalized Discounted Cumulative Gain)
Rank-aware metric that gives more weight to relevant documents appearing earlier. With positions indexed from 0, the discount is log2(i + 2) rather than log2(i + 1), because log2(1) = 0 would make the first position's term undefined:
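$$\mathrm{DCG@}k = \sum_{i=0}^{k-1} \frac{rel_i}{\log_2(i + 2)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$

where rel_i is the binary relevance of the document at position i and IDCG@k is the DCG of the ideal ordering of the relevant documents.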
Hit Rate
Binary success metric indicating whether any relevant document was retrieved:
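Averaged over the query set Q:

$$\mathrm{Hit\,Rate@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\big[\mathrm{relevant}_q \cap \mathrm{retrieved}_{q,k} \neq \emptyset\big]$$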
Data structures
RetrievalMetrics
Container for aggregated evaluation metrics:
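A minimal sketch of such a container; the field names are illustrative assumptions, not VectorDB's documented API:

```python
from dataclasses import dataclass

@dataclass
class RetrievalMetrics:
    recall_at_k: float
    precision_at_k: float
    mrr: float
    ndcg_at_k: float
    hit_rate: float
    k: int  # cutoff the metrics were computed at
```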
QueryResult
Result for a single query evaluation:
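Continuing the same illustrative sketch:

```python
@dataclass
class QueryResult:
    query_id: str
    retrieved_ids: list[str]   # document IDs in rank order
    relevant_ids: set[str]     # ground-truth relevant document IDs
    metrics: RetrievalMetrics  # per-query metrics at the chosen k
```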
EvaluationResult
Complete evaluation result for a retrieval pipeline:
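And the top-level container, again with assumed field names:

```python
@dataclass
class EvaluationResult:
    pipeline_name: str
    aggregate: RetrievalMetrics   # metrics averaged over all queries
    per_query: list[QueryResult]  # per-query breakdown for error analysis
```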
Computing metrics
Single-query metrics
Compute metrics for individual queries:
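A self-contained sketch of the per-query computation under binary relevance; evaluate_query is a hypothetical name, not VectorDB's actual entry point:

```python
import math

def evaluate_query(retrieved: list[str], relevant: set[str], k: int) -> dict[str, float]:
    """Compute all five metrics for one query under binary relevance."""
    top_k = retrieved[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant]

    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / k

    # Reciprocal rank of the first relevant document (0.0 if none in top-k).
    rr = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant:
            rr = 1.0 / rank
            break

    # Binary-relevance DCG; positions are 0-indexed, hence log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(top_k) if d in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {
        "recall": recall,
        "precision": precision,
        "mrr": rr,
        "ndcg": ndcg,
        "hit_rate": 1.0 if hits else 0.0,
    }
```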
Aggregated metrics
Compute metrics across multiple queries:
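A sketch of aggregation, assuming macro-averaging (each query weighted equally):

```python
def aggregate_metrics(per_query: list[dict[str, float]]) -> dict[str, float]:
    """Macro-average per-query metric dicts; assumes a non-empty list."""
    n = len(per_query)
    return {key: sum(m[key] for m in per_query) / n for key in per_query[0]}
```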
Evaluation workflow
Typical evaluation pipeline:
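A sketch of the end-to-end loop built on the helpers above; search_fn stands in for whatever retrieval call your pipeline exposes:

```python
def run_evaluation(queries: dict[str, str],
                   judgments: dict[str, set[str]],
                   search_fn,
                   k: int = 10) -> dict[str, float]:
    """Run every query through the retriever and aggregate the results."""
    per_query = [
        evaluate_query(search_fn(text, k), judgments[query_id], k)
        for query_id, text in queries.items()
    ]
    return aggregate_metrics(per_query)
```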
Choosing k values
The cutoff k determines how many top results are considered:
| k | Use case | Scenario |
|-----|----------|----------|
| 5 | Strict precision requirements | Chat interfaces with limited context window |
| 10 | Balanced evaluation | Standard RAG pipelines with reranking |
| 20 | Recall-focused evaluation | Multi-stage retrieval (retrieve many, rerank to few) |
| 100 | First-stage retrieval quality | Evaluating retriever before compression/filtering |
Binary vs graded relevance
VectorDB metrics assume binary relevance (a document is either relevant or not):
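Ground-truth judgments therefore reduce to a set of relevant document IDs per query, for example:

```python
# Binary judgments: each query ID maps to the set of relevant document IDs.
judgments = {
    "q1": {"doc_12", "doc_47"},
    "q2": {"doc_03"},
}
```

With binary labels the rel_i term in NDCG is simply 0 or 1; graded labels (e.g., a 0-3 scale) would require graded gains, which these metrics do not model.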
Comparing pipelines
Evaluate multiple retrieval strategies:
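A sketch using the helpers above; the two search functions are placeholders for your actual retrievers:

```python
def dense_search(text: str, k: int) -> list[str]:
    return []  # placeholder: call your dense retriever here

def hybrid_search(text: str, k: int) -> list[str]:
    return []  # placeholder: call your hybrid retriever here

pipelines = {"dense": dense_search, "hybrid": hybrid_search}
for name, search_fn in pipelines.items():
    metrics = run_evaluation(queries, judgments, search_fn, k=10)
    print(f"{name}: {metrics}")
```

Hold the query set, judgments, and k fixed across pipelines so the numbers are directly comparable.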
Interpreting metrics
Recall@k
- High recall (> 0.8): System finds most relevant documents
- Medium recall (0.5-0.8): System misses some relevant documents
- Low recall (< 0.5): System misses many relevant documents
Precision@k
- High precision (> 0.8): Most retrieved documents are relevant
- Medium precision (0.5-0.8): Some irrelevant documents retrieved
- Low precision (< 0.5): Many irrelevant documents retrieved
MRR
- High MRR (> 0.8): First relevant result is at position 1 for most queries
- Medium MRR (0.5-0.8): First relevant result around position 1-2 on average
- Low MRR (< 0.5): First relevant result beyond position 2 on average
NDCG@k
- High NDCG (> 0.8): Relevant documents ranked highly
- Medium NDCG (0.5-0.8): Relevant documents have mixed rankings
- Low NDCG (< 0.5): Relevant documents ranked poorly
Hit Rate
- High hit rate (> 0.9): Almost all queries retrieve at least one relevant document
- Medium hit rate (0.7-0.9): Most queries successful
- Low hit rate (< 0.7): Many queries fail to find any relevant document
Statistical significance
For robust comparisons, consider:
- Sample size: Evaluate on at least 100 queries for stable metrics
- Multiple runs: Run evaluations multiple times if randomness is involved
- Variance: Report standard deviation or confidence intervals (see the sketch after this list)
- Domain coverage: Ensure evaluation queries cover all use cases
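One illustrative way to get a confidence interval is a percentile bootstrap over per-query scores; this helper is an assumption for demonstration, not part of VectorDB:

```python
import random

def bootstrap_ci(values: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-query metric values."""
    means = sorted(
        sum(random.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    return means[int(n_resamples * alpha / 2)], means[int(n_resamples * (1 - alpha / 2)) - 1]
```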
Best practices
Use multiple metrics
Don’t rely on a single metric. Precision and recall trade off, so evaluate both.
Choose appropriate k
Match k to your application’s context window (e.g., k=5 for chat, k=20 for reranking).
Segment by difficulty
Analyze performance on easy vs. hard queries separately to reveal where the system struggles.
Track over time
Monitor metrics across model/config changes to detect regressions