Evaluation metrics
VectorDB includes five core retrieval metrics:

| Metric | What it measures | When to optimize |
|---|---|---|
| Recall@k | Fraction of relevant documents retrieved in top-k | When missing relevant documents is costly |
| Precision@k | Fraction of top-k documents that are relevant | When irrelevant results harm user experience |
| MRR | Mean Reciprocal Rank of first relevant document | When users only look at top results |
| NDCG@k | Normalized Discounted Cumulative Gain | When ranking order matters |
| Hit Rate | Percentage of queries with at least one relevant result | When recall is binary (found vs not found) |
Metric formulas
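The conventional definitions are given below, with $R$ the set of relevant documents for a query, $D_k$ the top-$k$ retrieved documents, and $Q$ the query set. VectorDB's exact implementations may differ in details such as the DCG gain function, so treat these as the standard forms rather than the library's exact code:

$$\mathrm{Recall@}k = \frac{|R \cap D_k|}{|R|}, \qquad \mathrm{Precision@}k = \frac{|R \cap D_k|}{k}$$

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\mathrm{rank}_q}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \quad \mathrm{DCG@}k = \sum_{i=1}^{k}\frac{rel_i}{\log_2(i+1)}$$

where $\mathrm{rank}_q$ is the rank of the first relevant document for query $q$ and $rel_i$ is the relevance of the document at rank $i$. Hit Rate is the fraction of queries for which $R \cap D_k \neq \varnothing$.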
Supported datasets
VectorDB includes loaders for five benchmark datasets:

TriviaQA

Open-domain question-answer pairs for general knowledge retrieval.

Use case: General knowledge QA systems, broad domain retrieval
ARC (AI2 Reasoning Challenge)

Science reasoning questions requiring multi-hop inference.

Use case: Scientific and educational content retrieval
PopQA

Factoid questions about popular entities.

Use case: Entity-focused retrieval, celebrity and popular culture
FactScore

Atomic facts for verification and hallucination detection.

Use case: Fact verification, hallucination detection
Earnings Calls

Financial transcript Q&A for domain-specific RAG.

Use case: Financial domain, long-form transcripts
Running evaluations
Basic evaluation
Evaluate a single pipeline configuration:
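A minimal sketch, assuming a `vectordb` package that exposes `Pipeline`, `load_dataset`, and `evaluate` entry points; these names and parameters are illustrative, not confirmed API:

```python
# Hypothetical API -- names are illustrative; check the actual VectorDB docs.
from vectordb import Pipeline, load_dataset, evaluate

# Load one of the built-in benchmark datasets.
dataset = load_dataset("triviaqa")

# Configure a single pipeline: dense retrieval with a fixed top-k.
pipeline = Pipeline(
    database="chroma",
    embedding_model="all-MiniLM-L6-v2",
    top_k=10,
)

# Run the evaluation and report the core retrieval metrics.
results = evaluate(pipeline, dataset, metrics=["recall@10", "mrr", "ndcg@10"])
print(results)
```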
Cross-database comparison

Compare the same configuration across multiple databases:
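Extending the same sketch, hold every setting fixed and loop over backends; the backend names and the `database` parameter are assumptions for illustration:

```python
# Hypothetical API -- same configuration, swapped storage backend.
from vectordb import Pipeline, load_dataset, evaluate

dataset = load_dataset("triviaqa")

for database in ["chroma", "qdrant", "weaviate", "pgvector"]:
    pipeline = Pipeline(database=database,
                        embedding_model="all-MiniLM-L6-v2",
                        top_k=10)
    results = evaluate(pipeline, dataset, metrics=["recall@10", "mrr"])
    print(f"{database:10s} recall@10={results['recall@10']:.3f} "
          f"mrr={results['mrr']:.3f}")
```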
Comparing retrieval strategies

Benchmark different retrieval approaches:
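A hedged sketch comparing the strategy tiers from the table below; the `search` and `reranker` parameters are assumed names:

```python
# Hypothetical API -- dense vs. sparse vs. hybrid vs. hybrid + reranking.
from vectordb import Pipeline, load_dataset, evaluate

dataset = load_dataset("triviaqa")

strategies = {
    "dense":         Pipeline(search="dense"),
    "sparse":        Pipeline(search="sparse"),   # BM25-style lexical search
    "hybrid":        Pipeline(search="hybrid"),
    "hybrid+rerank": Pipeline(search="hybrid",
                              reranker="cross-encoder/ms-marco-MiniLM-L-6-v2"),
}

for name, pipeline in strategies.items():
    results = evaluate(pipeline, dataset, metrics=["recall@10", "mrr"])
    print(name, results)
```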
Evaluation with reranking metrics

When using reranking, track additional quality metrics (see the sketch after this list):

- Contextual Recall: Do retrieved chunks contain information needed for the answer?
- Contextual Precision: Are retrieved chunks relevant to the question?
- Answer Relevancy: Does the generated answer address the question?
- Faithfulness: Is the answer grounded in the retrieved context?
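A sketch of how these four metrics might be requested, reusing the `pipeline` and `dataset` from the earlier examples; whether VectorDB computes them natively or delegates to an external evaluator such as RAGAS or DeepEval is an assumption here:

```python
# Hypothetical API -- metric identifiers mirror the list above.
results = evaluate(
    pipeline,
    dataset,
    metrics=[
        "contextual_recall",     # retrieved chunks contain the needed information
        "contextual_precision",  # retrieved chunks are relevant to the question
        "answer_relevancy",      # the generated answer addresses the question
        "faithfulness",          # the answer is grounded in the retrieved context
    ],
)
```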
Cost-quality tradeoffs
Evaluate cost alongside quality for production deployments:
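A minimal sketch, assuming `evaluate` accepts a cost-tracking flag and returns a spend figure; both the `track_cost` parameter and the `cost_usd` field are assumptions:

```python
# Hypothetical sketch: compare configurations on cost per quality point,
# not quality alone.
results = evaluate(pipeline, dataset, metrics=["recall@10"], track_cost=True)

quality = results["recall@10"]
cost = results["cost_usd"]  # assumed field: embedding + LLM + reranker spend
print(f"recall@10={quality:.3f} at ${cost:.4f} per run "
      f"({cost / max(quality, 1e-9):.4f} $ per recall point)")
```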
Benchmark configuration best practices

Consistent evaluation sets
Use the same evaluation queries across all runs:
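For example (plain Python; `all_queries` stands in for your full query pool):

```python
# Freeze one query sample to disk and reuse it everywhere, so runs stay
# comparable across configurations and over time.
import json
import random

random.seed(42)
eval_queries = random.sample(all_queries, k=200)  # all_queries: your query pool

with open("eval_queries.json", "w") as f:
    json.dump(eval_queries, f)

# Every benchmark run then loads the same frozen file:
with open("eval_queries.json") as f:
    eval_queries = json.load(f)
```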
Warm-up queries
Run warm-up queries before timing measurements:
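A sketch in plain Python; the `pipeline.search()` call is an assumed method name from the earlier examples:

```python
# Issue untimed warm-up queries so caches, connections, and lazily loaded
# models are hot before latency is measured.
import time

for query in eval_queries[:10]:   # a handful of throwaway queries
    pipeline.search(query)        # results discarded; not timed

latencies = []
for query in eval_queries:
    start = time.perf_counter()
    pipeline.search(query)
    latencies.append(time.perf_counter() - start)
```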
Multiple runs for stability
Average metrics over multiple runs:
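For example, reusing the hypothetical `evaluate` call from above:

```python
# Repeat the evaluation and report mean and spread, since single runs can be
# noisy (network jitter, nondeterministic ANN search).
import statistics

recalls = []
for _ in range(5):
    results = evaluate(pipeline, dataset, metrics=["recall@10"])
    recalls.append(results["recall@10"])

print(f"recall@10 = {statistics.mean(recalls):.3f} "
      f"± {statistics.stdev(recalls):.3f} over {len(recalls)} runs")
```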
Control for randomness
Set random seeds for reproducibility:
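A plain-Python sketch; whether the pipeline itself accepts a seed is an assumption, noted in the comment:

```python
# Pin every source of randomness you control.
import os
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# Only affects subprocesses launched after this point; set it in the shell
# before starting Python to control the current interpreter's hashing.
os.environ["PYTHONHASHSEED"] = str(SEED)

# If the pipeline samples queries or seeds its ANN index, pass the seed
# through as well (parameter name is an assumption):
# pipeline = Pipeline(..., seed=SEED)
```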
Interpreting results
When to optimize each metric
Recall
Optimize when missing relevant documents is costly: medical diagnosis, legal research, and other safety-critical applications.
Precision
Optimize when showing irrelevant results harms the user experience: consumer search, recommendation systems.
MRR
Optimize when users only examine the top results: web search, autocomplete.
NDCG
Optimize when ranking quality matters more than binary relevance: e-commerce, content discovery.
Typical metric ranges
| Configuration | Expected Recall@10 | Expected MRR |
|---|---|---|
| Dense search (baseline) | 0.65-0.75 | 0.45-0.55 |
| Sparse search (BM25) | 0.60-0.70 | 0.40-0.50 |
| Hybrid search | 0.75-0.85 | 0.55-0.65 |
| Hybrid + Reranking | 0.80-0.90 | 0.65-0.75 |
| Agentic RAG | 0.85-0.95 | 0.70-0.80 |
Advanced benchmarking
Per-query analysis
Identify queries where the pipeline struggles:
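A sketch assuming `evaluate` can return per-query scores (the `per_query` flag and result shape are assumptions):

```python
# Hypothetical sketch: inspect the worst-scoring queries to find systematic
# failure modes (e.g., multi-hop questions, rare entities, long queries).
results = evaluate(pipeline, dataset, metrics=["recall@10"], per_query=True)

worst = sorted(results["per_query"], key=lambda r: r["recall@10"])[:20]
for row in worst:
    print(f"{row['recall@10']:.2f}  {row['query']}")
```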
Ablation studies

Measure the impact of individual components:
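A hedged sketch: toggle one component at a time against a full baseline and attribute the metric delta to that component. The `reranker` and `query_expansion` parameters are assumed names:

```python
# Hypothetical sketch: remove one component per variant and compare.
configs = {
    "full":        Pipeline(search="hybrid", reranker="cross-encoder",
                            query_expansion=True),
    "- reranker":  Pipeline(search="hybrid", reranker=None,
                            query_expansion=True),
    "- expansion": Pipeline(search="hybrid", reranker="cross-encoder",
                            query_expansion=False),
    "- sparse leg": Pipeline(search="dense", reranker="cross-encoder",
                             query_expansion=True),
}

base_score = None
for name, pipeline in configs.items():
    score = evaluate(pipeline, dataset, metrics=["recall@10"])["recall@10"]
    base_score = score if base_score is None else base_score
    print(f"{name:12s} recall@10={score:.3f} (Δ {score - base_score:+.3f})")
```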
Next steps

Configuration
Tune pipeline settings based on benchmark results
Production deployment
Deploy your best-performing configuration
Building RAG pipelines
Learn to build complete RAG systems
Environment variables
Configure benchmarking environments