Benchmarking allows you to measure and compare retrieval quality across different vector databases, embedding models, and pipeline configurations. VectorDB provides standardized evaluation utilities and datasets to support rigorous, reproducible benchmarking.

Evaluation metrics

VectorDB includes five core retrieval metrics:
| Metric | What it measures | When to optimize |
| --- | --- | --- |
| Recall@k | Fraction of relevant documents retrieved in top-k | When missing relevant documents is costly |
| Precision@k | Fraction of top-k documents that are relevant | When irrelevant results harm user experience |
| MRR | Mean Reciprocal Rank of the first relevant document | When users only look at top results |
| NDCG@k | Normalized Discounted Cumulative Gain | When ranking order matters |
| Hit Rate | Percentage of queries with at least one relevant result | When recall is binary (found vs. not found) |

Metric formulas

# Recall@k = (relevant docs in top-k) / (total relevant docs)
Recall@k = |relevant ∩ retrieved_top_k| / |relevant|

# Precision@k = (relevant docs in top-k) / k
Precision@k = |relevant ∩ retrieved_top_k| / k

# MRR: rank_i is the rank of the first relevant document for query i
MRR = mean(1 / rank_i)

# NDCG@k = DCG@k / IDCG@k, with 0-indexed position i and graded relevance rel_i
NDCG@k = Σ(rel_i / log2(i+2)) / IDCG@k

# Hit Rate: per-query hit indicator, averaged over all queries
hit_i = 1 if |relevant ∩ retrieved_top_k| > 0 else 0
Hit_Rate = mean(hit_i)
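
To make the formulas concrete, here is a small hand-computed example on toy data. It uses plain Python, independent of the library's evaluate_retrieval, and assumes binary relevance:
import math

retrieved = ["d1", "d4", "d2", "d9", "d7"]  # ranked top-5 results for one query
relevant = {"d2", "d3", "d4"}               # ground-truth relevant documents
k = 5

hits = [doc for doc in retrieved[:k] if doc in relevant]
recall_at_k = len(hits) / len(relevant)     # 2/3 ≈ 0.667
precision_at_k = len(hits) / k              # 2/5 = 0.400

# Reciprocal rank of the first relevant result (d4 at rank 2 -> 0.5)
rr = next((1 / (i + 1) for i, doc in enumerate(retrieved) if doc in relevant), 0.0)

# Binary-relevance NDCG@k, matching the log2(i+2) form above (0-indexed i)
dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg_at_k = dcg / idcg                      # ≈ 0.531

hit = 1 if hits else 0                      # this query's Hit Rate contribution
print(recall_at_k, precision_at_k, rr, ndcg_at_k, hit)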

Supported datasets

VectorDB includes loaders for five benchmark datasets:
TriviaQA: Open-domain question-answer pairs for general knowledge retrieval.
dataloader:
  type: "triviaqa"
  split: "test"
  limit: 100
Use case: General knowledge QA systems, broad domain retrieval

ARC: Science reasoning questions requiring multi-hop inference.
dataloader:
  type: "arc"
  split: "test"
  limit: 200
Use case: Scientific and educational content retrieval

PopQA: Factoid questions about popular entities.
dataloader:
  type: "popqa"
  split: "test"
  limit: 100
Use case: Entity-focused retrieval, celebrity and popular culture

FActScore: Atomic facts for verification and hallucination detection.
dataloader:
  type: "factscore"
  split: "test"
  limit: 100
Use case: Fact verification, hallucination detection

Earnings Calls: Financial transcript Q&A for domain-specific RAG.
dataloader:
  type: "earnings_calls"
  split: "test"
  limit: 50
Use case: Financial domain, long-form transcripts

Running evaluations

Basic evaluation

Evaluate a single pipeline configuration:
from vectordb.utils.evaluation import evaluate_retrieval, QueryResult
from vectordb.dataloaders.evaluation import EvaluationExtractor
from vectordb.langchain.semantic_search import PineconeSemanticSearchPipeline

# Initialize pipeline
pipeline = PineconeSemanticSearchPipeline(
    "configs/pinecone_triviaqa.yaml"
)

# Index the corpus (skip if the index is already populated)
pipeline.index()

# Load evaluation queries
records = pipeline.load_dataset()
evaluation_queries = EvaluationExtractor.extract(records, limit=100)

# Run evaluation
query_results = []
for eval_query in evaluation_queries:
    result = pipeline.search(eval_query.query, top_k=10)
    
    query_results.append(
        QueryResult(
            query=eval_query.query,
            retrieved_ids=[doc.id for doc in result["documents"]],
            relevant_ids=set(eval_query.relevant_doc_ids)
        )
    )

# Compute metrics
metrics = evaluate_retrieval(query_results, k=10)

print(f"Results for Pinecone on TriviaQA (k=10):")
print(f"  Recall@10:    {metrics.recall_at_k:.3f}")
print(f"  Precision@10: {metrics.precision_at_k:.3f}")
print(f"  MRR:          {metrics.mrr:.3f}")
print(f"  NDCG@10:      {metrics.ndcg_at_k:.3f}")
print(f"  Hit Rate:     {metrics.hit_rate:.3f}")
print(f"  Queries:      {metrics.num_queries}")

Cross-database comparison

Compare the same configuration across multiple databases:
from vectordb.utils.evaluation import evaluate_retrieval, QueryResult
# The sibling pipeline classes are assumed to live alongside
# PineconeSemanticSearchPipeline in vectordb.langchain.semantic_search
from vectordb.langchain.semantic_search import (
    PineconeSemanticSearchPipeline,
    WeaviateSemanticSearchPipeline,
    MilvusSemanticSearchPipeline,
    QdrantSemanticSearchPipeline,
)
import json

databases = [
    ("Pinecone", "configs/pinecone_triviaqa.yaml", PineconeSemanticSearchPipeline),
    ("Weaviate", "configs/weaviate_triviaqa.yaml", WeaviateSemanticSearchPipeline),
    ("Milvus", "configs/milvus_triviaqa.yaml", MilvusSemanticSearchPipeline),
    ("Qdrant", "configs/qdrant_triviaqa.yaml", QdrantSemanticSearchPipeline),
]

results = {}

for db_name, config_path, pipeline_class in databases:
    print(f"\nEvaluating {db_name}...")
    
    pipeline = pipeline_class(config_path)
    pipeline.index()
    
    query_results = []
    for eval_query in evaluation_queries:
        result = pipeline.search(eval_query.query, top_k=10)
        query_results.append(
            QueryResult(
                query=eval_query.query,
                retrieved_ids=[doc.id for doc in result["documents"]],
                relevant_ids=set(eval_query.relevant_doc_ids)
            )
        )
    
    metrics = evaluate_retrieval(query_results, k=10)
    results[db_name] = metrics.to_dict()
    
    print(f"  Recall@10: {metrics.recall_at_k:.3f}")
    print(f"  MRR:       {metrics.mrr:.3f}")

# Save results
with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)

Comparing retrieval strategies

Benchmark different retrieval approaches:
configs = [
    ("Dense", "configs/semantic_search.yaml"),
    ("Sparse", "configs/sparse_search.yaml"),
    ("Hybrid", "configs/hybrid_search.yaml"),
    ("Hybrid + Reranking", "configs/hybrid_reranking.yaml"),
]

for strategy_name, config_path in configs:
    pipeline = PineconeSemanticSearchPipeline(config_path)
    query_results = run_evaluation(pipeline, evaluation_queries)  # helper sketched below
    metrics = evaluate_retrieval(query_results, k=10)
    print(f"{strategy_name}: Recall@10={metrics.recall_at_k:.3f}")

Evaluation with reranking metrics

When using reranking, track additional quality metrics:
reranker:
  type: "cross_encoder"
  model: "BAAI/bge-reranker-v2-m3"
  top_k: 5

evaluation:
  enabled: true
  metrics:
    - contextual_recall
    - contextual_precision
    - answer_relevancy
    - faithfulness
These metrics evaluate:
  • Contextual Recall: Do retrieved chunks contain information needed for the answer?
  • Contextual Precision: Are retrieved chunks relevant to the question?
  • Answer Relevancy: Does the generated answer address the question?
  • Faithfulness: Is the answer grounded in the retrieved context?
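
These four metrics require a relevance judge, either human labels or an LLM grader; with evaluation.enabled the pipeline computes them for you. For intuition, here is a hand-rolled sketch of contextual precision, where judge_is_relevant is a hypothetical callable you would supply (it is not part of the VectorDB API):
def contextual_precision(question, retrieved_chunks, judge_is_relevant):
    """Fraction of retrieved chunks a judge deems relevant to the question.

    judge_is_relevant is a hypothetical callable (human label lookup or
    an LLM grader) returning True/False.
    """
    if not retrieved_chunks:
        return 0.0
    votes = [judge_is_relevant(question, chunk) for chunk in retrieved_chunks]
    return sum(votes) / len(votes)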

Cost-quality tradeoffs

Evaluate cost alongside quality for production deployments:
from vectordb.utils.evaluation import evaluate_retrieval
import time

candidate_pool_sizes = [5, 10, 15, 25, 50]
results = []

for pool_size in candidate_pool_sizes:
    # load_config and initialize_pipeline stand in for your own config plumbing
    config = load_config("config.yaml")
    config["search"]["candidate_pool_size"] = pool_size
    
    pipeline = initialize_pipeline(config)
    
    start_time = time.time()
    query_results = run_evaluation(pipeline, evaluation_queries)
    elapsed = time.time() - start_time
    
    metrics = evaluate_retrieval(query_results, k=10)
    
    results.append({
        "pool_size": pool_size,
        "recall": metrics.recall_at_k,
        "latency_ms": (elapsed / len(evaluation_queries)) * 1000,
        "estimated_cost": estimate_cost(pool_size, len(evaluation_queries))
    })

# Plot cost vs quality curve
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(
    [r["estimated_cost"] for r in results],
    [r["recall"] for r in results]
)
plt.xlabel("Estimated Cost per 1000 Queries")
plt.ylabel("Recall@10")
plt.title("Cost-Quality Tradeoff")
plt.savefig("cost_quality_curve.png")
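
The loop above calls estimate_cost, which this guide leaves undefined. A minimal placeholder that assumes a flat per-candidate price (the constant is illustrative, not real pricing; replace the whole function with your provider's billing model):
def estimate_cost(pool_size, num_queries, price_per_candidate=0.00001):
    """Toy cost model: every query scores pool_size candidates.

    price_per_candidate is an illustrative constant, not real pricing.
    """
    return pool_size * num_queries * price_per_candidate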

Benchmark configuration best practices

Use the same evaluation queries across all runs:
# Save evaluation queries for reproducibility
import pickle

with open("eval_queries.pkl", "wb") as f:
    pickle.dump(evaluation_queries, f)

# Load in future runs
with open("eval_queries.pkl", "rb") as f:
    evaluation_queries = pickle.load(f)
Run warm-up queries before timing measurements:
# Warm up the pipeline
for _ in range(5):
    pipeline.search("warm up query", top_k=10)

# Now measure performance
start_time = time.time()
# ... run evaluation
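
For latency specifically, recording per-query timings and reporting percentiles is more informative than a single wall-clock total. A sketch, reusing the evaluation_queries and pipeline from above:
import time
import numpy as np

latencies_ms = []
for eval_query in evaluation_queries:
    t0 = time.perf_counter()
    pipeline.search(eval_query.query, top_k=10)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"p50: {np.percentile(latencies_ms, 50):.1f} ms")
print(f"p95: {np.percentile(latencies_ms, 95):.1f} ms")
print(f"p99: {np.percentile(latencies_ms, 99):.1f} ms")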
Average metrics over multiple runs:
import numpy as np

num_runs = 3
all_metrics = []

for run in range(num_runs):
    query_results = run_evaluation(pipeline, evaluation_queries)
    all_metrics.append(evaluate_retrieval(query_results, k=10))

# Compute average and standard deviation
avg_recall = np.mean([m.recall_at_k for m in all_metrics])
std_recall = np.std([m.recall_at_k for m in all_metrics])

print(f"Recall@10: {avg_recall:.3f} ± {std_recall:.3f}")
Set random seeds for reproducibility:
import random
import numpy as np
import torch

# Set seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Now run evaluation

Interpreting results

When to optimize each metric

Recall

Optimize when missing relevant documents is costly: medical diagnosis, legal research, and other safety-critical applications.

Precision

Optimize when showing irrelevant results harms the user experience: consumer search, recommendation systems.

MRR

Optimize when users only examine the top results: web search, autocomplete.

NDCG

Optimize when ranking quality matters more than binary relevance: e-commerce search, content discovery.

Typical metric ranges

| Configuration | Expected Recall@10 | Expected MRR |
| --- | --- | --- |
| Dense search (baseline) | 0.65-0.75 | 0.45-0.55 |
| Sparse search (BM25) | 0.60-0.70 | 0.40-0.50 |
| Hybrid search | 0.75-0.85 | 0.55-0.65 |
| Hybrid + Reranking | 0.80-0.90 | 0.65-0.75 |
| Agentic RAG | 0.85-0.95 | 0.70-0.80 |
These ranges assume well-tuned configurations on standard benchmarks like TriviaQA or ARC.

Advanced benchmarking

Per-query analysis

Identify queries where the pipeline struggles:
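The loop below calls compute_recall_at_k; if your installation doesn't export such a helper, this minimal stand-in matches the Recall@k formula above:
def compute_recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Per-query Recall@k: relevant docs found in top-k / total relevant."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)
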
failed_queries = []

for query_result in query_results:
    recall = compute_recall_at_k(
        query_result.retrieved_ids,
        query_result.relevant_ids,
        k=10
    )
    
    if recall < 0.5:
        failed_queries.append({
            "query": query_result.query,
            "recall": recall,
            "retrieved": query_result.retrieved_ids[:3],
            "expected": list(query_result.relevant_ids)
        })

# Analyze failure patterns
print(f"Failed queries: {len(failed_queries)}")
for failure in failed_queries[:5]:
    print(f"\nQuery: {failure['query']}")
    print(f"Recall: {failure['recall']:.2f}")

Ablation studies

Measure the impact of individual components:
ablation_configs = [
    ("Baseline", {"reranking": False, "query_enhancement": False}),
    ("+ Reranking", {"reranking": True, "query_enhancement": False}),
    ("+ Query Enhancement", {"reranking": False, "query_enhancement": True}),
    ("+ Both", {"reranking": True, "query_enhancement": True}),
]

for name, overrides in ablation_configs:
    config = base_config.copy()
    config.update(overrides)
    
    pipeline = initialize_pipeline(config)
    query_results = run_evaluation(pipeline, evaluation_queries)
    metrics = evaluate_retrieval(query_results, k=10)
    print(f"{name}: Recall@10={metrics.recall_at_k:.3f}")

Next steps

• Configuration: Tune pipeline settings based on benchmark results
• Production deployment: Deploy your best-performing configuration
• Building RAG pipelines: Learn to build complete RAG systems
• Environment variables: Configure benchmarking environments
