VectorDB is organized around a clear architectural separation between database backends, framework integrations, and shared utilities.

Package structure

The VectorDB package follows a modular architecture that separates concerns across five core modules:
src/vectordb/
├── databases/          # Backend wrappers (Chroma, Milvus, Pinecone, Qdrant, Weaviate)
├── dataloaders/        # Dataset loading, normalization, and evaluation query extraction
├── haystack/           # Haystack feature implementations
├── langchain/          # LangChain feature implementations
└── utils/              # Shared utilities (config, evaluation, sparse, ids, scope, logging)

Design principles

The toolkit is organized around three core ideas:

1. Backend wrappers

Normalize five vector databases (Pinecone, Weaviate, Chroma, Milvus, Qdrant) into a consistent interface. Each wrapper provides:
  • Lazy or eager initialization depending on backend characteristics
  • Unified document format conversion between framework and database formats
  • Feature parity across operations (create, upsert, query, delete)
  • Backend-specific optimizations (namespaces for Pinecone, tenants for Weaviate, partitions for Milvus)
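The wrapper contract described above can be sketched as an abstract base class. The class and method names here are illustrative assumptions for exposition, not the package's actual API:

```python
from abc import ABC, abstractmethod
from typing import Any


class VectorDatabase(ABC):
    """Hypothetical sketch of the common contract every backend wrapper fulfils."""

    @abstractmethod
    def create_index(self, name: str, dimension: int) -> None:
        """Create the backing collection/index with the given vector dimension."""

    @abstractmethod
    def upsert(self, documents: list[dict[str, Any]]) -> int:
        """Insert or update documents; return the number written."""

    @abstractmethod
    def query(self, vector: list[float], top_k: int = 10) -> list[dict[str, Any]]:
        """Return the top_k nearest documents for a dense query vector."""

    @abstractmethod
    def delete(self, ids: list[str]) -> None:
        """Remove documents by ID."""
```

Each concrete wrapper (Pinecone, Weaviate, Chroma, Milvus, Qdrant) would implement this surface while translating to its vendor SDK internally.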

2. Feature modules

Implement retrieval patterns for both Haystack and LangChain, each in its own directory:
  • semantic_search - Dense vector similarity search
  • hybrid_indexing - Dense + sparse (BM25/SPLADE) retrieval
  • metadata_filtering - Constraint-based filtering
  • reranking - Cross-encoder reranking for improved relevance
  • mmr - Maximal Marginal Relevance for diversity
  • diversity_filtering - Alternative diversity implementations
  • contextual_compression - Context trimming before generation
  • query_enhancement - Multi-query, HyDE, step-back prompting
  • agentic_rag - Iterative self-reflection and routing
  • parent_document_retrieval - Index small chunks, return large context
  • json_indexing - Structured document indexing
  • sparse_indexing - Pure lexical retrieval
  • multi_tenancy - Partition-based or tenant-based isolation
  • namespaces - Logical data segmentation
  • cost_optimized_rag - Token budget controls
Each feature module contains:
  • configs/ - Per-backend, per-dataset YAML configuration files
  • indexing/ - Scripts to load datasets and populate vector databases
  • search/ - Scripts to execute queries and compute evaluation metrics

3. Dataloaders and evaluation

Load standard QA benchmarks, convert them to framework documents, and compute retrieval metrics.
  • Catalog-based loader creation - DataloaderCatalog.create("triviaqa")
  • Normalized record format - All datasets produce DatasetRecord objects
  • Framework conversion - Automatic conversion to Haystack or LangChain documents
  • Standard metrics - Recall@k, Precision@k, MRR, NDCG@k, Hit Rate
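The catalog/factory pattern behind `DataloaderCatalog.create("triviaqa")` can be illustrated with a minimal registry. This is a sketch of the pattern, not the package's actual implementation; the decorator and loader class here are assumptions:

```python
class DataloaderCatalog:
    """Minimal registry mapping dataset names to loader classes (illustrative)."""

    _registry: dict = {}

    @classmethod
    def register(cls, name):
        """Decorator that registers a loader class under a dataset name."""
        def decorator(loader_cls):
            cls._registry[name] = loader_cls
            return loader_cls
        return decorator

    @classmethod
    def create(cls, name, **kwargs):
        """Instantiate the loader registered under `name`."""
        return cls._registry[name](**kwargs)


@DataloaderCatalog.register("triviaqa")
class TriviaQALoader:
    """Stand-in loader; a real one would yield DatasetRecord objects."""
    def __init__(self, split="test", limit=None):
        self.split = split
        self.limit = limit
```

New datasets then plug in by registering a loader class, with no changes to calling code.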

Module boundaries

databases/

Shared vector database wrappers used by both Haystack and LangChain integrations. Purpose: Provide a consistent interface across different vector database backends, hiding vendor-specific API differences. Key files:
  • pinecone.py - Pinecone Cloud/Serverless (GRPC, namespace-based multi-tenancy)
  • weaviate.py - Weaviate Cloud/self-hosted (hybrid search, generative AI)
  • chroma.py - ChromaDB (local/HTTP, development-friendly)
  • milvus.py - Milvus/Zilliz (partition-key multi-tenancy, scalable)
  • qdrant.py - Qdrant (named vectors, quantization, payload filtering)

dataloaders/

Dataset normalization layer that loads benchmark datasets into a common record format. Purpose: Abstract away dataset-specific formats and provide consistent evaluation queries. Key modules:
  • catalog.py - Factory for creating dataset loaders
  • base.py - Abstract base class defining the loading contract
  • types.py - Shared types (DatasetRecord, EvaluationQuery)
  • converters.py - Convert normalized records to framework documents
  • datasets/ - Per-dataset loaders (TriviaQA, ARC, PopQA, FActScore, Earnings Calls)

haystack/

Complete Haystack implementations of the retrieval feature patterns. Purpose: Provide production-ready Haystack pipelines with per-backend configs and scripts. Structure:
  • Each feature has configs/, indexing/, and search/ subdirectories
  • components/ - Reusable Haystack components (router, compressor, enhancer)
  • utils/ - Haystack-specific helpers (embeddings, reranker, fusion, RAG)

langchain/

Complete LangChain implementations of the retrieval feature patterns, parallel in structure to Haystack. Purpose: Provide LangChain-native integrations with the same feature coverage as Haystack. Structure:
  • Mirrors haystack/ organization for consistency
  • components/ - Reusable LangChain components
  • utils/ - LangChain-specific helpers

utils/

Shared utilities used across all modules. Purpose: Centralize common functionality to avoid duplication. Key modules:
  • config_loader.py - YAML configuration with environment variable substitution
  • evaluation.py - Retrieval metrics (Recall@k, Precision@k, MRR, NDCG@k, Hit Rate)
  • sparse.py - Sparse embedding conversion for hybrid search
  • ids.py - Document ID management and deterministic generation
  • scope.py - Scope/namespace injection utilities
  • logging.py - Centralized logging configuration
  • Document converters for each backend (Pinecone, Weaviate, Chroma, Qdrant, Milvus)
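The metrics in evaluation.py follow standard information-retrieval definitions. Recall@k and MRR, for instance, reduce to a few lines each; this is a sketch of the definitions, not the module's actual code:

```python
def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mrr(retrieved: list, relevant: list) -> float:
    """Reciprocal rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```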

Configuration format

All pipelines use YAML configuration files with environment variable substitution:
pinecone:
  api_key: "${PINECONE_API_KEY}"
  index_name: "my-index"

embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"
  device: "cpu"
  batch_size: 32

dataloader:
  dataset: "triviaqa"
  split: "test"
  limit: 500

rag:
  enabled: true
  model: "llama-3.3-70b-versatile"
  api_key: "${GROQ_API_KEY}"
  temperature: 0.7
  max_tokens: 2048
Variables follow ${VAR} or ${VAR:-default} syntax.
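The substitution rule can be sketched with a small resolver. This is illustrative of the `${VAR}` / `${VAR:-default}` behavior described above; config_loader.py's actual implementation may differ:

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; group 2 captures the default if present.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def substitute_env(text: str) -> str:
    """Replace ${VAR} and ${VAR:-default} placeholders with environment values."""
    def repl(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = os.environ.get(name, default)
        if value is None:
            raise KeyError(f"environment variable {name} is not set and has no default")
        return value
    return _PATTERN.sub(repl, text)
```

Applying this to the raw YAML text before parsing keeps secrets like `PINECONE_API_KEY` out of the config files themselves.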

Execution flow

A typical retrieval pipeline follows this flow:
  1. Configuration loading - Load YAML config with environment variable resolution
  2. Dataset loading - Use DataloaderCatalog to create and load dataset
  3. Document conversion - Convert DatasetRecord objects to framework documents
  4. Database initialization - Create vector database wrapper with config
  5. Index creation - Create collection/index with appropriate schema
  6. Document indexing - Upsert documents with embeddings (dense and/or sparse)
  7. Query execution - Run evaluation queries through the retrieval pipeline
  8. Metric computation - Compute Recall@k, Precision@k, MRR, NDCG@k, Hit Rate
  9. Result output - Display or save evaluation results
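The nine steps above can be sketched end to end with in-memory stand-ins. Every name here is hypothetical; real runs use the package's wrappers, embedders, and dataloaders rather than the toy lexical index below:

```python
def load_config() -> dict:
    # 1. Configuration loading (stand-in for the YAML loader)
    return {"dataset": "triviaqa", "top_k": 2}


def load_dataset(name: str) -> list[dict]:
    # 2-3. Dataset loading and conversion to framework documents (stand-in)
    return [
        {"id": "d1", "text": "Paris is the capital of France"},
        {"id": "d2", "text": "Berlin is the capital of Germany"},
    ]


class InMemoryIndex:
    # 4-5. Database initialization and index creation (stand-in)
    def __init__(self):
        self.docs: dict = {}

    def upsert(self, docs: list[dict]) -> None:
        # 6. Document indexing (real code would attach dense/sparse embeddings)
        for doc in docs:
            self.docs[doc["id"]] = doc

    def query(self, term: str, top_k: int) -> list[str]:
        # 7. Query execution (toy substring match instead of vector search)
        hits = [d["id"] for d in self.docs.values() if term in d["text"]]
        return hits[:top_k]


def recall(retrieved: list, relevant: list) -> float:
    # 8. Metric computation
    return len(set(retrieved) & set(relevant)) / len(relevant)


config = load_config()
index = InMemoryIndex()
index.upsert(load_dataset(config["dataset"]))
retrieved = index.query("Paris", config["top_k"])
print(recall(retrieved, ["d1"]))  # 9. Result output → 1.0
```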

Cross-cutting concerns

Metadata flattening

Different vector databases have different metadata requirements:
  • Pinecone requires scalar values or string lists (nested dicts auto-flattened with underscores)
  • Weaviate uses schema properties (flexible types)
  • Chroma accepts nested dictionaries
  • Milvus uses typed fields in schema
  • Qdrant supports nested JSON payloads
Wrappers handle these differences transparently.
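The flattening behavior the wrappers apply for Pinecone-style backends can be sketched as a recursive helper (illustrative; the actual converter also handles lists and type coercion):

```python
def flatten_metadata(meta: dict, sep: str = "_") -> dict:
    """Flatten nested dicts into scalar values keyed by underscore-joined paths,
    e.g. {"source": {"page": 3}} -> {"source_page": 3}."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, dict):
            for sub_key, sub_value in flatten_metadata(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```

Backends that accept nested payloads (Chroma, Qdrant) can skip this step and store the metadata as-is.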

Sparse embedding normalization

Hybrid search implementations normalize sparse embeddings across backends:
  • Pinecone uses {"indices": [...], "values": [...]}
  • Weaviate uses native BM25 (no external sparse vectors)
  • Qdrant uses named sparse vectors
  • Milvus uses separate sparse vector field
The utils/sparse.py module provides conversion utilities.
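The kind of conversion utils/sparse.py performs can be sketched as follows, starting from a `{token_index: weight}` mapping. The function names and the `"sparse"` vector name are assumptions for illustration:

```python
def to_pinecone_sparse(weights: dict[int, float]) -> dict:
    """Convert {token_index: weight} into Pinecone's parallel-list format."""
    items = sorted(weights.items())
    return {
        "indices": [index for index, _ in items],
        "values": [weight for _, weight in items],
    }


def to_qdrant_sparse(weights: dict[int, float], name: str = "sparse") -> dict:
    """Wrap the same data as a named sparse vector, Qdrant-style."""
    converted = to_pinecone_sparse(weights)
    return {name: {"indices": converted["indices"], "values": converted["values"]}}
```

Weaviate needs no such conversion, since its BM25 scoring is computed server-side from the stored text.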

Multi-tenancy patterns

Different backends implement isolation differently:
  • Pinecone - Namespaces (lightweight, per-query scoping)
  • Weaviate - Tenants (full isolation, separate shards)
  • Milvus - Partition keys (schema-level partitioning)
  • Qdrant - Collections or payload filtering
  • Chroma - Metadata filtering
Feature modules adapt to each backend’s native pattern.
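One logical tenant scope thus maps onto different query parameters per backend. A minimal dispatch sketch (keyword names here are assumptions, not the vendor SDKs' exact signatures):

```python
def scoped_query_kwargs(backend: str, tenant_id: str) -> dict:
    """Map a logical tenant scope onto each backend's native isolation mechanism."""
    if backend == "pinecone":
        return {"namespace": tenant_id}          # lightweight per-query scoping
    if backend == "weaviate":
        return {"tenant": tenant_id}             # full isolation, separate shards
    if backend == "milvus":
        return {"partition_key": tenant_id}      # schema-level partitioning
    # Qdrant and Chroma fall back to payload/metadata filtering
    return {"filter": {"tenant_id": tenant_id}}
```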
