Package structure
The VectorDB package follows a modular architecture that separates concerns across five core modules.

Design principles

The toolkit is organized around three core ideas:

1. Backend wrappers

Normalize five vector databases (Pinecone, Weaviate, Chroma, Milvus, Qdrant) into a consistent interface. Each wrapper provides:
- Lazy or eager initialization depending on backend characteristics
- Unified document format conversion between framework and database formats
- Feature parity across operations (create, upsert, query, delete)
- Backend-specific optimizations (namespaces for Pinecone, tenants for Weaviate, partitions for Milvus)
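A minimal sketch of what such a normalized wrapper contract might look like. The class and method names here are illustrative assumptions, not the package's actual API; a toy in-memory backend shows how a concrete wrapper fills in the contract:

```python
from abc import ABC, abstractmethod
from typing import Any


class VectorDBWrapper(ABC):
    """Illustrative base class: one consistent surface over backend clients."""

    @abstractmethod
    def create_index(self, name: str, dimension: int) -> None: ...

    @abstractmethod
    def upsert(self, documents: list[dict[str, Any]]) -> int: ...

    @abstractmethod
    def query(self, vector: list[float], top_k: int = 10) -> list[dict[str, Any]]: ...

    @abstractmethod
    def delete(self, ids: list[str]) -> None: ...


class InMemoryWrapper(VectorDBWrapper):
    """Toy backend demonstrating feature parity with the contract above."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def create_index(self, name: str, dimension: int) -> None:
        self.name, self.dimension = name, dimension

    def upsert(self, documents: list[dict[str, Any]]) -> int:
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def query(self, vector: list[float], top_k: int = 10) -> list[dict[str, Any]]:
        # Rank stored documents by dot product against the query vector.
        scored = sorted(
            self._docs.values(),
            key=lambda d: sum(a * b for a, b in zip(vector, d["vector"])),
            reverse=True,
        )
        return scored[:top_k]

    def delete(self, ids: list[str]) -> None:
        for doc_id in ids:
            self._docs.pop(doc_id, None)
```

Real wrappers would map these calls onto vendor clients (e.g. namespaces for Pinecone, tenants for Weaviate), keeping the calling code backend-agnostic.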
2. Feature modules
Implement retrieval patterns for both Haystack and LangChain, each in its own directory:
- semantic_search - Dense vector similarity search
- hybrid_indexing - Dense + sparse (BM25/SPLADE) retrieval
- metadata_filtering - Constraint-based filtering
- reranking - Cross-encoder reranking for improved relevance
- mmr - Maximal Marginal Relevance for diversity
- diversity_filtering - Alternative diversity implementations
- contextual_compression - Context trimming before generation
- query_enhancement - Multi-query, HyDE, step-back prompting
- agentic_rag - Iterative self-reflection and routing
- parent_document_retrieval - Index small chunks, return large context
- json_indexing - Structured document indexing
- sparse_indexing - Pure lexical retrieval
- multi_tenancy - Partition-based or tenant-based isolation
- namespaces - Logical data segmentation
- cost_optimized_rag - Token budget controls
Each feature directory contains:
- configs/ - Per-backend, per-dataset YAML configuration files
- indexing/ - Scripts to load datasets and populate vector databases
- search/ - Scripts to execute queries and compute evaluation metrics
3. Dataloaders and evaluation
Load standard QA benchmarks, convert them to framework documents, and compute retrieval metrics.
- Catalog-based loader creation - DataloaderCatalog.create("triviaqa")
- Normalized record format - All datasets produce DatasetRecord objects
- Framework conversion - Automatic conversion to Haystack or LangChain documents
- Standard metrics - Recall@k, Precision@k, MRR, NDCG@k, Hit Rate
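The catalog-based loader creation above can be sketched as a small factory registry. DataloaderCatalog, DatasetRecord, and the "triviaqa" name come from this document; the registration mechanics and record fields below are illustrative assumptions, not the package's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DatasetRecord:
    """Normalized record shape (fields are illustrative)."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


class DataloaderCatalog:
    """Toy factory registry mirroring catalog-based loader creation."""

    _loaders: dict[str, Callable[[], list[DatasetRecord]]] = {}

    @classmethod
    def register(cls, name: str) -> Callable:
        def decorator(fn: Callable[[], list[DatasetRecord]]) -> Callable:
            cls._loaders[name] = fn
            return fn
        return decorator

    @classmethod
    def create(cls, name: str) -> list[DatasetRecord]:
        # Look up and invoke the registered loader for this dataset name.
        return cls._loaders[name]()


@DataloaderCatalog.register("triviaqa")
def _load_triviaqa() -> list[DatasetRecord]:
    # Stand-in for a real benchmark loader.
    return [DatasetRecord(id="q1", text="Who wrote Dune?",
                          metadata={"answer": "Frank Herbert"})]
```

Because every loader returns the same DatasetRecord shape, downstream converters can target Haystack or LangChain documents without dataset-specific branches.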
Module boundaries
databases/
Shared vector database wrappers used by both Haystack and LangChain integrations.

Purpose: Provide a consistent interface across different vector database backends, hiding vendor-specific API differences.

Key files:
- pinecone.py - Pinecone Cloud/Serverless (gRPC, namespace-based multi-tenancy)
- weaviate.py - Weaviate Cloud/self-hosted (hybrid search, generative AI)
- chroma.py - ChromaDB (local/HTTP, development-friendly)
- milvus.py - Milvus/Zilliz (partition-key multi-tenancy, scalable)
- qdrant.py - Qdrant (named vectors, quantization, payload filtering)
dataloaders/
Dataset normalization layer that loads benchmark datasets into a common record format.

Purpose: Abstract away dataset-specific formats and provide consistent evaluation queries.

Key modules:
- catalog.py - Factory for creating dataset loaders
- base.py - Abstract base class defining the loading contract
- types.py - Shared types (DatasetRecord, EvaluationQuery)
- converters.py - Convert normalized records to framework documents
- datasets/ - Per-dataset loaders (TriviaQA, ARC, PopQA, FActScore, Earnings Calls)
haystack/
Complete Haystack implementations of all 17 retrieval feature patterns.

Purpose: Provide production-ready Haystack pipelines with per-backend configs and scripts.

Structure:
- Each feature has configs/, indexing/, and search/ subdirectories
- components/ - Reusable Haystack components (router, compressor, enhancer)
- utils/ - Haystack-specific helpers (embeddings, reranker, fusion, RAG)
langchain/
Complete LangChain implementations of all 17 retrieval feature patterns, parallel in structure to Haystack.

Purpose: Provide LangChain-native integrations with the same feature coverage as Haystack.

Structure:
- Mirrors haystack/ organization for consistency
- components/ - Reusable LangChain components
- utils/ - LangChain-specific helpers
utils/
Shared utilities used across all modules.

Purpose: Centralize common functionality to avoid duplication.

Key modules:
- config_loader.py - YAML configuration with environment variable substitution
- evaluation.py - Retrieval metrics (Recall@k, Precision@k, MRR, NDCG@k, Hit Rate)
- sparse.py - Sparse embedding conversion for hybrid search
- ids.py - Document ID management and deterministic generation
- scope.py - Scope/namespace injection utilities
- logging.py - Centralized logging configuration
- Document converters for each backend (Pinecone, Weaviate, Chroma, Qdrant, Milvus)
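As an illustration of deterministic ID generation, content hashing makes re-indexing idempotent: the same document always maps to the same ID, so repeated runs upsert rather than duplicate. The function below is a sketch of the idea, not necessarily what ids.py actually does:

```python
import hashlib


def deterministic_id(text: str, namespace: str = "") -> str:
    """Derive a stable document ID from content (and an optional
    namespace) so re-indexing the same document overwrites it."""
    digest = hashlib.sha256(f"{namespace}:{text}".encode("utf-8")).hexdigest()
    # A 16-hex-character prefix keeps IDs short while staying
    # collision-resistant for realistic corpus sizes.
    return digest[:16]
```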
Configuration format
All pipelines use YAML configuration files with environment variable substitution, using ${VAR} or ${VAR:-default} syntax.
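A minimal sketch of how ${VAR} / ${VAR:-default} substitution can be implemented; the toolkit's config_loader.py may differ in details:

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def substitute_env(text: str) -> str:
    """Expand ${VAR} and ${VAR:-default} references using os.environ."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = os.environ.get(name)
        if value is not None:
            return value
        if default is not None:
            return default
        raise KeyError(f"Environment variable {name} is not set and has no default")
    return _PATTERN.sub(replace, text)
```

Applied to raw YAML text before parsing, this keeps secrets such as API keys out of the config files themselves.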
Execution flow
A typical retrieval pipeline follows this flow:
- Configuration loading - Load YAML config with environment variable resolution
- Dataset loading - Use DataloaderCatalog to create and load the dataset
- Document conversion - Convert DatasetRecord objects to framework documents
- Database initialization - Create vector database wrapper with config
- Index creation - Create collection/index with appropriate schema
- Document indexing - Upsert documents with embeddings (dense and/or sparse)
- Query execution - Run evaluation queries through the retrieval pipeline
- Metric computation - Compute Recall@k, Precision@k, MRR, NDCG@k, Hit Rate
- Result output - Display or save evaluation results
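The metric-computation step can be illustrated with self-contained implementations of a few of the listed metrics. These follow the standard definitions and are not necessarily the toolkit's exact code:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top-k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0
```

Each evaluation query contributes one score per metric; the per-query scores are then averaged across the benchmark.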
Cross-cutting concerns
Metadata flattening
Different vector databases have different metadata requirements:
- Pinecone requires scalar values or string lists (nested dicts auto-flattened with underscores)
- Weaviate uses schema properties (flexible types)
- Chroma accepts nested dictionaries
- Milvus uses typed fields in schema
- Qdrant supports nested JSON payloads
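A sketch of the underscore-based flattening that Pinecone-style backends require; the toolkit's converters may differ in details:

```python
def flatten_metadata(metadata: dict, sep: str = "_") -> dict:
    """Flatten nested dicts into scalar keys, e.g. {"doc": {"year": 2024}}
    -> {"doc_year": 2024}, for backends that reject nested values."""
    flat: dict = {}
    for key, value in metadata.items():
        if isinstance(value, dict):
            # Recurse, then prefix each child key with the parent key.
            for sub_key, sub_value in flatten_metadata(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```

Backends that accept nested payloads (Chroma, Qdrant) can skip this step and store the metadata as-is.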
Sparse embedding normalization
Hybrid search implementations normalize sparse embeddings across backends:
- Pinecone uses {"indices": [...], "values": [...]}
- Weaviate uses native BM25 (no external sparse vectors)
- Qdrant uses named sparse vectors
- Milvus uses separate sparse vector field
The utils/sparse.py module provides conversion utilities.
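For example, converting a {token_id: weight} mapping into Pinecone's parallel-array sparse format might look like the following (the function name and input shape are illustrative, not the module's actual API):

```python
def to_pinecone_sparse(weights: dict[int, float]) -> dict:
    """Convert {token_id: weight} to Pinecone's parallel-array format,
    dropping zero weights and keeping indices sorted."""
    items = sorted((i, w) for i, w in weights.items() if w != 0.0)
    return {
        "indices": [i for i, _ in items],
        "values": [w for _, w in items],
    }
```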
Multi-tenancy patterns
Different backends implement isolation differently:
- Pinecone - Namespaces (lightweight, per-query scoping)
- Weaviate - Tenants (full isolation, separate shards)
- Milvus - Partition keys (schema-level partitioning)
- Qdrant - Collections or payload filtering
- Chroma - Metadata filtering