The VectorStore class provides a persistent vector database for storing and querying document embeddings with ChromaDB. It handles collection management, document storage, and similarity search.
Class definition
Constructor parameters
- Collection name: name of the ChromaDB collection to create or use. Typically formatted as the repository name with underscores (e.g., "facebook_react").
- Persist directory: directory path where the ChromaDB database is persisted. The database files are stored here for reuse across sessions.
The constructor automatically initializes the vector store by creating the persist directory if needed, connecting to ChromaDB, and setting up the collection with cosine similarity.
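The collection naming convention mentioned above can be illustrated with a small helper. This is only a sketch: the function name collection_name_for_repo and the "owner/repo" input format are assumptions made for illustration and are not part of VectorStore.

```python
import re


def collection_name_for_repo(repo: str) -> str:
    """Turn a repository identifier such as "facebook/react" into "facebook_react".

    Hypothetical helper: VectorStore itself only receives the finished name.
    """
    # Replace every run of characters other than letters, digits, or
    # underscores with a single underscore.
    name = re.sub(r"[^A-Za-z0-9_]+", "_", repo.strip())
    # Trim underscores left over at either end (e.g. from a trailing slash).
    return name.strip("_")


print(collection_name_for_repo("facebook/react"))  # prints facebook_react
```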
Methods
add_documents()
Adds documents and their embeddings to the vector store.
- Documents: list of document objects with page_content and metadata attributes. Typically LangChain Document objects.
- Embeddings: NumPy array of embeddings with shape (len(documents), embedding_dimension). Must have the same length as the documents list.
The method automatically generates a unique ID for each document, extracts metadata, and converts NumPy embeddings to lists for ChromaDB compatibility. It also adds doc_index and content_length to each document’s metadata.
_initialize_store()
Internal method that sets up the ChromaDB client and collection.
- Creates the persist directory if it doesn’t exist
- Initializes a persistent ChromaDB client
- Deletes any existing collection with the same name (for fresh indexing)
- Creates a new collection with cosine similarity and metadata
- Reports the collection status
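The steps above could look roughly like the following. This is a sketch under stated assumptions, not the actual method body: the function name initialize_store, the client_factory parameter (added so the sketch can be tried with a stand-in client), and the status message are all hypothetical. The ChromaDB calls used are PersistentClient, delete_collection, create_collection, and count.

```python
import os


def initialize_store(collection_name, persist_directory, client_factory=None):
    """Create a fresh cosine-similarity collection (sketch, not the real method)."""
    # Create the persist directory if it doesn't exist.
    os.makedirs(persist_directory, exist_ok=True)

    if client_factory is None:
        # Imported lazily so the sketch can be exercised with a stand-in client.
        import chromadb
        client_factory = chromadb.PersistentClient

    # Initialize a persistent, disk-backed client.
    client = client_factory(path=persist_directory)

    # Delete any existing collection with the same name for fresh indexing.
    # The exception raised for a missing collection differs between ChromaDB
    # releases, so this sketch catches Exception broadly.
    try:
        client.delete_collection(name=collection_name)
    except Exception:
        pass

    # "hnsw:space" selects the distance function used by the HNSW index.
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )

    # Report the collection status.
    print(f"Collection '{collection_name}' ready with {collection.count()} documents")
    return client, collection
```

Because the collection is deleted and recreated, calling this against an existing persist directory discards whatever was indexed under the same name.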
Collection metadata
Each collection is created with the following metadata:
Document metadata
Each stored document includes:
- All metadata from the source document (e.g., path, source, branch)
- doc_index: sequential index of the document in the batch
- content_length: character count of the document content
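To make the stored metadata concrete, here is a minimal sketch of how a batch might be prepared before it is handed to ChromaDB. The Doc dataclass is a stand-in for a LangChain Document, and prepare_batch, its id_prefix parameter, and the "doc_0"-style IDs are assumptions for illustration; the ID format actually used by VectorStore may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def prepare_batch(documents, embeddings, id_prefix="doc"):
    """Build the ids, texts, metadatas, and embedding lists for one batch."""
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")

    ids, texts, metadatas = [], [], []
    for index, doc in enumerate(documents):
        # Hypothetical ID scheme, unique only within a single batch.
        ids.append(f"{id_prefix}_{index}")
        texts.append(doc.page_content)
        # Copy the source metadata, then add the two derived fields.
        meta = dict(doc.metadata)
        meta["doc_index"] = index
        meta["content_length"] = len(doc.page_content)
        metadatas.append(meta)

    # NumPy arrays expose tolist(); other nested sequences are copied row by row.
    if hasattr(embeddings, "tolist"):
        vectors = embeddings.tolist()
    else:
        vectors = [list(row) for row in embeddings]
    return ids, texts, metadatas, vectors


docs = [Doc("def main(): ...", {"path": "src/main.py", "branch": "main"})]
ids, texts, metadatas, vectors = prepare_batch(docs, [[0.1, 0.2, 0.3]])
print(metadatas[0])
```

The four returned lists correspond to the ids, documents, metadatas, and embeddings arguments of ChromaDB's collection.add().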
Usage example
Integration example
From main.py, showing the complete vector store setup:
Querying the collection
The collection can be queried directly using ChromaDB’s API:
Collection management
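Both querying and routine management go through ChromaDB's collection object. The sketch below wraps a similarity query and flattens the result; the function name top_matches and the shape of the returned dictionaries are assumptions for illustration, while query() and its query_embeddings, n_results, and include arguments are part of ChromaDB's collection API.

```python
def top_matches(collection, query_embedding, k=5):
    """Return the k nearest documents as a list of dicts (sketch)."""
    result = collection.query(
        # Plain Python floats, in case a NumPy vector is passed in.
        query_embeddings=[[float(x) for x in query_embedding]],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    # query() returns one result list per query embedding; only one was sent.
    rows = zip(
        result["ids"][0],
        result["documents"][0],
        result["metadatas"][0],
        result["distances"][0],
    )
    return [
        {
            "id": doc_id,
            "content": text,
            "metadata": meta,
            # With cosine space the reported distance is 1 - cosine similarity.
            "similarity": 1.0 - distance,
        }
        for doc_id, text, meta, distance in rows
    ]
```

For management tasks, collection.count() returns the number of stored items, and the client's get_collection() and delete_collection() look up or remove a collection by name.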
Document ID format
Documents are assigned unique IDs in the format:
Persistence and reuse
Error handling
Implementation notes
- Uses ChromaDB’s PersistentClient for disk-based storage
- Collections are recreated on each initialization (any existing collection with the same name is deleted)
- Uses HNSW (Hierarchical Navigable Small World) index with cosine similarity
- Embeddings are converted from NumPy arrays to Python lists for ChromaDB compatibility
- All document content is stored alongside embeddings for retrieval
- Thread-safe for concurrent reads (write operations should be serialized)
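One way to follow the last note is to route all writes through a lock while leaving reads untouched. The wrapper below is a sketch only: the SerializedWriter class is hypothetical and not part of VectorStore, and it assumes the wrapped object exposes add() and query() like a ChromaDB collection.

```python
import threading


class SerializedWriter:
    """Wrap a collection so that add() calls run one at a time (sketch)."""

    def __init__(self, collection):
        self._collection = collection
        self._lock = threading.Lock()

    def add(self, **kwargs):
        # Only one thread may write at a time.
        with self._lock:
            return self._collection.add(**kwargs)

    def query(self, **kwargs):
        # Reads bypass the lock and may run concurrently.
        return self._collection.query(**kwargs)
```

A lock like this only serializes writers within a single process; separate processes sharing one persist directory would need their own coordination.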