Vector store

The VectorStore class provides a persistent vector database for storing and querying document embeddings using ChromaDB. It handles collection management, document storage, and similarity search operations.

Class definition

class VectorStore:
    def __init__(self, collection_name: str, persist_directory: str) -> None

Constructor parameters

collection_name (str, required)

Name of the ChromaDB collection to create or use. Typically the repository name with the slash replaced by an underscore (e.g., "facebook_react").

persist_directory (str, required)

Directory path where the ChromaDB database is persisted. The database files stay on disk across sessions, but the collection itself is recreated each time the class is instantiated (see _initialize_store()).

The constructor initializes the vector store automatically: it creates the persist directory if needed, connects to ChromaDB, deletes any existing collection with the same name, and creates a fresh collection configured for cosine similarity.

Methods

add_documents()

Adds documents and their embeddings to the vector store.

def add_documents(self, documents: List[Any], embeddings: numpy.ndarray) -> None

documents (List[Any], required)

List of document objects with page_content and metadata attributes. Typically LangChain Document objects.

embeddings (numpy.ndarray, required)

NumPy array of embeddings with shape (len(documents), embedding_dimension). Must have the same length as the documents list.

The method automatically generates unique IDs for each document, extracts metadata, and converts NumPy embeddings to lists for ChromaDB compatibility. It also adds doc_index and content_length to each document’s metadata.
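The preparation steps described above can be sketched as follows. This is an illustrative sketch, not the class's actual source: `prepare_batch` and `Doc` are hypothetical names, and the wording of the error message is an assumption. It builds the four parallel lists (`ids`, `documents`, `metadatas`, `embeddings`) that ChromaDB's `collection.add()` accepts.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def prepare_batch(
    documents: List[Any], embeddings: Sequence[Sequence[float]]
) -> Dict[str, list]:
    """Build the parallel lists for one batch (accepts a NumPy array or nested lists)."""
    if len(documents) != len(embeddings):
        # Assumed wording; the real message may differ.
        raise ValueError("Number of documents must match number of embeddings")

    ids, texts, metadatas, vectors = [], [], [], []
    for i, (doc, vector) in enumerate(zip(documents, embeddings)):
        ids.append(f"doc_{uuid.uuid4().hex[:8]}_{i}")
        texts.append(doc.page_content)
        # Copy the source metadata, then add the two derived fields.
        metadata = dict(doc.metadata)
        metadata["doc_index"] = i
        metadata["content_length"] = len(doc.page_content)
        metadatas.append(metadata)
        # Convert each row to a plain list of Python floats.
        vectors.append([float(x) for x in vector])

    return {"ids": ids, "documents": texts, "metadatas": metadatas, "embeddings": vectors}
```

Under these assumptions, the returned dictionary could be passed on as `collection.add(**batch)`. Copying the metadata before adding fields keeps the caller's document objects unchanged.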

_initialize_store()

Internal method that sets up the ChromaDB client and collection.

def _initialize_store(self) -> None

This method:

Creates the persist directory if it doesn’t exist
Initializes a persistent ChromaDB client
Deletes any existing collection with the same name (for fresh indexing)
Creates a new collection with cosine similarity and metadata
Reports the collection status
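A minimal sketch of these five steps, assuming the client is built by a factory such as `chromadb.PersistentClient` (which takes a `path` argument). The function name `initialize_store`, the `client_factory` parameter, and the status message are hypothetical; the real method may differ in detail.

```python
import os
from typing import Any, Callable, Dict


def initialize_store(
    client_factory: Callable[..., Any], collection_name: str, persist_directory: str
) -> Any:
    """Recreate a collection from scratch, following the five steps listed above."""
    # 1. Create the persist directory if it doesn't exist.
    os.makedirs(persist_directory, exist_ok=True)

    # 2. Build a persistent client; with ChromaDB this would be called with
    #    client_factory=chromadb.PersistentClient.
    client = client_factory(path=persist_directory)

    # 3. Delete any existing collection with the same name. Deleting a missing
    #    collection raises, so the error is ignored on a first run.
    try:
        client.delete_collection(name=collection_name)
    except Exception:
        pass

    # 4. Create the new collection with cosine distance and descriptive metadata.
    metadata: Dict[str, str] = {
        "hnsw:space": "cosine",
        "repo": collection_name,
        "type": "github_codebase",
        "embedding_model": "all-MiniLM-L6-v2",
    }
    collection = client.create_collection(name=collection_name, metadata=metadata)

    # 5. Report the collection status (assumed message format).
    print(f"Collection '{collection_name}' ready with {collection.count()} documents")
    return collection
```

With ChromaDB installed, a call might look like `initialize_store(chromadb.PersistentClient, "facebook_react", "./vector_db")`. Passing the factory in rather than importing it keeps the sketch easy to exercise with a stub client.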

Collection metadata

Each collection is created with the following metadata:

{
    "hnsw:space": "cosine",           # Cosine similarity for semantic search
    "repo": collection_name,           # Repository identifier
    "type": "github_codebase",         # Collection type
    "embedding_model": "all-MiniLM-L6-v2"  # Model used for embeddings
}

Document metadata

Each stored document includes:

Original metadata (dict)

All metadata from the source document (e.g., path, source, branch)

doc_index (int)

Sequential index of the document in the batch

content_length (int)

Character count of the document content

Usage example

from pathlib import Path

from src.rag.vector_store import VectorStore

# Initialize the vector store (this deletes any existing "facebook_react" collection)
persist_directory = Path.home() / ".RepoRAGX" / "vector_store"
vector_store = VectorStore(
    collection_name="facebook_react",
    persist_directory=str(persist_directory)
)

# Add documents and embeddings.
# `chunks` and `embeddings` are assumed to come from earlier chunking and embedding steps.
vector_store.add_documents(chunks, embeddings)

print(f"Total documents in collection: {vector_store.collection.count()}")

Integration example

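Adapted from main.py, showing the complete vector store setup. The names repo, chunks, and embedding_manager are assumed to be defined earlier in that script:

import os
from pathlib import Path

from src.rag.vector_store import VectorStore

# Set up the persist directory (the constructor also creates it if needed)
persist_directory = Path.home() / ".RepoRAGX" / "vector_store"
os.makedirs(persist_directory, exist_ok=True)

# Initialize vector store with repository name
vector_store = VectorStore(
    collection_name=repo.replace("/", "_"),  # e.g., "facebook/react" -> "facebook_react"
    persist_directory=str(persist_directory)  # the documented parameter type is str
)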

# Generate and store embeddings
texts = [doc.page_content for doc in chunks]
embeddings = embedding_manager.generate_embeddings(texts)
vector_store.add_documents(chunks, embeddings)

Querying the collection

The collection can be queried directly using ChromaDB’s API:

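# Query for similar documents.
# `query_embedding` is assumed to be a 1-D NumPy array from the same embedding model.
results = vector_store.collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5
)

# Access results (index 0 selects the results for the first, and here only, query)
documents = results['documents'][0]
metadatas = results['metadatas'][0]
distances = results['distances'][0]  # cosine distances; smaller means more similar
ids = results['ids'][0]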
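Because the collection uses cosine space, the values in `distances` are cosine distances, which ChromaDB defines as 1 minus the cosine similarity. The sketch below turns one query's results into a list of records with a similarity score. `rank_results` is a hypothetical helper, not part of VectorStore, and it only assumes the result keys shown above.

```python
from typing import Any, Dict, List


def rank_results(
    results: Dict[str, List[List[Any]]], query_index: int = 0
) -> List[Dict[str, Any]]:
    """Flatten one query's results into dicts that carry a similarity score."""
    rows = zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    )
    return [
        # Cosine distance -> cosine similarity.
        {"id": id_, "text": text, "metadata": meta, "similarity": 1.0 - dist}
        for id_, text, meta, dist in rows
    ]
```

The order returned by ChromaDB (nearest first) is preserved, so the first record is the closest match.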

Collection management

count = vector_store.collection.count()
print(f"Documents in collection: {count}")

Document ID format

Documents are assigned unique IDs in the format:

f"doc_{uuid.uuid4().hex[:8]}_{i}"
# Example: "doc_a3f2b9c1_0", "doc_7e4d8a2f_1"

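Here i is the document's index within the batch. The random 8-character hex prefix (32 bits) makes ID collisions very unlikely, though not impossible, even when documents are added in multiple batches.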
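To put a number on how likely a clash is, the sketch below generates IDs in the documented format and estimates the chance that two IDs with the same index share a prefix, using the standard birthday-bound approximation. `make_ids` and `collision_probability` are hypothetical helpers written for illustration.

```python
import math
import uuid
from typing import List


def make_ids(count: int) -> List[str]:
    """Generate IDs in the documented format for one batch of `count` documents."""
    return [f"doc_{uuid.uuid4().hex[:8]}_{i}" for i in range(count)]


def collision_probability(batches: int, prefix_hex_chars: int = 8) -> float:
    """Approximate chance that, among `batches` IDs with the same index,
    at least two share the same random prefix (birthday bound)."""
    space = 16 ** prefix_hex_chars  # number of possible prefixes
    return 1.0 - math.exp(-batches * (batches - 1) / (2 * space))
```

For example, `collision_probability(1000)` is on the order of 1e-4, so a clash at a given index stays rare even after a thousand batches.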

Persistence and reuse

# First run: Create and populate collection
vector_store = VectorStore(
    collection_name="my_repo",
    persist_directory="./vector_db"
)
vector_store.add_documents(chunks, embeddings)

# Later: reuse the existing collection
# Note: the current implementation always creates a fresh collection, so
# constructing VectorStore again deletes the data stored above.
# Modify _initialize_store() to skip the deletion step if you need reuse.
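One way to make that modification is sketched below, assuming a ChromaDB client object: its `get_or_create_collection()` method returns the existing collection when one is present instead of failing or wiping it. The function `open_collection` and its `fresh` flag are hypothetical and not part of the current class.

```python
from typing import Any, Dict


def open_collection(
    client: Any, name: str, metadata: Dict[str, str], fresh: bool = False
) -> Any:
    """Return a collection, keeping existing data unless `fresh` is True."""
    if fresh:
        try:
            client.delete_collection(name=name)
        except Exception:
            pass  # nothing to delete on a first run
    # Returns the existing collection if present, otherwise creates it.
    return client.get_or_create_collection(name=name, metadata=metadata)
```

Defaulting `fresh` to False favors reuse. Passing `fresh=True` reproduces the current delete-then-create behavior when a full re-index is wanted.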

Error handling

try:
    vector_store.add_documents(documents, embeddings)
except ValueError as e:
    if "must match" in str(e):
        print("Document and embedding counts don't match")
    else:
        # Don't silently swallow unrelated ValueErrors
        print(f"Invalid input: {e}")
except Exception as e:
    print(f"Error adding documents: {e}")

Implementation notes

Uses ChromaDB’s PersistentClient for disk-based storage
Collections are recreated on each initialization (deletes existing collection)
Uses HNSW (Hierarchical Navigable Small World) index with cosine similarity
Embeddings are converted from NumPy arrays to Python lists for ChromaDB compatibility
All document content is stored alongside embeddings for retrieval
Thread-safe for concurrent reads (write operations should be serialized)
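If several threads need to add documents, one simple way to serialize writes is to guard add_documents() with a lock. The wrapper below is a hypothetical sketch, not part of the class. It assumes only that the wrapped object exposes add_documents(documents, embeddings).

```python
import threading
from typing import Any, List


class SerializedWriter:
    """Hypothetical wrapper that lets only one thread call add_documents at a time."""

    def __init__(self, vector_store: Any) -> None:
        self._store = vector_store
        self._lock = threading.Lock()

    def add_documents(self, documents: List[Any], embeddings: Any) -> None:
        # The lock is released automatically, even if the call raises.
        with self._lock:
            self._store.add_documents(documents, embeddings)
```

Reads can still go directly to the underlying collection. Only writes pass through the lock.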
