Overview
The EmbeddingStore class manages storage, retrieval, and caching of embeddings for documents, entities, and facts. It provides efficient batch operations and persistent storage using pickle serialization.
Constructor
Parameters:
- An embedding model instance that provides a batch_encode() method for generating embeddings.
- Path to the directory where the embedding store will be saved. The actual file will be {db_filename}/vdb_{namespace}.pkl.
- Batch size for encoding operations.
- Namespace identifier for the store (e.g., “chunk”, “entity”, “fact”). Used for hash ID prefixing and file naming.
- If True, ignores any existing stored data and starts fresh.
Example
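A minimal sketch of the model contract and of what a batch size means for encoding, assuming batch_encode() accepts a list of strings and returns one vector per string. ToyEmbeddingModel and encode_in_batches are hypothetical names used only for illustration; they are not part of EmbeddingStore.

```python
from typing import List, Sequence


class ToyEmbeddingModel:
    """Hypothetical stand-in: any object exposing batch_encode() would do."""

    def __init__(self) -> None:
        self.calls = 0  # number of batches encoded so far

    def batch_encode(self, texts: Sequence[str]) -> List[List[float]]:
        self.calls += 1
        # Toy 2-dimensional "embedding": text length and vowel count.
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


def encode_in_batches(model, texts: Sequence[str], batch_size: int) -> List[List[float]]:
    """Encode texts in slices of at most batch_size items."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(model.batch_encode(texts[start:start + batch_size]))
    return vectors


model = ToyEmbeddingModel()
vectors = encode_in_batches(model, ["alpha", "beta", "gamma"], batch_size=2)
```

With three texts and a batch size of 2, the toy model is called twice: once for the first two texts and once for the remaining one.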
Insertion Methods
insert_strings
Inserts text strings and generates embeddings for them.
Parameters:
- List of text strings to insert.
- Whether to generate embeddings. If False, only stores the text with None embeddings.
Behavior:
- Automatically skips non-string values
- Deduplicates based on content hash
- Only encodes and stores new (unseen) texts
Example
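The skip, deduplicate, and encode-only-new behavior can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: make_hash_id and select_new_texts are hypothetical helpers, and the hash scheme (namespace prefix plus a SHA-256 digest of the content) is assumed, since the actual hash function is not specified here.

```python
import hashlib
from typing import Any, Dict, Iterable, Set


def make_hash_id(text: str, namespace: str) -> str:
    # Assumption: namespace prefix + SHA-256 of the content; the real scheme may differ.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{namespace}-{digest}"


def select_new_texts(texts: Iterable[Any], known_ids: Set[str], namespace: str) -> Dict[str, str]:
    """Return {hash_id: text} for string inputs whose hash ID is not already known."""
    new_items: Dict[str, str] = {}
    for text in texts:
        if not isinstance(text, str):
            continue  # skip non-string values
        hash_id = make_hash_id(text, namespace)
        if hash_id in known_ids or hash_id in new_items:
            continue  # same content already stored or already queued
        new_items[hash_id] = text
    return new_items


known = {make_hash_id("Paris is in France.", "fact")}
batch = ["Paris is in France.", "Rome is in Italy.", 42, None, "Rome is in Italy."]
to_encode = select_new_texts(batch, known, "fact")
```

Only the texts in to_encode would need to go to the embedding model; everything else is either not a string or already present.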
insert_chunk_dicts
Inserts chunks with metadata and generates embeddings.
Parameters:
- List of chunk metadata dictionaries. Each should contain the fields that make_chunk_content() uses to construct the text.
- Extraction method used (“openie”, “episodic”, “temporal”). Determines how chunk content is formatted.
- Whether to generate embeddings.
- For the temporal method: whether to remove time qualifiers before embedding.
Returns: Dictionary mapping hash IDs to chunk data (content and metadata).
Example
Retrieval Methods
get_row
Retrieves a single row by hash ID.
Parameters:
- The hash ID of the row to retrieve.
Returns a dictionary containing:
- hash_id: The unique identifier
- content: The text content
- metadata: Additional metadata (if available)
Example
get_rows
Retrieves multiple rows by hash IDs (batch operation).
Parameters:
- List of hash IDs to retrieve.
Returns: Dictionary mapping hash IDs to row data dictionaries.
Example
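A minimal sketch of the batch lookup shape, using a plain dictionary in place of a real store. The sample rows and get_rows_sketch are hypothetical, and raising KeyError for an unknown hash ID is an assumption; the actual method may handle missing IDs differently.

```python
from typing import Dict, List

# Hypothetical in-memory rows, shaped like the dictionaries described for get_row.
hash_id_to_row: Dict[str, dict] = {
    "chunk-1": {"hash_id": "chunk-1", "content": "First passage."},
    "chunk-2": {"hash_id": "chunk-2", "content": "Second passage."},
    "chunk-3": {"hash_id": "chunk-3", "content": "Third passage."},
}


def get_rows_sketch(hash_ids: List[str]) -> Dict[str, dict]:
    """One call, one result dictionary keyed by hash ID."""
    # Assumption: an unknown hash ID raises KeyError here.
    return {hash_id: hash_id_to_row[hash_id] for hash_id in hash_ids}


wanted = ["chunk-3", "chunk-1"]
rows = get_rows_sketch(wanted)
contents = [rows[h]["content"] for h in wanted]
```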
get_all_ids
Returns all hash IDs in the store.
Returns: List of all hash IDs in the store.
Example
get_text_for_all_rows
Returns a deep copy of all rows with their content.
Returns: Deep copy of the hash_id_to_row dictionary. Safe to modify without affecting the store.
Use this when you need to modify the returned data. For read-only access, use get_hash_id_to_row_readonly() for better performance.
Example
get_hash_id_to_row_readonly
Returns a direct reference to the internal hash_id_to_row dictionary.
Returns: Direct reference to the internal dictionary. Do not modify!
Example
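The difference between a deep copy and a direct reference can be illustrated with a plain dictionary. The two functions below are hypothetical stand-ins that mirror the documented behavior of get_text_for_all_rows and get_hash_id_to_row_readonly; they are not the library's code.

```python
import copy

# Hypothetical internal state of a store.
hash_id_to_row = {"chunk-1": {"hash_id": "chunk-1", "content": "original text"}}


def deep_copy_rows() -> dict:
    # Independent copy: nested dictionaries are duplicated too.
    return copy.deepcopy(hash_id_to_row)


def readonly_rows() -> dict:
    # Same object as the internal dictionary: no copying cost, but edits leak back.
    return hash_id_to_row


safe = deep_copy_rows()
safe["chunk-1"]["content"] = "edited copy"  # does not touch the internal state

shared = readonly_rows()
```

The deep copy costs time and memory proportional to the size of the store, which is why the direct reference is the cheaper choice when nothing will be modified.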
Embedding Retrieval Methods
get_embedding
Retrieves the embedding for a single hash ID.
Parameters:
- Hash ID of the item.
- Data type for the returned embedding array.
Returns: 1D numpy array containing the embedding vector.
Example
get_embeddings
Retrieves embeddings for multiple hash IDs (batch operation).
Parameters:
- List of hash IDs.
- Data type for the returned embedding arrays.
Returns: List of embedding arrays, one for each hash ID.
Example
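A sketch of batch embedding retrieval with a requested data type, assuming embeddings are held in a list parallel to the hash IDs. The sample data, index_of, get_embeddings_sketch, and the float32 default are all assumptions for illustration.

```python
import numpy as np

# Hypothetical store contents: hash IDs and embeddings kept in parallel lists.
hash_ids = ["entity-a", "entity-b", "entity-c"]
embeddings = [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])]
index_of = {hash_id: i for i, hash_id in enumerate(hash_ids)}


def get_embeddings_sketch(ids, dtype=np.float32):
    """Return one 1D array per requested hash ID, cast to dtype, in request order."""
    return [np.asarray(embeddings[index_of[h]], dtype=dtype) for h in ids]


vectors = get_embeddings_sketch(["entity-c", "entity-a"])
```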
Storage Management
Internal Methods
The following methods are used internally and typically don’t need to be called directly:
_load_data
Loads stored embeddings from disk.
_save_data
Saves current embeddings to disk. This is called automatically after insertions. Manual calls are typically not needed.
_upsert
Inserts or updates embeddings in the store.
Properties and Attributes
embeddings
hash_ids
texts
metadata
Error Handling & Validation
Embedding Validation
The EmbeddingStore automatically validates and fixes problematic embeddings:
- None embeddings: Replaced with zero vectors
- NaN/Inf values: Replaced with zero vectors
- Empty arrays: Replaced with zero vectors
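The three rules above can be sketched as a single function. This is an illustration, not the library's validation code: sanitize_embedding is a hypothetical name, and passing the target dimension explicitly is an assumption, since how the store determines the size of the zero vector is not described here.

```python
import numpy as np


def sanitize_embedding(embedding, dim: int) -> np.ndarray:
    """Return a usable vector, falling back to zeros for problematic input."""
    if embedding is None:
        return np.zeros(dim, dtype=np.float32)  # None embedding
    arr = np.asarray(embedding, dtype=np.float32)
    if arr.size == 0:
        return np.zeros(dim, dtype=np.float32)  # empty array
    if not np.all(np.isfinite(arr)):
        return np.zeros(dim, dtype=np.float32)  # contains NaN or Inf
    return arr


fixed_none = sanitize_embedding(None, dim=3)
fixed_nan = sanitize_embedding([1.0, float("nan")], dim=2)
fixed_empty = sanitize_embedding([], dim=2)
kept = sanitize_embedding([1.0, 2.0], dim=2)
```

A zero vector has no direction, so its cosine similarity with any query is undefined or zero depending on the implementation. Items repaired this way are unlikely to rank in similarity search.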
Complete Example
Storage Format
The EmbeddingStore saves data as a pickle file with the following structure:
{db_filename}/vdb_{namespace}.pkl
Example file paths:
- Chunks: ./embeddings/chunks/vdb_chunk.pkl
- Entities: ./embeddings/entities/vdb_entity.pkl
- Facts: ./embeddings/facts/vdb_fact.pkl
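A short sketch of how such a path is formed and how a pickle round trip works. store_path is a hypothetical helper, and the payload keys are an assumed layout chosen for illustration; the actual contents of the pickle file are not specified here.

```python
import os
import pickle
import tempfile


def store_path(db_filename: str, namespace: str) -> str:
    """Build {db_filename}/vdb_{namespace}.pkl."""
    return os.path.join(db_filename, f"vdb_{namespace}.pkl")


with tempfile.TemporaryDirectory() as root:
    path = store_path(os.path.join(root, "chunks"), "chunk")
    os.makedirs(os.path.dirname(path), exist_ok=True)

    # Assumed payload layout, for illustration only.
    payload = {"hash_ids": ["chunk-1"], "texts": ["hello"], "embeddings": [[0.0, 1.0]]}

    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

    file_name = os.path.basename(path)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.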
Best Practices
Batch Operations
Always use batch methods (get_rows, get_embeddings) instead of loops for better performance.
Read-Only Access
Use get_hash_id_to_row_readonly() for read-only operations to avoid unnecessary copying.
Incremental Updates
The store automatically handles incremental updates - only new items are encoded.
Namespace Organization
Use clear namespaces: “chunk”, “entity”, “fact”, “gist”, “verbatim” for different content types.
Performance Tips
- Batch size: Tune batch_size based on your GPU memory
- Deduplication: The store automatically deduplicates, so don’t pre-filter
- Memory usage: Large stores load all embeddings into memory - monitor RAM usage
- Persistence: Data is automatically saved after insertions
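For the memory point above, a rough lower bound on RAM can be estimated from the number of vectors, their dimension, and the size of each value. The helper name and the sample figures (one million 768-dimensional float32 vectors) are assumptions for illustration, and the estimate ignores the stored texts, metadata, and Python object overhead.

```python
def embedding_memory_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of the embedding matrix; 4 bytes per value corresponds to float32."""
    return num_vectors * dim * bytes_per_value


raw_bytes = embedding_memory_bytes(1_000_000, 768)
raw_gib = raw_bytes / 2**30  # roughly 2.9 GiB for this example
```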