
Overview

The EmbeddingStore class manages storage, retrieval, and caching of embeddings for documents, entities, and facts. It provides efficient batch operations and persistent storage using pickle serialization.

Constructor

EmbeddingStore(
    embedding_model,
    db_filename: str,
    batch_size: int,
    namespace: str,
    from_scratch: bool = False
)
  • embedding_model (BaseEmbeddingModel, required): An embedding model instance that provides a batch_encode() method for generating embeddings.
  • db_filename (str, required): Path to the directory where the embedding store will be saved. The actual file will be {db_filename}/vdb_{namespace}.pkl.
  • batch_size (int, required): Batch size for encoding operations.
  • namespace (str, required): Namespace identifier for the store (e.g., “chunk”, “entity”, “fact”). Used for hash ID prefixing and file naming.
  • from_scratch (bool, default False): If True, ignores any existing stored data and starts fresh.

Example

from remem.embedding_store import EmbeddingStore
from remem.embedding_model import get_embedding_client

# Initialize embedding model
embedding_model = get_embedding_client("nvidia/NV-Embed-v2")()

# Create embedding store
store = EmbeddingStore(
    embedding_model=embedding_model,
    db_filename="./embeddings/chunks",
    batch_size=16,
    namespace="chunk",
    from_scratch=False
)

Insertion Methods

insert_strings

Inserts text strings and generates embeddings for them.
insert_strings(
    texts: List[str],
    embed: bool = True
) -> None
  • texts (List[str], required): List of text strings to insert.
  • embed (bool, default True): Whether to generate embeddings. If False, only stores the text with None embeddings.

Behavior:
  • Automatically skips non-string values
  • Deduplicates based on content hash
  • Only encodes and stores new (unseen) texts

Example

texts = [
    "Paris is the capital of France.",
    "London is the capital of the UK.",
    "Berlin is the capital of Germany."
]

store.insert_strings(texts)
print(f"Total embeddings: {len(store.embeddings)}")
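Deduplication keys on a content hash prefixed with the store’s namespace. The exact hash function is an implementation detail of EmbeddingStore; the sketch below uses MD5 purely to illustrate the idea, and hash_for_namespace is a hypothetical helper, not part of the API:

```python
import hashlib

def hash_for_namespace(text: str, namespace: str) -> str:
    """Illustrative only: derive a namespaced hash ID from text content."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"{namespace}-{digest}"

# The same text always maps to the same ID, so re-inserting it is a no-op
a = hash_for_namespace("Paris is the capital of France.", "chunk")
b = hash_for_namespace("Paris is the capital of France.", "chunk")
assert a == b and a.startswith("chunk-")
```

This is why calling insert_strings twice with overlapping lists only encodes the unseen texts.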

insert_chunk_dicts

Inserts chunks with metadata and generates embeddings.
insert_chunk_dicts(
    chunk_meta: List[Dict],
    extract_method: Optional[str] = None,
    embed: bool = True,
    remove_qualifiers: bool = True
) -> Dict
  • chunk_meta (List[Dict], required): List of chunk metadata dictionaries. Each should contain the fields used by make_chunk_content() to construct the text.
  • extract_method (str, default None): Extraction method used (“openie”, “episodic”, “temporal”). Determines how chunk content is formatted.
  • embed (bool, default True): Whether to generate embeddings.
  • remove_qualifiers (bool, default True): For the temporal method, whether to remove time qualifiers before embedding.
  • Returns (Dict): Dictionary mapping hash IDs to chunk data (content and metadata).

Example

chunk_meta = [
    {
        "content": "Paris is the capital of France.",
        "date": "2024-01-15",
        "role": "user"
    },
    {
        "content": "It is known for the Eiffel Tower.",
        "date": "2024-01-15",
        "role": "assistant"
    }
]

nodes_dict = store.insert_chunk_dicts(
    chunk_meta,
    extract_method="episodic"
)

print(f"Inserted {len(nodes_dict)} chunks")

Retrieval Methods

get_row

Retrieves a single row by hash ID.
get_row(hash_id: str) -> Dict
  • hash_id (str, required): The hash ID of the row to retrieve.
  • Returns (Dict): Dictionary containing:
      • hash_id: The unique identifier
      • content: The text content
      • metadata: Additional metadata (if available)

Example

row = store.get_row("chunk-abc123")
print(row["content"])
print(row.get("metadata", {}))

get_rows

Retrieves multiple rows by hash IDs (batch operation).
get_rows(hash_ids: List[str]) -> Dict[str, Dict]
  • hash_ids (List[str], required): List of hash IDs to retrieve.
  • Returns (Dict[str, Dict]): Dictionary mapping hash IDs to row data dictionaries.

Example

hash_ids = ["chunk-abc123", "chunk-def456"]
rows = store.get_rows(hash_ids)

for hash_id, row in rows.items():
    print(f"{hash_id}: {row['content'][:50]}...")

get_all_ids

Returns all hash IDs in the store.
get_all_ids() -> List[str]
  • Returns (List[str]): List of all hash IDs in the store.

Example

all_ids = store.get_all_ids()
print(f"Total items: {len(all_ids)}")
print(f"First ID: {all_ids[0]}")

get_text_for_all_rows

Returns a deep copy of all rows with their content.
get_text_for_all_rows() -> Dict[str, Dict]
  • Returns (Dict[str, Dict]): Deep copy of the hash_id_to_row dictionary. Safe to modify without affecting the store.

Use this when you need to modify the returned data. For read-only access, use get_hash_id_to_row_readonly() for better performance.

Example

all_rows = store.get_text_for_all_rows()
for hash_id, row in all_rows.items():
    row["content"] = row["content"].upper()  # Safe to modify

get_hash_id_to_row_readonly

Returns a direct reference to the internal hash_id_to_row dictionary.
get_hash_id_to_row_readonly() -> Dict[str, Dict]
  • Returns (Dict[str, Dict]): Direct reference to the internal dictionary. Do not modify!

This returns a reference to the internal data structure. Do not modify the returned dictionary or its contents! Use get_text_for_all_rows() if you need to modify the data.

Example

# Read-only access (more efficient)
rows_ref = store.get_hash_id_to_row_readonly()
count = len(rows_ref)
print(f"Total rows: {count}")

# Don't do this!
# rows_ref["some_id"]["content"] = "modified"  # BAD!

Embedding Retrieval Methods

get_embedding

Retrieves embedding for a single hash ID.
get_embedding(
    hash_id: str,
    dtype = np.float32
) -> np.ndarray
  • hash_id (str, required): Hash ID of the item.
  • dtype (numpy dtype, default np.float32): Data type for the returned embedding array.
  • Returns (np.ndarray): 1D numpy array containing the embedding vector.

Example

import numpy as np

emb = store.get_embedding("chunk-abc123")
print(f"Embedding shape: {emb.shape}")
print(f"Embedding norm: {np.linalg.norm(emb)}")

get_embeddings

Retrieves embeddings for multiple hash IDs (batch operation).
get_embeddings(
    hash_ids: List[str],
    dtype = np.float32
) -> List[np.ndarray]
  • hash_ids (List[str], required): List of hash IDs.
  • dtype (numpy dtype, default np.float32): Data type for the returned embedding arrays.
  • Returns (List[np.ndarray]): List of embedding arrays, one for each hash ID.

Example

import numpy as np

hash_ids = ["chunk-abc123", "chunk-def456", "chunk-ghi789"]
embs = store.get_embeddings(hash_ids)

embs_array = np.array(embs)
print(f"Embeddings shape: {embs_array.shape}")

# Dot-product similarity (equals cosine similarity only if embeddings are L2-normalized)
sim = np.dot(embs[0], embs[1])
print(f"Similarity: {sim}")
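If you are unsure whether your embedding model returns unit-length vectors, normalize before comparing. This helper is plain numpy and independent of EmbeddingStore:

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity that is safe for non-normalized embeddings."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Identical directions score 1.0 regardless of magnitude
print(cosine_sim([1.0, 0.0], [2.0, 0.0]))  # → 1.0
```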

Storage Management

Internal Methods

The following methods are used internally and typically don’t need to be called directly:

_load_data

Loads stored embeddings from disk.
_load_data(from_scratch: bool) -> None

_save_data

Saves current embeddings to disk.
_save_data() -> None
This is called automatically after insertions. Manual calls are typically not needed.

_upsert

Inserts or updates embeddings in the store.
_upsert(
    hash_ids: List[str],
    texts: List[str],
    embeddings: List[np.ndarray],
    metadata: Optional[List[Dict]] = None
) -> None

Properties and Attributes

embeddings

store.embeddings  # List[np.ndarray]
Direct access to the list of all embeddings.

hash_ids

store.hash_ids  # List[str]
Direct access to the list of all hash IDs.

texts

store.texts  # List[str]
Direct access to the list of all text contents.

metadata

store.metadata  # List[Dict]
Direct access to the list of all metadata dictionaries.
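These four lists are parallel: index i in each refers to the same stored item, so they can be iterated together. A sketch using plain lists to stand in for a populated store:

```python
# Stand-ins for store.hash_ids and store.texts (parallel lists)
hash_ids = ["chunk-a", "chunk-b"]
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

# Because the lists stay aligned, zip() pairs each ID with its text
pairs = dict(zip(hash_ids, texts))
print(pairs["chunk-a"])  # → Paris is the capital of France.
```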

Error Handling & Validation

Embedding Validation

The EmbeddingStore automatically validates and fixes problematic embeddings:
  • None embeddings: Replaced with zero vectors
  • NaN/Inf values: Replaced with zero vectors
  • Empty arrays: Replaced with zero vectors
# This is handled automatically; non-string values are skipped on insert
texts = ["Valid text", None, "Another valid text"]
store.insert_strings(texts)  # None is skipped; only the two strings are stored
Corrupted embeddings are automatically replaced with zero vectors and logged. Check logs for warnings about replaced embeddings.
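The validation logic is roughly equivalent to the following sketch (not the actual implementation; sanitize_embedding is a hypothetical name):

```python
import numpy as np

def sanitize_embedding(emb, dim: int) -> np.ndarray:
    """Replace None, empty, or NaN/Inf embeddings with a zero vector."""
    if emb is None:
        return np.zeros(dim, dtype=np.float32)
    arr = np.asarray(emb, dtype=np.float32)
    if arr.size == 0 or not np.all(np.isfinite(arr)):
        return np.zeros(dim, dtype=np.float32)
    return arr

print(sanitize_embedding(None, 3))                # zero vector
print(sanitize_embedding([1.0, np.nan, 2.0], 3))  # zero vector (NaN detected)
```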

Complete Example

from remem.embedding_store import EmbeddingStore
from remem.embedding_model import get_embedding_client
import numpy as np

# Initialize
embedding_model = get_embedding_client("nvidia/NV-Embed-v2")()

chunk_store = EmbeddingStore(
    embedding_model=embedding_model,
    db_filename="./data/chunk_embeddings",
    batch_size=32,
    namespace="chunk",
    from_scratch=False
)

# Insert documents
docs = [
    "Paris is the capital of France.",
    "London is the capital of the UK.",
    "Berlin is the capital of Germany."
]

chunk_store.insert_strings(docs)

# Get all IDs
all_ids = chunk_store.get_all_ids()
print(f"Stored {len(all_ids)} documents")

# Retrieve specific documents
rows = chunk_store.get_rows(all_ids[:2])
for hash_id, row in rows.items():
    print(f"Document: {row['content']}")

# Get embeddings
embs = chunk_store.get_embeddings(all_ids)
embs_array = np.array(embs)

# Compute similarity matrix
sim_matrix = np.dot(embs_array, embs_array.T)
print(f"Similarity matrix shape: {sim_matrix.shape}")

# Find most similar to first document
first_doc_sims = sim_matrix[0]
top_idx = np.argsort(first_doc_sims)[::-1][1]  # Exclude self
print(f"Most similar to '{docs[0]}':")
print(f"  {docs[top_idx]} (similarity: {first_doc_sims[top_idx]:.3f})")

Storage Format

The EmbeddingStore saves data as a pickle file with the following structure:
{
    "hash_ids": List[str],
    "texts": List[str],
    "embeddings": List[np.ndarray],
    "metadata": List[Dict]  # Optional
}
File location: {db_filename}/vdb_{namespace}.pkl

Example file paths:

  • Chunks: ./embeddings/chunks/vdb_chunk.pkl
  • Entities: ./embeddings/entities/vdb_entity.pkl
  • Facts: ./embeddings/facts/vdb_fact.pkl
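Because the file is plain pickle, it can be inspected with the standard library alone. The sketch below writes and reads a file in the documented structure, using a temporary directory rather than a real store:

```python
import os
import pickle
import tempfile

import numpy as np

# A minimal payload matching the documented pickle structure
data = {
    "hash_ids": ["chunk-abc123"],
    "texts": ["Paris is the capital of France."],
    "embeddings": [np.zeros(4, dtype=np.float32)],
    "metadata": [{"date": "2024-01-15"}],
}

with tempfile.TemporaryDirectory() as db_dir:
    path = os.path.join(db_dir, "vdb_chunk.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded["texts"][0])  # → Paris is the capital of France.
```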

Best Practices

Batch Operations

Always use batch methods (get_rows, get_embeddings) instead of loops for better performance.

Read-Only Access

Use get_hash_id_to_row_readonly() for read-only operations to avoid unnecessary copying.

Incremental Updates

The store automatically handles incremental updates - only new items are encoded.

Namespace Organization

Use clear namespaces: “chunk”, “entity”, “fact”, “gist”, “verbatim” for different content types.

Performance Tips

  1. Batch size: Tune batch_size based on your GPU memory
  2. Deduplication: The store automatically deduplicates, so don’t pre-filter
  3. Memory usage: Large stores load all embeddings into memory - monitor RAM usage
  4. Persistence: Data is automatically saved after insertions
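For tip 3, a quick way to estimate RAM: float32 embeddings take 4 bytes per dimension, so memory is roughly n_items × dim × 4 bytes. Assuming NV-Embed-v2’s 4096-dimensional output:

```python
def embedding_ram_mb(n_items: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate in-memory size of float32 embeddings, in megabytes."""
    return n_items * dim * bytes_per_value / (1024 ** 2)

# e.g. 100k vectors at 4096 dims ≈ 1.5 GB of embeddings alone
print(f"{embedding_ram_mb(100_000, 4096):.0f} MB")
```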
