
Overview

The EmbeddingStore class manages storage, retrieval, and caching of embeddings for documents, entities, and facts. It provides efficient batch operations and persistent storage using pickle serialization.

Constructor

EmbeddingStore(
    embedding_model,
    db_filename: str,
    batch_size: int,
    namespace: str,
    from_scratch: bool = False
)
  • embedding_model (BaseEmbeddingModel, required): An embedding model instance that provides a batch_encode() method for generating embeddings.
  • db_filename (str, required): Path to the directory where the embedding store will be saved. The actual file will be {db_filename}/vdb_{namespace}.pkl.
  • batch_size (int, required): Batch size for encoding operations.
  • namespace (str, required): Namespace identifier for the store (e.g., “chunk”, “entity”, “fact”). Used for hash ID prefixing and file naming.
  • from_scratch (bool, default False): If True, ignores any existing stored data and starts fresh.

Example

from remem.embedding_store import EmbeddingStore
from remem.embedding_model import get_embedding_client

# Initialize embedding model
embedding_model = get_embedding_client("nvidia/NV-Embed-v2")()

# Create embedding store
store = EmbeddingStore(
    embedding_model=embedding_model,
    db_filename="./embeddings/chunks",
    batch_size=16,
    namespace="chunk",
    from_scratch=False
)

Insertion Methods

insert_strings

Inserts text strings and generates embeddings for them.
insert_strings(
    texts: List[str],
    embed: bool = True
) -> None
  • texts (List[str], required): List of text strings to insert.
  • embed (bool, default True): Whether to generate embeddings. If False, only stores the text with None embeddings.

Behavior:
  • Automatically skips non-string values
  • Deduplicates based on content hash
  • Only encodes and stores new (unseen) texts

Example

texts = [
    "Paris is the capital of France.",
    "London is the capital of the UK.",
    "Berlin is the capital of Germany."
]

store.insert_strings(texts)
print(f"Total embeddings: {len(store.embeddings)}")
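Deduplication keys on a content hash prefixed with the store’s namespace. The exact hash function is an implementation detail of EmbeddingStore; the sketch below uses MD5 purely to illustrate the idea, and hash_for_namespace is a hypothetical helper, not part of the API:

```python
import hashlib

def hash_for_namespace(text: str, namespace: str) -> str:
    """Illustrative only: derive a namespaced hash ID from text content."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"{namespace}-{digest}"

# The same text always maps to the same ID, so re-inserting it is a no-op
a = hash_for_namespace("Paris is the capital of France.", "chunk")
b = hash_for_namespace("Paris is the capital of France.", "chunk")
assert a == b and a.startswith("chunk-")
```

This is why calling insert_strings twice with overlapping lists only encodes the unseen texts.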

insert_chunk_dicts

Inserts chunks with metadata and generates embeddings.
insert_chunk_dicts(
    chunk_meta: List[Dict],
    extract_method: Optional[str] = None,
    embed: bool = True,
    remove_qualifiers: bool = True
) -> Dict
  • chunk_meta (List[Dict], required): List of chunk metadata dictionaries. Each should contain the fields used by make_chunk_content() to construct the text.
  • extract_method (str, default None): Extraction method used (“openie”, “episodic”, “temporal”). Determines how chunk content is formatted.
  • embed (bool, default True): Whether to generate embeddings.
  • remove_qualifiers (bool, default True): For the temporal method, whether to remove time qualifiers before embedding.
  • Returns (Dict): Dictionary mapping hash IDs to chunk data (content and metadata).

Example

chunk_meta = [
    {
        "content": "Paris is the capital of France.",
        "date": "2024-01-15",
        "role": "user"
    },
    {
        "content": "It is known for the Eiffel Tower.",
        "date": "2024-01-15",
        "role": "assistant"
    }
]

nodes_dict = store.insert_chunk_dicts(
    chunk_meta,
    extract_method="episodic"
)

print(f"Inserted {len(nodes_dict)} chunks")

Retrieval Methods

get_row

Retrieves a single row by hash ID.
get_row(hash_id: str) -> Dict
  • hash_id (str, required): The hash ID of the row to retrieve.
  • Returns (Dict): Dictionary containing:
      • hash_id: The unique identifier
      • content: The text content
      • metadata: Additional metadata (if available)

Example

row = store.get_row("chunk-abc123")
print(row["content"])
print(row.get("metadata", {}))

get_rows

Retrieves multiple rows by hash IDs (batch operation).
get_rows(hash_ids: List[str]) -> Dict[str, Dict]
  • hash_ids (List[str], required): List of hash IDs to retrieve.
  • Returns (Dict[str, Dict]): Dictionary mapping hash IDs to row data dictionaries.

Example

hash_ids = ["chunk-abc123", "chunk-def456"]
rows = store.get_rows(hash_ids)

for hash_id, row in rows.items():
    print(f"{hash_id}: {row['content'][:50]}...")

get_all_ids

Returns all hash IDs in the store.
get_all_ids() -> List[str]
  • Returns (List[str]): List of all hash IDs in the store.

Example

all_ids = store.get_all_ids()
print(f"Total items: {len(all_ids)}")
print(f"First ID: {all_ids[0]}")

get_text_for_all_rows

Returns a deep copy of all rows with their content.
get_text_for_all_rows() -> Dict[str, Dict]
  • Returns (Dict[str, Dict]): Deep copy of the hash_id_to_row dictionary. Safe to modify without affecting the store.

Use this when you need to modify the returned data. For read-only access, use get_hash_id_to_row_readonly() for better performance.

Example

all_rows = store.get_text_for_all_rows()
for hash_id, row in all_rows.items():
    row["content"] = row["content"].upper()  # Safe to modify

get_hash_id_to_row_readonly

Returns a direct reference to the internal hash_id_to_row dictionary.
get_hash_id_to_row_readonly() -> Dict[str, Dict]
  • Returns (Dict[str, Dict]): Direct reference to the internal dictionary. Do not modify!

This returns a reference to the internal data structure. Do not modify the returned dictionary or its contents! Use get_text_for_all_rows() if you need to modify the data.

Example

# Read-only access (more efficient)
rows_ref = store.get_hash_id_to_row_readonly()
count = len(rows_ref)
print(f"Total rows: {count}")

# Don't do this!
# rows_ref["some_id"]["content"] = "modified"  # BAD!

Embedding Retrieval Methods

get_embedding

Retrieves embedding for a single hash ID.
get_embedding(
    hash_id: str,
    dtype = np.float32
) -> np.ndarray
  • hash_id (str, required): Hash ID of the item.
  • dtype (numpy dtype, default np.float32): Data type for the returned embedding array.
  • Returns (np.ndarray): 1D numpy array containing the embedding vector.

Example

import numpy as np

emb = store.get_embedding("chunk-abc123")
print(f"Embedding shape: {emb.shape}")
print(f"Embedding norm: {np.linalg.norm(emb)}")

get_embeddings

Retrieves embeddings for multiple hash IDs (batch operation).
get_embeddings(
    hash_ids: List[str],
    dtype = np.float32
) -> List[np.ndarray]
  • hash_ids (List[str], required): List of hash IDs.
  • dtype (numpy dtype, default np.float32): Data type for the returned embedding arrays.
  • Returns (List[np.ndarray]): List of embedding arrays, one for each hash ID.

Example

import numpy as np

hash_ids = ["chunk-abc123", "chunk-def456", "chunk-ghi789"]
embs = store.get_embeddings(hash_ids)

embs_array = np.array(embs)
print(f"Embeddings shape: {embs_array.shape}")

# Dot-product similarity (equals cosine similarity only if embeddings are L2-normalized)
sim = np.dot(embs[0], embs[1])
print(f"Similarity: {sim}")
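If you are unsure whether your embedding model returns unit-length vectors, normalize before comparing. This helper is plain numpy and independent of EmbeddingStore:

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity that is safe for non-normalized embeddings."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Identical directions score 1.0 regardless of magnitude
print(cosine_sim([1.0, 0.0], [2.0, 0.0]))  # → 1.0
```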

Storage Management

Internal Methods

The following methods are used internally and typically don’t need to be called directly:

_load_data

Loads stored embeddings from disk.
_load_data(from_scratch: bool) -> None

_save_data

Saves current embeddings to disk.
_save_data() -> None
This is called automatically after insertions. Manual calls are typically not needed.

_upsert

Inserts or updates embeddings in the store.
_upsert(
    hash_ids: List[str],
    texts: List[str],
    embeddings: List[np.ndarray],
    metadata: Optional[List[Dict]] = None
) -> None

Properties and Attributes

embeddings

store.embeddings  # List[np.ndarray]
Direct access to the list of all embeddings.

hash_ids

store.hash_ids  # List[str]
Direct access to the list of all hash IDs.

texts

store.texts  # List[str]
Direct access to the list of all text contents.

metadata

store.metadata  # List[Dict]
Direct access to the list of all metadata dictionaries.
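These four lists are parallel: index i in each refers to the same stored item, so they can be iterated together. A sketch using plain lists to stand in for a populated store:

```python
# Stand-ins for store.hash_ids and store.texts (parallel lists)
hash_ids = ["chunk-a", "chunk-b"]
texts = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

# Because the lists stay aligned, zip() pairs each ID with its text
pairs = dict(zip(hash_ids, texts))
print(pairs["chunk-a"])  # → Paris is the capital of France.
```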

Error Handling & Validation

Embedding Validation

The EmbeddingStore automatically validates and fixes problematic embeddings:
  • None embeddings: Replaced with zero vectors
  • NaN/Inf values: Replaced with zero vectors
  • Empty arrays: Replaced with zero vectors
# This is handled automatically; non-string values are skipped on insert
texts = ["Valid text", None, "Another valid text"]
store.insert_strings(texts)  # None is skipped; only the two strings are stored
Corrupted embeddings are automatically replaced with zero vectors and logged. Check logs for warnings about replaced embeddings.
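The validation logic is roughly equivalent to the following sketch (not the actual implementation; sanitize_embedding is a hypothetical name):

```python
import numpy as np

def sanitize_embedding(emb, dim: int) -> np.ndarray:
    """Replace None, empty, or NaN/Inf embeddings with a zero vector."""
    if emb is None:
        return np.zeros(dim, dtype=np.float32)
    arr = np.asarray(emb, dtype=np.float32)
    if arr.size == 0 or not np.all(np.isfinite(arr)):
        return np.zeros(dim, dtype=np.float32)
    return arr

print(sanitize_embedding(None, 3))                # zero vector
print(sanitize_embedding([1.0, np.nan, 2.0], 3))  # zero vector (NaN detected)
```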

Complete Example

from remem.embedding_store import EmbeddingStore
from remem.embedding_model import get_embedding_client
import numpy as np

# Initialize
embedding_model = get_embedding_client("nvidia/NV-Embed-v2")()

chunk_store = EmbeddingStore(
    embedding_model=embedding_model,
    db_filename="./data/chunk_embeddings",
    batch_size=32,
    namespace="chunk",
    from_scratch=False
)

# Insert documents
docs = [
    "Paris is the capital of France.",
    "London is the capital of the UK.",
    "Berlin is the capital of Germany."
]

chunk_store.insert_strings(docs)

# Get all IDs
all_ids = chunk_store.get_all_ids()
print(f"Stored {len(all_ids)} documents")

# Retrieve specific documents
rows = chunk_store.get_rows(all_ids[:2])
for hash_id, row in rows.items():
    print(f"Document: {row['content']}")

# Get embeddings
embs = chunk_store.get_embeddings(all_ids)
embs_array = np.array(embs)

# Compute similarity matrix
sim_matrix = np.dot(embs_array, embs_array.T)
print(f"Similarity matrix shape: {sim_matrix.shape}")

# Find most similar to first document
first_doc_sims = sim_matrix[0]
top_idx = np.argsort(first_doc_sims)[::-1][1]  # Exclude self
print(f"Most similar to '{docs[0]}':")
print(f"  {docs[top_idx]} (similarity: {first_doc_sims[top_idx]:.3f})")

Storage Format

The EmbeddingStore saves data as a pickle file with the following structure:
{
    "hash_ids": List[str],
    "texts": List[str],
    "embeddings": List[np.ndarray],
    "metadata": List[Dict]  # Optional
}
File location: {db_filename}/vdb_{namespace}.pkl

Example file paths:

  • Chunks: ./embeddings/chunks/vdb_chunk.pkl
  • Entities: ./embeddings/entities/vdb_entity.pkl
  • Facts: ./embeddings/facts/vdb_fact.pkl
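Because the file is plain pickle, it can be inspected with the standard library alone. The sketch below writes and reads a file in the documented structure, using a temporary directory rather than a real store:

```python
import os
import pickle
import tempfile

import numpy as np

# A minimal payload matching the documented pickle structure
data = {
    "hash_ids": ["chunk-abc123"],
    "texts": ["Paris is the capital of France."],
    "embeddings": [np.zeros(4, dtype=np.float32)],
    "metadata": [{"date": "2024-01-15"}],
}

with tempfile.TemporaryDirectory() as db_dir:
    path = os.path.join(db_dir, "vdb_chunk.pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded["texts"][0])  # → Paris is the capital of France.
```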

Best Practices

Batch Operations

Always use batch methods (get_rows, get_embeddings) instead of loops for better performance.

Read-Only Access

Use get_hash_id_to_row_readonly() for read-only operations to avoid unnecessary copying.

Incremental Updates

The store automatically handles incremental updates - only new items are encoded.

Namespace Organization

Use clear namespaces: “chunk”, “entity”, “fact”, “gist”, “verbatim” for different content types.

Performance Tips

  1. Batch size: Tune batch_size based on your GPU memory
  2. Deduplication: The store automatically deduplicates, so don’t pre-filter
  3. Memory usage: Large stores load all embeddings into memory - monitor RAM usage
  4. Persistence: Data is automatically saved after insertions
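For tip 3, a quick way to estimate RAM: float32 embeddings take 4 bytes per dimension, so memory is roughly n_items × dim × 4 bytes. Assuming NV-Embed-v2’s 4096-dimensional output:

```python
def embedding_ram_mb(n_items: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate in-memory size of float32 embeddings, in megabytes."""
    return n_items * dim * bytes_per_value / (1024 ** 2)

# e.g. 100k vectors at 4096 dims ≈ 1.5 GB of embeddings alone
print(f"{embedding_ram_mb(100_000, 4096):.0f} MB")
```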
