Overview

REMem supports multiple embedding models for encoding text into dense vectors. The embedding model is specified via the embedding_model_name parameter in BaseConfig.

Supported Models

NV-Embed-v2 (Default)

NVIDIA’s state-of-the-art embedding model with 4096-dimensional embeddings.
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Default
    embedding_batch_size=16,
    embedding_max_seq_len=2048
)
Features:
  • 4096-dimensional embeddings
  • Supports instruction-based encoding
  • Multi-GPU support with automatic device mapping
  • Requires local GPU inference

OpenAI Embeddings

Use OpenAI’s hosted embedding models:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",  # 3072 dimensions
    # OR
    # embedding_model_name="text-embedding-3-small",  # 1536 dimensions
    # embedding_model_name="text-embedding-ada-002",  # 1536 dimensions
    
    embedding_batch_size=16
)
Features:
  • Cloud-based (no local GPU required)
  • Automatic caching via SQLite
  • Handles content filtering gracefully
  • Parallel encoding for faster throughput

GritLM

Unified model for both retrieval and generation:
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)
Features:
  • Can perform both embedding and text generation
  • Instruction-based encoding
  • Multi-GPU support

Custom OpenAI-Compatible Servers

Use local embedding servers with OpenAI-compatible APIs:
config = BaseConfig(
    embedding_model_name="custom-model-name",
    llm_base_url="http://localhost:8001/v1/"
)

Configuration Options

Batch Size

Control encoding throughput:
config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    # Increase for faster encoding (if GPU memory allows)
    # Decrease if running out of memory
)
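The batch size simply controls how many texts go into each forward pass or API call. As a rough illustration of the chunking involved (a hypothetical `chunk_texts` helper, not part of the REMem API):

```python
def chunk_texts(texts, batch_size=16):
    """Split a list of texts into consecutive batches of at most batch_size."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

batches = chunk_texts([f"doc {i}" for i in range(40)], batch_size=16)
# 40 texts with batch_size=16 -> batches of 16, 16, and 8
```

A larger batch size means fewer passes (faster), at the cost of more GPU memory per pass.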

Sequence Length

Set maximum input length:
config = BaseConfig(
    embedding_max_seq_len=2048,  # Default: 2048 tokens
    # Adjust based on your document lengths
)

Normalization

Control whether embeddings are normalized:
config = BaseConfig(
    embedding_return_as_normalized=True,  # Default: True
    # Normalized embeddings enable cosine similarity via dot product
)
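Normalization matters because the dot product of two unit-length vectors equals their cosine similarity, which lets vector stores use a plain inner-product index. A quick stdlib-only demonstration:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# For normalized vectors, the dot product IS the cosine similarity
assert abs(dot(normalize(a), normalize(b)) - cosine) < 1e-12
```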

Using NV-Embed-v2

Installation

Install dependencies for NV-Embed-v2:
pip install transformers torch pynvml

Multi-GPU Setup

NV-Embed-v2 automatically uses multiple GPUs:
import os

# Specify visible GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2"
)
From src/remem/embedding_model/NVEmbedV2.py:21-53, the model checks GPU usage and distributes across available devices:
# Automatic GPU allocation based on free memory
# GPUs with >10% usage are excluded
# Remaining GPUs share the embedding workload
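The selection logic can be sketched as a pure function (a hypothetical helper that mirrors the behavior described above, not the actual NVEmbedV2.py code):

```python
def select_gpus(memory_usage_fractions, max_usage=0.10):
    """Return indices of GPUs whose memory usage is at or below max_usage.

    memory_usage_fractions: per-GPU used/total memory, e.g. as reported by pynvml.
    GPUs above the threshold are considered busy and excluded.
    """
    return [i for i, used in enumerate(memory_usage_fractions) if used <= max_usage]

# GPUs 0 and 3 are mostly free; GPUs 1 and 2 are busy
select_gpus([0.02, 0.55, 0.30, 0.08])  # -> [0, 3]
```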

Instruction-Based Encoding

NV-Embed-v2 supports task-specific instructions:
# Internal usage (handled by REMem)
embeddings = embedding_model.batch_encode(
    texts=["query text"],
    instruction="Retrieve relevant passages"  # Optional instruction
)

Using OpenAI Embeddings

Setup API Key

export OPENAI_API_KEY="your-api-key-here"

Basic Usage

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # OpenAI allows larger batches
)

Caching

OpenAI embeddings are automatically cached to reduce API costs:
# Cache location (auto-created):
# outputs/{dataset}/embedding_cache/{model_name}_embedding_cache.sqlite

config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    dataset="musique"  # Creates outputs/musique/embedding_cache/
)
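The cache is a simple key-value lookup: texts that were embedded before are served from SQLite instead of the API. A minimal sketch of that pattern (an illustration of the idea, not REMem's actual schema or key format):

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # REMem uses a file under outputs/.../embedding_cache/
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, embedding TEXT)")

def cache_key(text, model="text-embedding-3-large"):
    """Key on model + text so different models never share entries."""
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def get_or_embed(text, embed_fn):
    key = cache_key(text)
    row = conn.execute("SELECT embedding FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return json.loads(row[0])          # cache hit: no API call
    emb = embed_fn(text)                   # cache miss: call the API once
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, json.dumps(emb)))
    return emb
```

Re-running indexing with the same dataset name then costs nothing for texts already in the cache.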

Azure OpenAI

Use Azure-hosted OpenAI models:
export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_VERSION="2024-02-15-preview"
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    use_azure=True
)

Parallel Encoding

OpenAI embeddings support parallel processing:
# From src/remem/embedding_model/openai_embedding_client.py:283-309
# Automatic parallel encoding for large batches
# Each text is encoded independently to prevent batch failures
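That pattern can be sketched with a thread pool, where each text is submitted as its own task so one failure cannot sink a whole batch (a hypothetical sketch, not the actual client code):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_parallel(texts, encode_one, max_workers=8):
    """Encode each text independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode_one, texts))

# encode_one would wrap a single OpenAI embeddings call for one text
```

Because `pool.map` preserves order, the i-th embedding always corresponds to the i-th input text.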

Using GritLM

Installation

pip install gritlm

Configuration

config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Instruction Format

GritLM uses a specific instruction format:
# From src/remem/embedding_model/GritLM.py:71-72
# Format: "<|user|>\n{instruction}\n<|embed|>\n"
# Or just: "<|embed|>\n" if no instruction
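A small helper illustrating that format (the template strings come from the source above; the helper name itself is illustrative):

```python
def gritlm_embed_prompt(instruction=""):
    """Build the GritLM embedding prefix for an optional instruction."""
    if instruction:
        return f"<|user|>\n{instruction}\n<|embed|>\n"
    return "<|embed|>\n"

gritlm_embed_prompt("Retrieve relevant passages")
# -> "<|user|>\nRetrieve relevant passages\n<|embed|>\n"
```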

Custom Embedding Servers

Local Server Setup

Run a local embedding server with OpenAI-compatible API:
# Example with sentence-transformers server
python -m remem.embedding_model.sentence_transformer_server \
    --model nvidia/NV-Embed-v2 \
    --port 8001

Client Configuration

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Model name
    llm_base_url="http://localhost:8001/v1/"  # Server URL
)
When using custom servers, ensure the server is running before initializing REMem.
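"OpenAI-compatible" means the server exposes the standard POST /v1/embeddings endpoint. A sketch of the request body such a client sends (the model name is a placeholder):

```python
import json

def embeddings_request(texts, model="custom-model-name"):
    """Build the JSON body for a POST to {llm_base_url}embeddings."""
    return json.dumps({"model": model, "input": texts})

body = embeddings_request(["hello world"])
# A compatible server replies with {"data": [{"embedding": [...], "index": 0}, ...], ...}
```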

Performance Optimization

GPU Memory Management

For NV-Embed-v2, optimize GPU allocation:
import os

# Use specific GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8  # Reduce if OOM
)

Batch Size Tuning

Optimal batch sizes by model:
# NV-Embed-v2 (local GPU)
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=16  # Default, adjust based on GPU memory
)

# OpenAI (API)
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # Larger batches for API calls
)

# GritLM (local GPU)
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)

Caching for OpenAI

Maximize cache hits to reduce costs:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    force_index_from_scratch=False,  # Reuse cached embeddings
    dataset="my_dataset"  # Consistent dataset name for cache
)

Embedding Dimensions

Different models produce different embedding sizes:
Model                      Dimensions
nvidia/NV-Embed-v2         4096
text-embedding-3-large     3072
text-embedding-3-small     1536
text-embedding-ada-002     1536
GritLM/GritLM-7B           4096

From src/remem/embedding_model/openai_embedding_client.py:24-35:
def _get_embedding_dimension(embedding_model_name: str) -> int:
    if "text-embedding-3-large" in embedding_model_name:
        return 3072
    elif "text-embedding-3-small" in embedding_model_name:
        return 1536
    elif "Qwen3-Embedding-8B" in embedding_model_name:
        return 4096
    elif "NV-Embed-v2" in embedding_model_name:
        return 4096
    # ...

Troubleshooting

Out of Memory (OOM) Errors

Reduce batch size or use fewer GPUs:
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduced from 16
)

OpenAI Rate Limits

The client automatically retries with exponential backoff:
# From openai_embedding_client.py:54-87
# Automatic retry with backoff:
# - Base delay: 1 second
# - Max delay: 60 seconds
# - Exponential factor: 2
# - Max retries: 5 (configurable)
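With those parameters, the delay before retry n is min(60, 1 × 2ⁿ) seconds. A sketch of that schedule (helper name illustrative):

```python
def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=60.0):
    """Delay in seconds before retry number `attempt` (0-based)."""
    return min(max_delay, base * factor ** attempt)

[backoff_delay(n) for n in range(6)]
# -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]; later retries are capped at 60.0
```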

Content Filtering (OpenAI)

OpenAI may reject certain content; REMem substitutes a zero embedding as a fallback:
# From openai_embedding_client.py:367-371
# Automatically handled - creates zero vector
# Logs warning but continues processing
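That fallback behavior can be mimicked with a simple wrapper (hypothetical sketch; the real client also logs a warning):

```python
def safe_embed(text, embed_fn, dim=3072):
    """Return the embedding, or a zero vector if the API rejects the content."""
    try:
        return embed_fn(text)
    except Exception:
        # Content was filtered (or the call failed): fall back to zeros
        return [0.0] * dim
```

Note that `dim` must match the model's output dimension (e.g. 3072 for text-embedding-3-large), or the zero vector will not fit the index.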

Next Steps

Configuration

Explore all configuration options

Indexing

Learn how to index documents