Overview
REMem supports multiple embedding models for encoding text into dense vectors. The embedding model is specified via the embedding_model_name parameter in BaseConfig.
Supported Models
NV-Embed-v2 (Default)
NVIDIA’s state-of-the-art embedding model with 4096-dimensional embeddings.
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Default
    embedding_batch_size=16,
    embedding_max_seq_len=2048
)
Features:
- 4096-dimensional embeddings
- Supports instruction-based encoding
- Multi-GPU support with automatic device mapping
- Requires local GPU inference
OpenAI Embeddings
Use OpenAI’s hosted embedding models:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",  # 3072 dimensions
    # OR
    # embedding_model_name="text-embedding-3-small",  # 1536 dimensions
    # embedding_model_name="text-embedding-ada-002",  # 1536 dimensions
    embedding_batch_size=16
)
Features:
- Cloud-based (no local GPU required)
- Automatic caching via SQLite
- Handles content filtering gracefully
- Parallel encoding for faster throughput
GritLM
Unified model for both retrieval and generation:
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)
Features:
- Can perform both embedding and text generation
- Instruction-based encoding
- Multi-GPU support
Custom OpenAI-Compatible Servers
Use local embedding servers with OpenAI-compatible APIs:
config = BaseConfig(
    embedding_model_name="custom-model-name",
    llm_base_url="http://localhost:8001/v1/"
)
Configuration Options
Batch Size
Control encoding throughput:
config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    # Increase for faster encoding (if GPU memory allows)
    # Decrease if running out of memory
)
Sequence Length
Set maximum input length:
config = BaseConfig(
    embedding_max_seq_len=2048,  # Default: 2048 tokens
    # Adjust based on your document lengths
)
Normalization
Control whether embeddings are normalized:
config = BaseConfig(
    embedding_return_as_normalized=True,  # Default: True
    # Normalized embeddings enable cosine similarity via dot product
)
Using NV-Embed-v2
Installation
Install dependencies for NV-Embed-v2:
pip install transformers torch pynvml
Multi-GPU Setup
NV-Embed-v2 automatically uses multiple GPUs:
import os

# Specify visible GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2"
)
From src/remem/embedding_model/NVEmbedV2.py:21-53, the model checks GPU usage and distributes across available devices:
# Automatic GPU allocation based on free memory
# GPUs with >10% usage are excluded
# Remaining GPUs share the embedding workload
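The selection logic those comments describe can be sketched in isolation. This is a hypothetical helper illustrating the idea, not the actual NVEmbedV2.py code; the real implementation reads usage figures via pynvml:

```python
def select_free_gpus(usage_by_gpu, max_usage=0.10):
    """Return GPU indices whose memory usage is at or below the threshold.

    usage_by_gpu maps a GPU index to its fraction of memory in use
    (as a tool like pynvml would report it). GPUs above the threshold
    are excluded; the rest share the embedding workload.
    """
    return sorted(idx for idx, usage in usage_by_gpu.items() if usage <= max_usage)

# GPUs 1 and 3 are busy (>10% memory in use), so only 0 and 2 are selected.
print(select_free_gpus({0: 0.02, 1: 0.85, 2: 0.05, 3: 0.40}))  # [0, 2]
```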
Instruction-Based Encoding
NV-Embed-v2 supports task-specific instructions:
# Internal usage (handled by REMem)
embeddings = embedding_model.batch_encode(
    texts=["query text"],
    instruction="Retrieve relevant passages"  # Optional instruction
)
Using OpenAI Embeddings
Setup API Key
export OPENAI_API_KEY="your-api-key-here"
Basic Usage
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # OpenAI allows larger batches
)
Caching
OpenAI embeddings are automatically cached to reduce API costs:
# Cache location (auto-created):
# outputs/{dataset}/embedding_cache/{model_name}_embedding_cache.sqlite
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    dataset="musique"  # Creates outputs/musique/embedding_cache/
)
Azure OpenAI
Use Azure-hosted OpenAI models:
export AZURE_OPENAI_API_KEY="your-azure-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_VERSION="2024-02-15-preview"
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    use_azure=True
)
Parallel Encoding
OpenAI embeddings support parallel processing:
# From src/remem/embedding_model/openai_embedding_client.py:283-309
# Automatic parallel encoding for large batches
# Each text is encoded independently to prevent batch failures
Using GritLM
Installation
Install the GritLM package from PyPI:
pip install gritlm
Configuration
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)
GritLM uses a specific instruction format:
# From src/remem/embedding_model/GritLM.py:71-72
# Format: "<|user|>\n{instruction}\n<|embed|>\n"
# Or just: "<|embed|>\n" if no instruction
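The format in those comments can be reproduced with a small helper. This mirrors the behavior described for GritLM.py but is written here as an illustrative sketch:

```python
def gritlm_instruction(instruction=""):
    """Build the GritLM embedding prompt prefix.

    With an instruction: "<|user|>\n{instruction}\n<|embed|>\n"
    Without one:         "<|embed|>\n"
    """
    if instruction:
        return f"<|user|>\n{instruction}\n<|embed|>\n"
    return "<|embed|>\n"

print(repr(gritlm_instruction("Retrieve relevant passages")))
print(repr(gritlm_instruction()))
```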
Custom Embedding Servers
Local Server Setup
Run a local embedding server with OpenAI-compatible API:
# Example with sentence-transformers server
python -m remem.embedding_model.sentence_transformer_server \
--model nvidia/NV-Embed-v2 \
--port 8001
Client Configuration
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",  # Model name
    llm_base_url="http://localhost:8001/v1/"  # Server URL
)
When using custom servers, ensure the server is running before initializing REMem.
GPU Memory Management
For NV-Embed-v2, optimize GPU allocation:
import os

# Use specific GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8  # Reduce if OOM
)
Batch Size Tuning
Optimal batch sizes by model:
# NV-Embed-v2 (local GPU)
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=16  # Default; adjust based on GPU memory
)

# OpenAI (API)
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    embedding_batch_size=100  # Larger batches for API calls
)

# GritLM (local GPU)
config = BaseConfig(
    embedding_model_name="GritLM/GritLM-7B",
    embedding_batch_size=16
)
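Whichever batch size you pick, encoding walks the corpus chunk by chunk; the loop amounts to slicing the text list. A minimal sketch of that batching (illustrative, not REMem internals):

```python
def batched(texts, batch_size):
    """Yield successive slices of at most batch_size items each."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

texts = [f"doc-{i}" for i in range(37)]
sizes = [len(chunk) for chunk in batched(texts, 16)]
print(sizes)  # [16, 16, 5]
```

A larger batch size means fewer model or API calls, at the cost of more GPU memory (local models) or larger request payloads (API models).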
Caching for OpenAI
Maximize cache hits to reduce costs:
config = BaseConfig(
    embedding_model_name="text-embedding-3-large",
    force_index_from_scratch=False,  # Reuse cached embeddings
    dataset="my_dataset"  # Consistent dataset name for cache
)
Embedding Dimensions
Different models produce different embedding sizes:
| Model | Dimensions |
| --- | --- |
| nvidia/NV-Embed-v2 | 4096 |
| text-embedding-3-large | 3072 |
| text-embedding-3-small | 1536 |
| text-embedding-ada-002 | 1536 |
| GritLM/GritLM-7B | 4096 |
From src/remem/embedding_model/openai_embedding_client.py:24-35:
def _get_embedding_dimension(embedding_model_name: str) -> int:
    if "text-embedding-3-large" in embedding_model_name:
        return 3072
    elif "text-embedding-3-small" in embedding_model_name:
        return 1536
    elif "Qwen3-Embedding-8B" in embedding_model_name:
        return 4096
    elif "NV-Embed-v2" in embedding_model_name:
        return 4096
    # ...
Troubleshooting
Out of Memory (OOM) Errors
Reduce batch size or use fewer GPUs:
config = BaseConfig(
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=8,  # Reduced from 16
)
OpenAI Rate Limits
The client automatically retries with exponential backoff:
# From openai_embedding_client.py:54-87
# Automatic retry with backoff:
# - Base delay: 1 second
# - Max delay: 60 seconds
# - Exponential factor: 2
# - Max retries: 5 (configurable)
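With those parameters (base 1 s, factor 2, cap 60 s), the delay schedule is easy to compute. A sketch of the schedule only, not the actual retry loop:

```python
def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=60.0):
    """Delay in seconds before retry number `attempt` (0-indexed), capped at max_delay."""
    return min(base * factor ** attempt, max_delay)

# Delays grow 1, 2, 4, 8, ... seconds until hitting the 60-second cap.
print([backoff_delay(a) for a in range(7)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```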
Content Filtering (OpenAI)
OpenAI may reject certain content. REMem creates zero embeddings as fallback:
# From openai_embedding_client.py:367-371
# Automatically handled - creates zero vector
# Logs warning but continues processing
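The fallback amounts to substituting a zero vector of the model's dimensionality when the API refuses a text, so one rejected document cannot abort a whole indexing run. Sketched here with a hypothetical helper (the real client catches the API's content-filter error, not `ValueError`):

```python
def embed_with_fallback(text, embed_fn, dim=3072):
    """Return the embedding, or a zero vector if the API rejects the content."""
    try:
        return embed_fn(text)
    except ValueError:
        # Stand-in for the API's content-filter error. A zero vector keeps
        # downstream indexing running; the real client logs a warning here.
        return [0.0] * dim

def picky_api(text):
    raise ValueError("content filtered")

print(embed_with_fallback("blocked text", picky_api, dim=4))  # [0.0, 0.0, 0.0, 0.0]
```

Note that a zero vector has zero similarity to every query under dot-product search, so filtered documents simply never rank.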
Next Steps
- Configuration: explore all configuration options
- Indexing: learn how to index documents