
Overview

NVEmbedV2EmbeddingModel provides access to NVIDIA’s NV-Embed-v2 model, a high-performance embedding model with 4096-dimensional vectors. The model supports multi-GPU deployment and handles long contexts up to 32,768 tokens.

Class Definition

from remem.embedding_model.NVEmbedV2 import NVEmbedV2EmbeddingModel
Defined in: src/remem/embedding_model/NVEmbedV2.py:56

Initialization

__init__

def __init__(
    self,
    global_config: Optional[BaseConfig] = None,
    embedding_model_name: Optional[str] = None
) -> None
Parameters:
global_config
BaseConfig
default:"None"
Global configuration object containing:
  • embedding_return_as_normalized: Whether to normalize embeddings
  • embedding_max_seq_len: Maximum sequence length (default: 32768)
  • embedding_batch_size: Batch size for encoding
embedding_model_name
str
default:"None"
Model name/path. If provided, overrides the name from global_config. Typically "nvidia/NV-Embed-v2".
Example:
from remem.utils.config_utils import BaseConfig
from remem.embedding_model.NVEmbedV2 import NVEmbedV2EmbeddingModel

config = BaseConfig()
config.embedding_model_name = "nvidia/NV-Embed-v2"
config.embedding_max_seq_len = 32768
config.embedding_batch_size = 16

model = NVEmbedV2EmbeddingModel(global_config=config)
print(f"Embedding dimension: {model.embedding_dim}")  # 4096

Attributes

embedding_model
AutoModel
The loaded HuggingFace transformer model
embedding_dim
int
Embedding dimension (4096 for NV-Embed-v2)
embedding_config
EmbeddingConfig
Configuration containing:
  • embedding_model_name: Model identifier
  • norm: Whether to normalize embeddings
  • model_init_params: Parameters for model loading
  • encode_params: Default encoding parameters

Methods

batch_encode

def batch_encode(self, texts: List[str], **kwargs) -> np.ndarray
Encodes a batch of text strings into embeddings.
Parameters:
texts
List[str] | str
required
Text strings to encode. Can be a single string or list of strings.
instruction
str
default:"''"
Optional task instruction. Will be formatted as: "Instruct: {instruction}\nQuery: "
max_length
int
default:"32768"
Maximum sequence length for tokenization
batch_size
int
default:"16"
Number of texts to process in each batch
num_workers
int
default:"32"
Number of worker threads for processing
Returns:
embeddings
np.ndarray
2D numpy array of shape (n_texts, 4096). Normalized if embedding_return_as_normalized is True.
Example:
# Simple encoding
texts = [
    "What is machine learning?",
    "Explain neural networks"
]
embs = model.batch_encode(texts)
print(embs.shape)  # (2, 4096)

# With instruction
query_embs = model.batch_encode(
    ["machine learning"],
    instruction="Represent this query for retrieval"
)

doc_embs = model.batch_encode(
    ["Machine learning is a field of AI"],
    instruction=""  # No instruction for documents
)

# Large batch with custom batch size
large_corpus = [f"Document {i}" for i in range(1000)]
embs = model.batch_encode(large_corpus, batch_size=32)
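The instruction prefix format described above ("Instruct: {instruction}\nQuery: ") can be illustrated with a small sketch. The format_instruction helper below is hypothetical and not part of the API, and the assumption that an empty instruction produces an empty prefix (as the document example with instruction="" suggests) should be verified against the source:

```python
# Hypothetical helper illustrating the instruction prefix format
# used for NV-Embed-v2 queries: "Instruct: {instruction}\nQuery: ".
def format_instruction(instruction: str) -> str:
    """Return the prompt prefix for a task instruction.

    An empty instruction yields no prefix (assumption: matches the
    document example that passes instruction="" for documents).
    """
    if not instruction:
        return ""
    return f"Instruct: {instruction}\nQuery: "

prefix = format_instruction("Represent this query for retrieval")
print(repr(prefix))  # 'Instruct: Represent this query for retrieval\nQuery: '
```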

GPU Memory Management

The model automatically distributes across available GPUs using the get_max_memory utility:
def get_max_memory(usage_threshold_percent: float = 10) -> Dict[int, str]
Defined in: src/remem/embedding_model/NVEmbedV2.py:21
Behavior:
  • Detects GPUs from CUDA_VISIBLE_DEVICES environment variable
  • Skips GPUs with usage above threshold (default 10%)
  • Allocates free memory on available GPUs
  • Automatically balances model across multiple GPUs
Example:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# Model will automatically use available GPUs
model = NVEmbedV2EmbeddingModel(global_config=config)
# Output:
# DEVICE 0 is using 5.2% of its memory, has 78.3GB free. It is available for embedding model.
# DEVICE 1 is using 85.1% of its memory. It will not be used for embedding model.
# DEVICE 2 is using 3.1% of its memory, has 79.8GB free. It is available for embedding model.
# ...
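The threshold-based GPU selection described above can be sketched in pure Python. The memory figures below are mocked and the select_devices function is an illustrative stand-in, not the real get_max_memory, which queries CUDA devices directly:

```python
# Sketch of the GPU-selection threshold logic (mocked memory stats;
# the real utility reads per-device usage from CUDA).
def select_devices(stats, usage_threshold_percent=10.0):
    """Map device id -> free-memory string for GPUs under the usage threshold.

    stats: {device_id: (used_gb, total_gb)}
    """
    max_memory = {}
    for device_id, (used_gb, total_gb) in stats.items():
        usage = used_gb / total_gb * 100
        if usage > usage_threshold_percent:
            # Busy GPU: skip it, as described in the behavior list above.
            print(f"DEVICE {device_id} is using {usage:.1f}% of its memory. "
                  "It will not be used for embedding model.")
            continue
        free_gb = total_gb - used_gb
        print(f"DEVICE {device_id} is using {usage:.1f}% of its memory, "
              f"has {free_gb:.1f}GB free. It is available for embedding model.")
        max_memory[device_id] = f"{int(free_gb)}GB"
    return max_memory

mocked = {0: (4.2, 80.0), 1: (68.1, 80.0), 2: (2.5, 80.0)}
print(select_devices(mocked))  # {0: '75GB', 2: '77GB'}
```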

Configuration Details

The model initializes with the following default configuration:
{
    "embedding_model_name": "nvidia/NV-Embed-v2",
    "norm": True,  # Normalize embeddings
    "model_init_params": {
        "pretrained_model_name_or_path": "nvidia/NV-Embed-v2",
        "trust_remote_code": True,
        "device_map": "auto",  # Multi-GPU support
        "max_memory": {0: "78GB", 1: "0GB", ...}  # Auto-detected
    },
    "encode_params": {
        "max_length": 32768,
        "instruction": "",
        "batch_size": 16,
        "num_workers": 32
    }
}

Performance Considerations

NV-Embed-v2 is a large model that requires significant GPU memory. For multi-GPU setups, ensure GPUs are relatively idle before loading the model.
Optimization Tips:
  1. Batch Size: Increase for higher throughput on large GPUs
  2. Multi-GPU: The model automatically uses available GPUs
  3. Long Contexts: Supports up to 32,768 tokens
  4. Normalization: Enabled by default for cosine similarity
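Because normalization is enabled by default, cosine similarity between embeddings reduces to a plain dot product. The vectors below are synthetic stand-ins for model output, so the snippet runs without a GPU:

```python
import numpy as np

# Synthetic stand-ins for NV-Embed-v2 output (real embeddings are 4096-d).
rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 4096)).astype(np.float32)
# Unit-normalize rows, mirroring embedding_return_as_normalized=True.
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

# With unit vectors, the full cosine-similarity matrix is one matmul.
sims = embs @ embs.T
print(sims.shape)                                  # (3, 3)
print(np.allclose(np.diag(sims), 1.0, atol=1e-5))  # True: each vector matches itself
```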
