Overview
Zvec provides local embedding models built on the Sentence Transformers library. They run on your own hardware (CPU/GPU), require no API calls, and work offline after the initial download.
Location: python/zvec/extension/sentence_transformer_embedding_function.py
Installation
pip install sentence-transformers
# For ModelScope (recommended for users in China)
pip install modelscope
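A quick sanity check that the dependency is importable (version output will vary with your environment):
import sentence_transformers
print(sentence_transformers.__version__)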
DefaultLocalDenseEmbedding
Local dense embedding using all-MiniLM-L6-v2 model (or Chinese-optimized alternative).
Constructor
from zvec.extension import DefaultLocalDenseEmbedding
DefaultLocalDenseEmbedding(
model_source: Literal["huggingface", "modelscope"] = "huggingface",
device: Optional[str] = None,
normalize_embeddings: bool = True,
batch_size: int = 32,
**kwargs
)
Parameters
model_source
Literal['huggingface', 'modelscope']
default:"huggingface"
Model source:
"huggingface": Use Hugging Face Hub (default, for international users)
"modelscope": Use ModelScope (recommended for users in China)
device
Optional[str]
default:"None"
Device to run the model on:
"cpu": CPU inference
"cuda": NVIDIA GPU
"mps": Apple Silicon GPU
None: Automatic detection (see the sketch after this parameter list)
normalize_embeddings
bool
default:"True"
Whether to normalize embeddings to unit length (L2 normalization). Useful for cosine similarity.
batch_size
int
default:"32"
Batch size used when encoding text.
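When device is None, selection typically falls back from CUDA to MPS to CPU. A minimal sketch of that detection order, assuming PyTorch is installed (pick_device is a hypothetical helper; the actual selection happens inside the library):
import torch
def pick_device() -> str:
    # Hypothetical helper mirroring the usual fallback order
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"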
Properties
dimension (int): Always 384 for both models
model_name (str): "all-MiniLM-L6-v2" (Hugging Face) or "iic/nlp_gte_sentence-embedding_chinese-small" (ModelScope)
Methods
embed()
def embed(self, input: str) -> DenseVectorType:
"""Generate dense embedding vector for the input text."""
Parameters:
input (str): Input text string to embed. Inputs longer than the model's maximum sequence length (256 tokens for all-MiniLM-L6-v2) are truncated.
Returns:
DenseVectorType: List of floats representing the embedding vector (384 dimensions).
Raises:
TypeError: If input is not a string
ValueError: If input is empty
RuntimeError: If model inference fails
Usage Examples
Basic Usage (Hugging Face)
from zvec.extension import DefaultLocalDenseEmbedding
emb_func = DefaultLocalDenseEmbedding()
vector = emb_func.embed("Hello, world!")
print(len(vector)) # 384
print(isinstance(vector, list)) # True
ModelScope (For Users in China)
# Recommended for users in China
emb_func = DefaultLocalDenseEmbedding(model_source="modelscope")
vector = emb_func.embed("你好,世界!") # Works well with Chinese
print(len(vector)) # 384
Alternative: Hugging Face Mirror
import os
# Use HF mirror for users in China
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
emb_func = DefaultLocalDenseEmbedding() # Uses HF mirror
vector = emb_func.embed("Hello, world!")
GPU Acceleration
# Use CUDA GPU
emb_func = DefaultLocalDenseEmbedding(device="cuda")
vector = emb_func.embed("Machine learning is fascinating")
# Normalized vectors have unit length
import numpy as np
print(np.linalg.norm(vector))  # ≈ 1.0
Semantic Similarity
import numpy as np
emb_func = DefaultLocalDenseEmbedding()
v1 = emb_func.embed("The cat sits on the mat")
v2 = emb_func.embed("A feline rests on a rug")
v3 = emb_func.embed("Python programming")
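# Dot products of unit-norm vectors equal cosine similarity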
similarity_high = np.dot(v1, v2) # Similar sentences
similarity_low = np.dot(v1, v3) # Different topics
print(similarity_high > similarity_low) # True
DefaultLocalSparseEmbedding
Local sparse embedding using SPLADE (SParse Lexical AnD Expansion) model. Generates sparse, interpretable representations ideal for lexical matching and hybrid search.
Constructor
from zvec.extension import DefaultLocalSparseEmbedding
DefaultLocalSparseEmbedding(
model_source: Literal["huggingface", "modelscope"] = "huggingface",
device: Optional[str] = None,
encoding_type: Literal["query", "document"] = "query",
**kwargs
)
Parameters
model_source
Literal['huggingface', 'modelscope']
default:"huggingface"
Model source (ModelScope support may vary for SPLADE models).
device
Optional[str]
default:"None"
Device to run the model on ("cpu", "cuda", "mps", or None).
encoding_type
Literal['query', 'document']
default:"query"
Encoding type:
"query": Optimize for search queries (default)
"document": Optimize for indexed documents
Properties
model_name (str): "naver/splade-cocondenser-ensembledistil"
model_source (str): The model source being used
Methods
embed()
def embed(self, input: str) -> SparseVectorType:
"""Generate sparse embedding vector for the input text."""
Parameters:
input (str): Input text string to embed.
Returns:
SparseVectorType: Dictionary mapping dimension index to weight. Only non-zero dimensions are included; keys are sorted in ascending index order.
Raises:
TypeError: If input is not a string
ValueError: If input is empty
RuntimeError: If model inference fails
Cache Management
SPLADE models are cached at class level to save memory when using multiple instances:
# Clear all cached models
DefaultLocalSparseEmbedding.clear_cache()
# Get cache information
info = DefaultLocalSparseEmbedding.get_cache_info()
print(f"Cached models: {info['cached_models']}")
# Remove specific model from cache
removed = DefaultLocalSparseEmbedding.remove_from_cache(device="cuda")
Usage Examples
Basic Usage
from zvec.extension import DefaultLocalSparseEmbedding
# Query embedding
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_emb.embed("machine learning algorithms")
print(type(query_vec)) # <class 'dict'>
print(len(query_vec)) # ~150-200 non-zero dimensions
Memory-Efficient Dual Encoders
# Both instances share the same model (~200MB total, not 400MB)
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")
query_vec = query_emb.embed("what causes aging fast")
doc_vec = doc_emb.embed(
"UV-A light causes tanning, skin aging, and cataracts..."
)
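Because the SPLADE model is cached at class level, you can confirm that one model backs both encoders via the cache introspection shown above (the exact format of the returned value may vary):
info = DefaultLocalSparseEmbedding.get_cache_info()
print(info["cached_models"])  # a single cached model serves both instances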
Asymmetric Retrieval
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")
query_vec = query_emb.embed("machine learning")
doc_vec = doc_emb.embed("Machine learning is a subset of AI")
# Calculate similarity (dot product)
similarity = sum(
query_vec.get(k, 0) * doc_vec.get(k, 0)
for k in set(query_vec) | set(doc_vec)
)
print(f"Similarity: {similarity}")
Inspecting Sparse Dimensions
query_vec = query_emb.embed("machine learning")
# Sorted by indices
print(list(query_vec.items())[:5])
# [(10, 0.45), (23, 0.87), (56, 0.32), (89, 1.12), (120, 0.65)]
# Sort by weight to find top terms
top_terms = sorted(query_vec.items(), key=lambda x: x[1], reverse=True)[:5]
for idx, weight in top_terms:
print(f"Dimension {idx}: {weight:.3f}")
# Dimension 1023: 1.450
# Dimension 245: 1.230
# Dimension 8901: 0.980
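Each dimension index is a token id in the model's vocabulary, so top dimensions can be mapped back to readable terms. A sketch assuming the standard Hugging Face tokenizer for this model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")
for idx, weight in top_terms:
    token = tokenizer.convert_ids_to_tokens([idx])[0]
    print(f"{token}: {weight:.3f}")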
Hybrid Retrieval
Combine dense and sparse embeddings for optimal search:
from zvec.extension import (
DefaultLocalDenseEmbedding,
DefaultLocalSparseEmbedding
)
# Dense for semantic similarity
dense_emb = DefaultLocalDenseEmbedding()
# Sparse for lexical matching
sparse_emb = DefaultLocalSparseEmbedding()
query = "deep learning neural networks"
# Get both embeddings
dense_vec = dense_emb.embed(query) # [0.1, -0.3, 0.5, ...]
sparse_vec = sparse_emb.embed(query) # {12: 0.8, 45: 1.2, ...}
# Combine scores for hybrid retrieval
# final_score = α * dense_score + (1-α) * sparse_score
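A minimal sketch of that combination (hybrid_score is a hypothetical helper; in practice, normalize the two scores to comparable ranges before mixing):
def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    # Dense: dot product (equals cosine similarity for unit-norm vectors)
    dense_score = sum(a * b for a, b in zip(dense_q, dense_d))
    # Sparse: dot product over the query's non-zero dimensions
    sparse_score = sum(w * sparse_d.get(k, 0.0) for k, w in sparse_q.items())
    return alpha * dense_score + (1 - alpha) * sparse_score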
Dense Model (all-MiniLM-L6-v2)
- Dimensions: 384
- Model Size: ~50-80MB
- Speed: ~1,000 sentences/sec (CPU), ~10,000 sentences/sec (GPU)
- Cache: ~/.cache/torch/sentence_transformers/
- Best For: General-purpose semantic similarity
Dense Model (ModelScope Chinese)
- Model: iic/nlp_gte_sentence-embedding_chinese-small
- Dimensions: 384
- Cache: ~/.cache/modelscope/hub/
- Best For: Chinese text processing
Sparse Model (SPLADE)
- Model: naver/splade-cocondenser-ensembledistil
- Dimensions: 30,522 (BERT vocabulary size)
- Non-zero values: ~100-200 per text
- Model Size: ~100MB
- Best For: Lexical matching, hybrid search
Best Practices
First Download: On first run, models are downloaded automatically. Ensure you have:
- Stable internet connection
- ~200MB free disk space
- Write permissions to cache directory
For Users in China: Use ModelScope or the HF mirror to avoid connection issues:
# Option 1: ModelScope
emb = DefaultLocalDenseEmbedding(model_source="modelscope")
# Option 2: HF Mirror
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
emb = DefaultLocalDenseEmbedding()
GPU Memory: Dense models require ~200MB GPU memory, sparse models ~300MB. Monitor GPU usage when using CUDA.
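One way to spot-check CUDA memory from Python, assuming PyTorch:
import torch
if torch.cuda.is_available():
    print(f"{torch.cuda.memory_allocated() / 1e6:.0f} MB allocated")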
Comparison
| Feature | Dense (all-MiniLM) | Sparse (SPLADE) |
|---|---|---|
| Output Format | List (384 floats) | Dict (~150 non-zero) |
| Model Size | ~80MB | ~100MB |
| Inference Speed | Fast | Medium |
| Best For | Semantic similarity | Keyword matching |
| Interpretability | Low | High |
| Memory (per vector) | 1.5KB | 1-2KB |
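The per-vector figures follow from simple arithmetic: a dense vector stores 384 float32 values at 4 bytes each (384 × 4 = 1,536 bytes ≈ 1.5KB), while a sparse vector stores roughly 150 (index, weight) pairs at about 8 bytes each (~1.2KB, varying with text length).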
Error Handling
try:
vector = emb_func.embed("") # Empty string
except ValueError as e:
print(f"Error: {e}")
# Error: Input text cannot be empty or whitespace only
try:
vector = emb_func.embed(123) # Wrong type
except TypeError as e:
print(f"Error: {e}")
# Error: Expected 'input' to be str, got int
Notes
- Requires Python 3.10, 3.11, or 3.12
- No API keys or authentication required
- Works offline after initial download
- First call slower due to model loading
- GPU provides 5-10x speedup over CPU
- Models stay in memory for subsequent calls
See Also