## Overview

Zvec provides integration with Sentence Transformers for local (offline) embeddings and reranking:

- **Dense embeddings:** `DefaultLocalDenseEmbedding` - uses the all-MiniLM-L6-v2 model
- **Sparse embeddings:** `DefaultLocalSparseEmbedding` - uses a SPLADE model
- **Reranking:** `DefaultLocalReRanker` - uses cross-encoder models

These models run entirely locally, with no API keys or network connectivity required (after the initial model download).
## Installation

```bash
pip install sentence-transformers
```

For ModelScope support (recommended for users in China):

```bash
pip install sentence-transformers modelscope
```
## Dense Embeddings

### Basic Usage

```python
from zvec.extension import DefaultLocalDenseEmbedding

# Using Hugging Face (default)
emb_func = DefaultLocalDenseEmbedding()
vector = emb_func.embed("Hello, world!")
print(f"Dimension: {len(vector)}")
# Output: Dimension: 384
```
### Using ModelScope (China)

For users in China who experience Hugging Face access issues:

```python
# Recommended for users in China
emb_func = DefaultLocalDenseEmbedding(model_source="modelscope")
vector = emb_func.embed("你好,世界!")
```

Alternatively, use a Hugging Face mirror:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

```python
emb_func = DefaultLocalDenseEmbedding()  # Uses the HF mirror
vector = emb_func.embed("Hello, world!")
```
### GPU Acceleration

```python
# Use GPU for faster inference
emb_func = DefaultLocalDenseEmbedding(device="cuda")
vector = emb_func.embed("Machine learning is fascinating")

# Apple Silicon
emb_func = DefaultLocalDenseEmbedding(device="mps")
```
### Configuration Options

```python
emb_func = DefaultLocalDenseEmbedding(
    model_source="huggingface",  # or "modelscope"
    device="cuda",               # "cpu", "cuda", "mps", or None
    normalize_embeddings=True,   # L2-normalize vectors
    batch_size=32                # Batch size for encoding
)
```
### Semantic Similarity

Because embeddings are L2-normalized by default, the dot product of two vectors equals their cosine similarity:

```python
import numpy as np

from zvec.extension import DefaultLocalDenseEmbedding

emb_func = DefaultLocalDenseEmbedding()
v1 = emb_func.embed("The cat sits on the mat")
v2 = emb_func.embed("A feline rests on a rug")
v3 = emb_func.embed("Python programming")

similarity_high = np.dot(v1, v2)  # Similar sentences
similarity_low = np.dot(v1, v3)   # Different topics
print(f"High similarity: {similarity_high:.4f}")
print(f"Low similarity: {similarity_low:.4f}")
```
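If you set `normalize_embeddings=False`, the dot product no longer equals cosine similarity, so compute it explicitly. A minimal NumPy sketch; the stand-in vectors below substitute for real `embed()` output:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in unnormalized vectors; real ones come from emb_func.embed(...)
v1 = [0.2, 0.4, 0.1]
v2 = [0.4, 0.8, 0.2]
print(f"{cosine(v1, v2):.4f}")  # 1.0000 - parallel vectors
```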
## Sparse Embeddings

### Basic Usage

Sparse embeddings are ideal for keyword-based search and hybrid retrieval:

```python
from zvec.extension import DefaultLocalSparseEmbedding

# Query embedding
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_emb.embed("machine learning algorithms")
print(f"Type: {type(query_vec)}")
print(f"Non-zero dimensions: {len(query_vec)}")
# Output: Type: <class 'dict'>
# Output: Non-zero dimensions: 156
```
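The returned dict maps vocabulary dimension indices to weights, so it is easy to inspect which dimensions dominate. A sketch over a stand-in vector (real dicts come from `embed()`; the indices and weights here are illustrative):

```python
# Stand-in sparse vector: {dimension_index: weight}
sparse_vec = {1012: 0.3, 4518: 1.7, 9871: 0.9, 14523: 1.2}

# Top-weighted dimensions, highest weight first
top = sorted(sparse_vec.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [(4518, 1.7), (14523, 1.2), (9871, 0.9)]
```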
### Asymmetric Retrieval

SPLADE encodes queries and documents differently, so use the matching `encoding_type` on each side:

```python
# Query embedding
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
query_vec = query_emb.embed("what causes aging fast")

# Document embedding
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")
doc_vec = doc_emb.embed(
    "UV-A light causes tanning, skin aging, and cataracts..."
)

# Calculate similarity (dot product over shared dimensions)
similarity = sum(
    query_vec.get(k, 0) * doc_vec.get(k, 0)
    for k in set(query_vec) | set(doc_vec)
)
```
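The same dot product can rank a batch of documents against one query. A sketch with stand-in sparse vectors (the document names, dimensions, and weights are illustrative):

```python
def sparse_dot(q, d):
    # Only dimensions present in both vectors contribute to the score.
    return sum(weight * d.get(dim, 0.0) for dim, weight in q.items())

query_vec = {3: 1.2, 7: 0.8}
doc_vecs = {
    "doc_a": {3: 0.9, 5: 0.4},  # overlaps on dim 3 -> score 1.08
    "doc_b": {7: 1.1, 9: 0.2},  # overlaps on dim 7 -> score 0.88
    "doc_c": {2: 0.5},          # no overlap -> score 0.0
}
ranked = sorted(doc_vecs, key=lambda k: sparse_dot(query_vec, doc_vecs[k]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```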
### Memory-Efficient Model Caching

Instances that use the same underlying model share it to save memory:

```python
# Both instances share the same model (~200MB, not 400MB)
query_emb = DefaultLocalSparseEmbedding(encoding_type="query")
doc_emb = DefaultLocalSparseEmbedding(encoding_type="document")
# Total memory: ~200MB thanks to model caching
```
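The sharing works through a class-level cache keyed by model configuration. A generic sketch of the pattern (illustrative only, not Zvec's actual implementation):

```python
class ModelCache:
    """Load each model once per key and reuse it across instances."""
    _cache = {}

    @classmethod
    def get(cls, key, loader):
        if key not in cls._cache:
            cls._cache[key] = loader()  # expensive load happens only once
        return cls._cache[key]

loads = []
def load_model():
    loads.append("loaded")
    return object()  # stands in for the real model

a = ModelCache.get(("splade", "cpu"), load_model)
b = ModelCache.get(("splade", "cpu"), load_model)
print(a is b, len(loads))  # True 1
```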
### Cache Management

```python
# Check cache status
info = DefaultLocalSparseEmbedding.get_cache_info()
print(f"Cached models: {info['cached_models']}")

# Clear the cache to free memory
DefaultLocalSparseEmbedding.clear_cache()

# Remove a specific model from the cache
removed = DefaultLocalSparseEmbedding.remove_from_cache(device="cuda")
print(f"Removed: {removed}")
```
## Hybrid Retrieval

Combine dense and sparse embeddings for the best retrieval performance:

```python
from zvec.extension import DefaultLocalDenseEmbedding, DefaultLocalSparseEmbedding

dense_emb = DefaultLocalDenseEmbedding()
sparse_emb = DefaultLocalSparseEmbedding()

query = "deep learning neural networks"
dense_vec = dense_emb.embed(query)    # [0.1, -0.3, 0.5, ...]
sparse_vec = sparse_emb.embed(query)  # {12: 0.8, 45: 1.2, ...}
```
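One common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below is illustrative of the idea and may differ from how Zvec merges hybrid results internally:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranked = ["d1", "d2", "d3"]   # IDs ordered by dense similarity
sparse_ranked = ["d2", "d4", "d1"]  # IDs ordered by sparse score
fused = rrf([dense_ranked, sparse_ranked])
print(fused)  # ['d2', 'd1', 'd4', 'd3']
```

Documents that appear near the top of both lists rise above documents that score well in only one.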
## Reranking

### Basic Usage

```python
from zvec.extension import DefaultLocalReRanker
from zvec import Collection

# Initialize the reranker
reranker = DefaultLocalReRanker(
    query="machine learning algorithms",
    topn=5,
    rerank_field="content"
)

# Use in a collection query (assumes an existing collection and query_vector)
results = collection.query(
    data={"vector": [query_vector]},
    reranker=reranker,
    topk=20  # Retrieve 20 candidates, rerank to the top 5
)
```
### Model Selection

```python
# Default: MS MARCO MiniLM-L6-v2 (lightweight, fast, ~80MB)
reranker = DefaultLocalReRanker(
    query="neural networks",
    topn=5,
    rerank_field="content"
)

# Better accuracy: MS MARCO MiniLM-L12-v2 (~120MB)
reranker = DefaultLocalReRanker(
    query="neural networks",
    topn=5,
    rerank_field="content",
    model_name="cross-encoder/ms-marco-MiniLM-L12-v2"
)

# Highest quality: BGE Reranker Large (~560MB)
reranker = DefaultLocalReRanker(
    query="neural networks",
    topn=5,
    rerank_field="content",
    model_name="BAAI/bge-reranker-large",
    device="cuda",
    batch_size=64
)
```
### Available Models

| Model | Size | Description |
|---|---|---|
| `cross-encoder/ms-marco-MiniLM-L6-v2` | ~80MB | Lightweight, fast (default) |
| `cross-encoder/ms-marco-MiniLM-L12-v2` | ~120MB | Better accuracy |
| `BAAI/bge-reranker-base` | ~280MB | BGE Reranker Base |
| `BAAI/bge-reranker-large` | ~560MB | Highest quality |
### Using ModelScope (China)

```python
reranker = DefaultLocalReRanker(
    query="机器学习算法",
    topn=10,
    rerank_field="content",
    model_source="modelscope"
)
```
### Configuration Options

```python
reranker = DefaultLocalReRanker(
    query="machine learning",    # Required: query text
    topn=10,                     # Number of results to return
    rerank_field="content",      # Required: document field
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2",
    model_source="huggingface",  # or "modelscope"
    device="cuda",               # "cpu", "cuda", "mps", or None
    batch_size=32                # Batch size for processing
)
```
## Using with Zvec Collections

### Dense Embeddings

```python
from zvec import Collection, DataType
from zvec.extension import DefaultLocalDenseEmbedding

emb_func = DefaultLocalDenseEmbedding()

collection = Collection(name="documents")
collection.create_field("id", DataType.INT64, is_primary=True)
collection.create_field("text", DataType.VARCHAR, max_length=512)
collection.create_field(
    name="vector",
    dtype=DataType.VECTOR_FP32,
    dimension=384,
    embedding_function=emb_func
)
collection.create()

# Insert data - embeddings are generated automatically
collection.insert([
    {"id": 1, "text": "Introduction to machine learning"},
    {"id": 2, "text": "Deep learning with neural networks"},
    {"id": 3, "text": "Natural language processing basics"}
])

# Query with automatic embedding
results = collection.query(
    data={"vector": ["machine learning algorithms"]},
    output_fields=["id", "text"],
    topk=2
)
for result in results:
    print(f"ID: {result['id']}, Text: {result['text']}")
```
### Sparse Embeddings

```python
from zvec import Collection, DataType
from zvec.extension import DefaultLocalSparseEmbedding

sparse_func = DefaultLocalSparseEmbedding(encoding_type="document")

collection = Collection(name="documents")
collection.create_field("id", DataType.INT64, is_primary=True)
collection.create_field("text", DataType.VARCHAR, max_length=512)
collection.create_field(
    name="sparse_vector",
    dtype=DataType.VECTOR_SPARSE_FP32,
    dimension=30522,  # SPLADE vocabulary size
    embedding_function=sparse_func
)
collection.create()
```
## Error Handling

```python
from zvec.extension import DefaultLocalDenseEmbedding

try:
    emb_func = DefaultLocalDenseEmbedding()
    emb_func.embed("")  # Empty string
except ValueError as e:
    print(f"Error: {e}")
    # Output: Error: Input text cannot be empty or whitespace only

try:
    emb_func.embed(123)  # Non-string input
except TypeError as e:
    print(f"Error: {e}")
    # Output: Error: Expected 'input' to be str, got int
```
## DefaultLocalDenseEmbedding Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_source` | string | `"huggingface"` | Model source: `"huggingface"` or `"modelscope"` |
| `device` | string | | Device to run the model on: `"cpu"`, `"cuda"`, `"mps"`, or `None` for automatic detection |
| `normalize_embeddings` | bool | | Whether to normalize embeddings to unit length (L2 normalization) |
## DefaultLocalSparseEmbedding Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_source` | string | `"huggingface"` | Model source: `"huggingface"` or `"modelscope"` |
| `device` | string | | Device to run the model on |
| `encoding_type` | string | | Encoding type: `"query"` or `"document"` |
## DefaultLocalReRanker Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | (required) | Query text for semantic re-ranking |
| `topn` | int | | Maximum number of documents to return after re-ranking |
| `rerank_field` | string | (required) | Document field name to use as re-ranking input text |
| `model_name` | string | `"cross-encoder/ms-marco-MiniLM-L6-v2"` | Cross-encoder model identifier or local path |
| `model_source` | string | `"huggingface"` | Model source: `"huggingface"` or `"modelscope"` |
| `device` | string | | Device to run the model on |
| `batch_size` | int | | Batch size for processing query-document pairs |
## Notes

- `DefaultLocalDenseEmbedding`: uses all-MiniLM-L6-v2 (Hugging Face) or nlp_gte_sentence-embedding_chinese-small (ModelScope)
- `DefaultLocalSparseEmbedding`: uses naver/splade-cocondenser-ensembledistil
- `DefaultLocalReRanker`: uses cross-encoder/ms-marco-MiniLM-L6-v2 by default
- Models are downloaded on first use and cached locally
- No API keys or network access are required after the initial download
- GPU acceleration typically provides a 5-10x speedup over CPU
- Hugging Face cache: `~/.cache/torch/sentence_transformers/`
- ModelScope cache: `~/.cache/modelscope/hub/`
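To check how much disk the cached models occupy, here is a small standard-library sketch (the path below is the Hugging Face cache location listed above; adjust it for your setup):

```python
import os

def dir_size_mb(path):
    """Sum the sizes of all files under path, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

cache_dir = os.path.expanduser("~/.cache/torch/sentence_transformers/")
if os.path.isdir(cache_dir):
    print(f"Model cache size: {dir_size_mb(cache_dir):.1f} MB")
else:
    print("No sentence-transformers cache found yet")
```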
## See Also