Sparse vectors store only non-zero values as key-value pairs, making them efficient for lexical search where only a few dimensions are active. They excel at exact keyword matching and complement dense vectors in hybrid search.
Understanding Sparse Vectors
Unlike dense vectors that have values in every dimension, sparse vectors are represented as dictionaries mapping dimension indices to non-zero weights:
```python
# Dense vector (all dimensions present)
dense = [0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2, ...]

# Sparse vector (only non-zero dimensions)
sparse = {0: 0.1, 3: 0.3, 6: 0.2}
```
When to Use Sparse Vectors
- Keyword matching: Exact term matching for domain-specific queries
- BM25 retrieval: Traditional IR ranking for document search
- Hybrid search: Combine with dense vectors for best of both worlds
- High-dimensional spaces: Efficient when most dimensions are zero
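The efficiency claim above can be sketched with plain Python dictionaries (this is an illustration, not part of the zvec API): similarity between two sparse vectors is a dot product over only the dimensions both vectors share, so cost scales with the number of non-zero entries rather than the full dimensionality.

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    # Iterate over the smaller dict so cost is bounded by the sparser vector
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

query = {0: 0.1, 3: 0.3, 6: 0.2}
doc = {3: 1.0, 6: 0.5, 9: 2.0}
print(sparse_dot(query, doc))  # 0.3*1.0 + 0.2*0.5
```

Only dimensions 3 and 6 contribute; dimensions present in just one vector are skipped entirely.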
Creating a Sparse Vector Schema
Define the Sparse Vector Field
Sparse vectors don't require specifying dimensions upfront:

```python
from zvec import VectorSchema, DataType, HnswIndexParam

sparse_field = VectorSchema(
    name="bm25",
    data_type=DataType.SPARSE_VECTOR_FP32,
    # No dimension parameter needed
    index_param=HnswIndexParam()
)
```
Create Collection with Sparse Vectors
```python
import zvec
from zvec import CollectionSchema, FieldSchema

schema = CollectionSchema(
    name="documents",
    fields=[
        FieldSchema("id", DataType.INT64),
        FieldSchema("text", DataType.STRING)
    ],
    vectors=sparse_field
)

zvec.init()
collection = zvec.create_and_open("./sparse_collection", schema)
```
Generating Sparse Vectors with BM25
Zvec provides built-in BM25 embedding for efficient sparse vector generation:
Import BM25 Embedding Function
```python
from zvec.extension import BM25EmbeddingFunction
```
Option 1: Use Built-in Encoder
No training required; works out of the box:

```python
# For queries (shorter text)
bm25_query = BM25EmbeddingFunction(
    language="en",  # "en" or "zh"
    encoding_type="query"
)

# For documents (longer text)
bm25_doc = BM25EmbeddingFunction(
    language="en",
    encoding_type="document"
)
```
Option 2: Train Custom Encoder
Better accuracy for domain-specific vocabulary:

```python
# Your document corpus
corpus = [
    "Machine learning algorithms",
    "Natural language processing",
    "Vector database technology",
    # ... more documents
]

# Train BM25 on your corpus
bm25_custom = BM25EmbeddingFunction(
    corpus=corpus,
    encoding_type="query",
    b=0.75,  # Length normalization
    k1=1.2   # Term frequency saturation
)
```
Generate Sparse Embeddings
```python
# Generate a sparse vector
query_text = "machine learning algorithms"
sparse_vector = bm25_query.embed(query_text)

print(sparse_vector)
# Output: {1169440797: 0.29, 2045788977: 0.70, ...}
```
Inserting Sparse Vectors
With BM25 Embeddings
```python
from zvec import Doc

documents = [
    "Machine learning is a subset of AI",
    "Natural language processing understands text",
    "Vector databases store embeddings efficiently"
]

# Generate embeddings and insert documents
bm25_doc = BM25EmbeddingFunction(language="en", encoding_type="document")

docs = []
for i, text in enumerate(documents):
    sparse_vec = bm25_doc.embed(text)
    doc = Doc(
        id=f"doc_{i}",
        fields={"id": i, "text": text},
        vectors={"bm25": sparse_vec}
    )
    docs.append(doc)

collection.insert(docs)
```
Manual Sparse Vector Creation
```python
# Create a sparse vector manually (useful for custom algorithms)
sparse_vec = {
    100: 1.5,  # Term ID 100 with weight 1.5
    250: 2.3,
    890: 0.8
}

doc = Doc(
    id="custom_001",
    fields={"id": 1, "text": "custom document"},
    vectors={"bm25": sparse_vec}
)
collection.insert(doc)
```
Querying Sparse Vectors
Basic Sparse Search
```python
from zvec import VectorQuery

# Generate the query's sparse vector
query_text = "machine learning algorithms"
bm25_query = BM25EmbeddingFunction(language="en", encoding_type="query")
query_sparse = bm25_query.embed(query_text)

# Search
results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    topk=10
)

for doc in results:
    print(f"{doc.id}: {doc.field('text')}")
    print(f"Score: {doc.score}\n")

# Combine sparse search with an attribute filter
results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    filter="id > 100",
    topk=10
)
```
BM25 Parameters
Fine-tune BM25 behavior for your data:
```python
bm25 = BM25EmbeddingFunction(
    corpus=my_corpus,
    encoding_type="query",
    b=0.75,  # Length normalization [0.0 - 1.0]
             # 0 = no normalization, 1 = full normalization
    k1=1.2   # Term frequency saturation
             # Higher = more weight on term frequency
)
```
Parameter Guidelines
Default values (b=0.75, k1=1.2) work well for most cases. Adjust if:
- Documents vary greatly in length: increase b (0.8-1.0)
- Short queries/documents: decrease b (0.5-0.7)
- Repeated terms are important: increase k1 (1.5-2.0)
- Presence matters more than frequency: decrease k1 (0.5-1.0)
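To see why these adjustments work, here is a sketch of the standard BM25 term weight (the textbook formula, not zvec's internal implementation): b scales the document-length penalty, and k1 controls how quickly repeated occurrences of a term stop adding score.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 weight for one term in one document."""
    # Inverse document frequency: rarer terms score higher
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization, controlled by b
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation, controlled by k1
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# With the default k1=1.2, extra occurrences give diminishing returns:
w1 = bm25_term_weight(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
w5 = bm25_term_weight(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
# w5 is larger than w1, but well under 5 * w1
```

Raising k1 moves the curve closer to linear in term frequency; lowering b toward 0 removes the penalty on long documents.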
Chinese Text Support
BM25 works seamlessly with Chinese text:
```python
# Built-in Chinese encoder
bm25_zh = BM25EmbeddingFunction(
    language="zh",
    encoding_type="query"
)

# Generate a Chinese sparse vector
query = "机器学习算法"
sparse_vec = bm25_zh.embed(query)

# Or train on a Chinese corpus
chinese_corpus = [
    "机器学习是人工智能的一个重要分支",
    "深度学习使用多层神经网络",
    "自然语言处理技术"
]
bm25_custom_zh = BM25EmbeddingFunction(
    corpus=chinese_corpus,
    encoding_type="document"
)
```
Sparse Vector Storage
Sparse vectors are stored efficiently:
```python
# Only non-zero dimensions are stored
sparse = {10: 0.5, 100: 1.2, 5000: 0.8}
# Storage: 3 key-value pairs (12-24 bytes)

# A dense equivalent would need every dimension:
dense = [0] * 5001  # 20KB+ of mostly zeros
```
Storage Trade-offs
| Vector Type | Average Size | Best For |
|---|---|---|
| Sparse (BM25) | 50-200 entries | Keyword search, long documents |
| Dense | Fixed (all dims) | Semantic similarity |
Sparse vectors with thousands of non-zero entries may perform worse than dense vectors. BM25 typically produces 50-200 non-zero entries per document.
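A back-of-envelope sketch makes the trade-off concrete (assuming, illustratively, that each sparse entry costs a 4-byte index plus a 4-byte fp32 weight, and dense fp32 costs 4 bytes per dimension):

```python
def sparse_bytes(n_entries: int) -> int:
    """Approximate storage for a sparse fp32 vector: 4-byte index + 4-byte weight."""
    return n_entries * (4 + 4)

def dense_bytes(dimension: int) -> int:
    """Approximate storage for a dense fp32 vector."""
    return dimension * 4

# A typical BM25 document (~150 entries) vs a dense vector over a 30k-term vocabulary
print(sparse_bytes(150))    # 1200 bytes
print(dense_bytes(30_000))  # 120000 bytes
```

At 50-200 entries the sparse form is roughly two orders of magnitude smaller; the advantage shrinks as the number of non-zero entries grows.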
Data Type Options
```python
# 32-bit float (recommended for BM25)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)

# 16-bit float (2x storage savings)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP16)
```
Batch Encode for Speed
```python
# Slower: constructing a new encoder for every text discards the cache
for text in texts:
    sparse_vec = BM25EmbeddingFunction(language="en").embed(text)

# Faster: reuse one encoder so repeated terms hit its internal cache
bm25 = BM25EmbeddingFunction(language="en")
vectors = [bm25.embed(text) for text in texts]
```
Index Configuration
Sparse vectors benefit from proper indexing:

```python
from zvec import HnswIndexParam

# Recommended starting point for sparse vectors
index_param = HnswIndexParam(
    ef_construction=100,  # Lower than for dense vectors
    m=8                   # Fewer connections needed
)
```
Cache Embeddings
BM25 results are cached automatically:

```python
# First call: computes the embedding
vec1 = bm25.embed("machine learning")

# Second call: served from the cache (near-instant)
vec2 = bm25.embed("machine learning")
```
Common Patterns
Combining Dense and Sparse
Store both vector types for hybrid search:
```python
schema = CollectionSchema(
    name="hybrid_collection",
    fields=[FieldSchema("id", DataType.INT64)],
    vectors=[
        VectorSchema("dense", DataType.VECTOR_FP32, dimension=768),
        VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)
    ]
)
```
See the Hybrid Search guide for details.
Document and Query Encoding
```python
# Index time: encode documents
bm25_doc = BM25EmbeddingFunction(encoding_type="document")
for text in documents:
    doc_vector = bm25_doc.embed(text)
    # Store doc_vector

# Query time: encode queries
bm25_query = BM25EmbeddingFunction(encoding_type="query")
query_vector = bm25_query.embed(user_query)
```
Using a separate encoding_type for documents and queries improves ranking quality, because each side gets a term-weighting strategy suited to its typical length.
Next Steps