Sparse vectors store only non-zero values as key-value pairs, making them efficient for lexical search where only a few dimensions are active. They excel at exact keyword matching and complement dense vectors in hybrid search.

Understanding Sparse Vectors

Unlike dense vectors that have values in every dimension, sparse vectors are represented as dictionaries mapping dimension indices to non-zero weights:
# Dense vector (all dimensions present)
dense = [0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2, ...]

# Sparse vector (only non-zero dimensions)
sparse = {0: 0.1, 3: 0.3, 6: 0.2}

When to Use Sparse Vectors

  • Keyword matching: Exact term matching for domain-specific queries
  • BM25 retrieval: Traditional IR ranking for document search
  • Hybrid search: Combine with dense vectors for best of both worlds
  • High-dimensional spaces: Efficient when most dimensions are zero
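The last point can be made concrete with a similarity computation: in a sparse dot product, only dimensions present in both vectors contribute, so the cost scales with the number of non-zero entries rather than the full dimensionality. A minimal sketch in plain Python (illustrative only, independent of zvec):

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors; only shared dimensions contribute."""
    # Iterate over the smaller dict so cost tracks the sparser vector
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)

query = {0: 0.1, 3: 0.3, 6: 0.2}
doc = {3: 2.0, 5: 1.0, 6: 0.5}
print(sparse_dot(query, doc))  # 0.3*2.0 + 0.2*0.5 = 0.7
```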

Creating a Sparse Vector Schema

1. Define the Sparse Vector Field

Sparse vectors don’t require specifying dimensions upfront:
from zvec import VectorSchema, DataType, HnswIndexParam

sparse_field = VectorSchema(
    name="bm25",
    data_type=DataType.SPARSE_VECTOR_FP32,
    # No dimension parameter needed
    index_param=HnswIndexParam()
)

2. Create Collection with Sparse Vectors

from zvec import CollectionSchema, FieldSchema, DataType
import zvec

schema = CollectionSchema(
    name="documents",
    fields=[
        FieldSchema("id", DataType.INT64),
        FieldSchema("text", DataType.STRING)
    ],
    vectors=sparse_field
)

zvec.init()
collection = zvec.create_and_open("./sparse_collection", schema)

Generating Sparse Vectors with BM25

Zvec provides built-in BM25 embedding for efficient sparse vector generation:
1. Import BM25 Embedding Function

from zvec.extension import BM25EmbeddingFunction

2. Option 1: Use Built-in Encoder

No training required, works out of the box:
# For queries (shorter text)
bm25_query = BM25EmbeddingFunction(
    language="en",           # "en" or "zh"
    encoding_type="query"
)

# For documents (longer text)
bm25_doc = BM25EmbeddingFunction(
    language="en",
    encoding_type="document"
)

3. Option 2: Train Custom Encoder

Better accuracy for domain-specific vocabulary:
# Your document corpus
corpus = [
    "Machine learning algorithms",
    "Natural language processing",
    "Vector database technology",
    # ... more documents
]

# Train BM25 on your corpus
bm25_custom = BM25EmbeddingFunction(
    corpus=corpus,
    encoding_type="query",
    b=0.75,   # Length normalization
    k1=1.2    # Term frequency saturation
)

4. Generate Sparse Embeddings

# Generate sparse vector
query_text = "machine learning algorithms"
sparse_vector = bm25_query.embed(query_text)

print(sparse_vector)
# Output: {1169440797: 0.29, 2045788977: 0.70, ...}

Inserting Sparse Vectors

With BM25 Embeddings

from zvec import Doc

documents = [
    "Machine learning is a subset of AI",
    "Natural language processing understands text",
    "Vector databases store embeddings efficiently"
]

# Generate and insert documents
bm25_doc = BM25EmbeddingFunction(language="en", encoding_type="document")

docs = []
for i, text in enumerate(documents):
    sparse_vec = bm25_doc.embed(text)
    doc = Doc(
        id=f"doc_{i}",
        fields={"id": i, "text": text},
        vectors={"bm25": sparse_vec}
    )
    docs.append(doc)

collection.insert(docs)

Manual Sparse Vector Creation

# Create sparse vector manually (useful for custom algorithms)
sparse_vec = {
    100: 1.5,   # Term ID 100 with weight 1.5
    250: 2.3,
    890: 0.8
}

doc = Doc(
    id="custom_001",
    fields={"id": 1, "text": "custom document"},
    vectors={"bm25": sparse_vec}
)

collection.insert(doc)

Querying Sparse Vectors

from zvec import VectorQuery

# Generate query sparse vector
query_text = "machine learning algorithms"
bm25_query = BM25EmbeddingFunction(language="en", encoding_type="query")
query_sparse = bm25_query.embed(query_text)

# Search
results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    topk=10
)

for doc in results:
    print(f"{doc.id}: {doc.field('text')}")
    print(f"Score: {doc.score}\n")

With Metadata Filtering

results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    filter="id > 100",
    topk=10
)

BM25 Parameters

Fine-tune BM25 behavior for your data:
bm25 = BM25EmbeddingFunction(
    corpus=my_corpus,
    encoding_type="query",
    b=0.75,    # Length normalization [0.0 - 1.0]
               # 0 = no normalization
               # 1 = full normalization
    k1=1.2     # Term frequency saturation
               # Higher = more weight on term frequency
)

Parameter Guidelines

Default values (b=0.75, k1=1.2) work well for most cases. Adjust if:
  • Documents vary greatly in length: Increase b (0.8-1.0)
  • Short queries/documents: Decrease b (0.5-0.7)
  • Repeated terms are important: Increase k1 (1.5-2.0)
  • Presence matters more than frequency: Decrease k1 (0.5-1.0)
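To see what k1 controls, here is the term-frequency factor from the textbook Okapi BM25 formula (a sketch of the standard formula, not zvec's internal implementation):

```python
def tf_factor(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    # BM25 term-frequency saturation: grows with tf but plateaus near k1 + 1;
    # dl/avgdl is the document length relative to the corpus average
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

for tf in (1, 2, 5, 20):
    print(tf, round(tf_factor(tf), 3))
# With the defaults the factor saturates toward k1 + 1 = 2.2 as tf grows;
# a higher k1 lets repeated terms keep adding weight for longer.
```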

Chinese Text Support

BM25 works seamlessly with Chinese text:
# Built-in Chinese encoder
bm25_zh = BM25EmbeddingFunction(
    language="zh",
    encoding_type="query"
)

# Generate Chinese sparse vector
query = "机器学习算法"
sparse_vec = bm25_zh.embed(query)

# Or train on Chinese corpus
chinese_corpus = [
    "机器学习是人工智能的一个重要分支",
    "深度学习使用多层神经网络",
    "自然语言处理技术"
]

bm25_custom_zh = BM25EmbeddingFunction(
    corpus=chinese_corpus,
    encoding_type="document"
)

Sparse Vector Storage

Sparse vectors are stored efficiently:
# Only non-zero dimensions are stored
sparse = {10: 0.5, 100: 1.2, 5000: 0.8}
# Storage: 3 index/value pairs (~24 bytes with 32-bit indices and fp32 values)

# Dense equivalent would need:
dense = [0] * 5001  # 20KB+ of mostly zeros

Storage Trade-offs

Vector Type      Average Size        Best For
Sparse (BM25)    50-200 entries      Keyword search, long documents
Dense            Fixed (all dims)    Semantic similarity
Sparse vectors with thousands of non-zero entries may perform worse than dense vectors. BM25 typically produces 50-200 non-zero entries per document.
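If a custom encoder produces very dense output, one option is to prune low-weight entries before insertion. This helper is purely illustrative (pruning is not a zvec feature; `top_n` is a name chosen here for the sketch):

```python
def prune_sparse(vec: dict, top_n: int = 200) -> dict:
    """Keep only the top_n highest-weight entries of a sparse vector."""
    if len(vec) <= top_n:
        return vec
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)

vec = {i: float(i % 7) for i in range(1000)}
print(len(prune_sparse(vec, top_n=200)))  # 200
```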

Data Type Options

# 32-bit float (recommended for BM25)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)

# 16-bit float (2x storage savings)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP16)

Performance Considerations

1. Batch Encode for Speed

# Slower: construct a new encoder for every text
for text in texts:
    sparse_vec = BM25EmbeddingFunction(language="en").embed(text)

# Faster: reuse one encoder so repeated texts hit its cache
bm25 = BM25EmbeddingFunction(language="en")
vectors = [bm25.embed(text) for text in texts]

2. Index Configuration

Sparse vectors benefit from proper indexing:
from zvec import HnswIndexParam

# Recommended for sparse vectors
index_param = HnswIndexParam(
    ef_construction=100,  # Lower than dense vectors
    m=8                   # Fewer connections needed
)

3. Cache Embeddings

BM25 results are cached automatically:
# First call: computes
vec1 = bm25.embed("machine learning")

# Second call: cached (instant)
vec2 = bm25.embed("machine learning")

Common Patterns

Combining Dense and Sparse

Store both vector types for hybrid search:
schema = CollectionSchema(
    name="hybrid_collection",
    fields=[FieldSchema("id", DataType.INT64)],
    vectors=[
        VectorSchema("dense", DataType.VECTOR_FP32, dimension=768),
        VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)
    ]
)
See the Hybrid Search guide for details.

Document and Query Encoding

# Index time: encode documents
bm25_doc = BM25EmbeddingFunction(encoding_type="document")
for text in documents:
    doc_vector = bm25_doc.embed(text)
    # Store doc_vector

# Query time: encode queries
bm25_query = BM25EmbeddingFunction(encoding_type="query")
query_vector = bm25_query.embed(user_query)

Using separate encoding_type values for documents and queries improves ranking quality by applying different term-weighting strategies.
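As an illustration of why the split matters: in the textbook Okapi BM25 factorization, the document side carries the tf-saturation weight and the query side carries the IDF weight, so a plain sparse dot product reproduces the BM25 score. A self-contained sketch of that standard formula (not zvec's internals; real tokenization and term hashing are omitted):

```python
import math

corpus = [
    "machine learning is a subset of ai",
    "natural language processing understands text",
    "vector databases store embeddings efficiently",
]
docs = [d.split() for d in corpus]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = {}  # document frequency per term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def idf(t):
    # Robertson/Sparck Jones IDF used by BM25
    n = df.get(t, 0)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def encode_document(tokens, k1=1.2, b=0.75):
    # Document side: tf-saturation weight per term
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    dl = len(tokens)
    return {t: f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
            for t, f in tf.items()}

def encode_query(tokens):
    # Query side: IDF weight per term
    return {t: idf(t) for t in set(tokens)}

q = encode_query("machine learning".split())
d0 = encode_document(docs[0])
score = sum(w * d0.get(t, 0.0) for t, w in q.items())
print(round(score, 3))  # the dot product is exactly the BM25 score of doc 0
```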
