Sparse vectors store only non-zero values as key-value pairs, making them efficient for lexical search where only a few dimensions are active. They excel at exact keyword matching and complement dense vectors in hybrid search.
Understanding Sparse Vectors
Unlike dense vectors that have values in every dimension, sparse vectors are represented as dictionaries mapping dimension indices to non-zero weights:
```python
# Dense vector (all dimensions present)
dense = [0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2, ...]

# Sparse vector (only non-zero dimensions)
sparse = {0: 0.1, 3: 0.3, 6: 0.2}
```
When to Use Sparse Vectors
- Keyword matching: Exact term matching for domain-specific queries
- BM25 retrieval: Traditional IR ranking for document search
- Hybrid search: Combine with dense vectors for best of both worlds
- High-dimensional spaces: Efficient when most dimensions are zero
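The efficiency claim above can be sketched with plain Python dictionaries (this is an illustration, not part of the zvec API): similarity between two sparse vectors is a dot product over only the dimensions both vectors share, so cost scales with the number of non-zero entries rather than the full dimensionality.

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    # Iterate over the smaller dict so cost is bounded by the sparser vector
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

query = {0: 0.1, 3: 0.3, 6: 0.2}
doc = {3: 1.0, 6: 0.5, 9: 2.0}
print(sparse_dot(query, doc))  # 0.3*1.0 + 0.2*0.5
```

Only dimensions 3 and 6 contribute; dimensions present in just one vector are skipped entirely.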
Creating a Sparse Vector Schema
Define the Sparse Vector Field
Sparse vectors don't require specifying dimensions upfront:

```python
from zvec import VectorSchema, DataType, HnswIndexParam

sparse_field = VectorSchema(
    name="bm25",
    data_type=DataType.SPARSE_VECTOR_FP32,
    # No dimension parameter needed
    index_param=HnswIndexParam()
)
```
Create Collection with Sparse Vectors
```python
import zvec
from zvec import CollectionSchema, FieldSchema

schema = CollectionSchema(
    name="documents",
    fields=[
        FieldSchema("id", DataType.INT64),
        FieldSchema("text", DataType.STRING)
    ],
    vectors=sparse_field
)

zvec.init()
collection = zvec.create_and_open("./sparse_collection", schema)
```
Generating Sparse Vectors with BM25
Zvec provides built-in BM25 embedding for efficient sparse vector generation:
Import BM25 Embedding Function
```python
from zvec.extension import BM25EmbeddingFunction
```
Option 1: Use Built-in Encoder
No training required; works out of the box:

```python
# For queries (shorter text)
bm25_query = BM25EmbeddingFunction(
    language="en",  # "en" or "zh"
    encoding_type="query"
)

# For documents (longer text)
bm25_doc = BM25EmbeddingFunction(
    language="en",
    encoding_type="document"
)
```
Option 2: Train Custom Encoder
Better accuracy for domain-specific vocabulary:

```python
# Your document corpus
corpus = [
    "Machine learning algorithms",
    "Natural language processing",
    "Vector database technology",
    # ... more documents
]

# Train BM25 on your corpus
bm25_custom = BM25EmbeddingFunction(
    corpus=corpus,
    encoding_type="query",
    b=0.75,  # Length normalization
    k1=1.2   # Term frequency saturation
)
```
Generate Sparse Embeddings
```python
# Generate a sparse vector
query_text = "machine learning algorithms"
sparse_vector = bm25_query.embed(query_text)

print(sparse_vector)
# Output: {1169440797: 0.29, 2045788977: 0.70, ...}
```
Inserting Sparse Vectors
With BM25 Embeddings
```python
from zvec import Doc

documents = [
    "Machine learning is a subset of AI",
    "Natural language processing understands text",
    "Vector databases store embeddings efficiently"
]

# Generate embeddings and insert documents
bm25_doc = BM25EmbeddingFunction(language="en", encoding_type="document")

docs = []
for i, text in enumerate(documents):
    sparse_vec = bm25_doc.embed(text)
    doc = Doc(
        id=f"doc_{i}",
        fields={"id": i, "text": text},
        vectors={"bm25": sparse_vec}
    )
    docs.append(doc)

collection.insert(docs)
```
Manual Sparse Vector Creation
```python
# Create a sparse vector manually (useful for custom algorithms)
sparse_vec = {
    100: 1.5,  # Term ID 100 with weight 1.5
    250: 2.3,
    890: 0.8
}

doc = Doc(
    id="custom_001",
    fields={"id": 1, "text": "custom document"},
    vectors={"bm25": sparse_vec}
)
collection.insert(doc)
```
Querying Sparse Vectors
Basic Sparse Search
```python
from zvec import VectorQuery

# Generate the query's sparse vector
query_text = "machine learning algorithms"
bm25_query = BM25EmbeddingFunction(language="en", encoding_type="query")
query_sparse = bm25_query.embed(query_text)

# Search
results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    topk=10
)

for doc in results:
    print(f"{doc.id}: {doc.field('text')}")
    print(f"Score: {doc.score}\n")

# Combine sparse search with an attribute filter
results = collection.query(
    VectorQuery(
        field_name="bm25",
        vector=query_sparse
    ),
    filter="id > 100",
    topk=10
)
```
BM25 Parameters
Fine-tune BM25 behavior for your data:
```python
bm25 = BM25EmbeddingFunction(
    corpus=my_corpus,
    encoding_type="query",
    b=0.75,  # Length normalization [0.0 - 1.0]
             # 0 = no normalization, 1 = full normalization
    k1=1.2   # Term frequency saturation
             # Higher = more weight on term frequency
)
```
Parameter Guidelines
Default values (b=0.75, k1=1.2) work well for most cases. Adjust if:
- Documents vary greatly in length: increase b (0.8-1.0)
- Short queries/documents: decrease b (0.5-0.7)
- Repeated terms are important: increase k1 (1.5-2.0)
- Presence matters more than frequency: decrease k1 (0.5-1.0)
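To see why these adjustments work, here is a sketch of the standard BM25 term weight (the textbook formula, not zvec's internal implementation): b scales the document-length penalty, and k1 controls how quickly repeated occurrences of a term stop adding score.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 weight for one term in one document."""
    # Inverse document frequency: rarer terms score higher
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization, controlled by b
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation, controlled by k1
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# With the default k1=1.2, extra occurrences give diminishing returns:
w1 = bm25_term_weight(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
w5 = bm25_term_weight(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
# w5 is larger than w1, but well under 5 * w1
```

Raising k1 moves the curve closer to linear in term frequency; lowering b toward 0 removes the penalty on long documents.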
Chinese Text Support
BM25 works seamlessly with Chinese text:
```python
# Built-in Chinese encoder
bm25_zh = BM25EmbeddingFunction(
    language="zh",
    encoding_type="query"
)

# Generate a Chinese sparse vector
query = "机器学习算法"
sparse_vec = bm25_zh.embed(query)

# Or train on a Chinese corpus
chinese_corpus = [
    "机器学习是人工智能的一个重要分支",
    "深度学习使用多层神经网络",
    "自然语言处理技术"
]
bm25_custom_zh = BM25EmbeddingFunction(
    corpus=chinese_corpus,
    encoding_type="document"
)
```
Sparse Vector Storage
Sparse vectors are stored efficiently:
```python
# Only non-zero dimensions are stored
sparse = {10: 0.5, 100: 1.2, 5000: 0.8}
# Storage: 3 key-value pairs (12-24 bytes)

# A dense equivalent would need every dimension:
dense = [0] * 5001  # 20KB+ of mostly zeros
```
Storage Trade-offs
| Vector Type | Average Size | Best For |
|---|---|---|
| Sparse (BM25) | 50-200 entries | Keyword search, long documents |
| Dense | Fixed (all dims) | Semantic similarity |
Sparse vectors with thousands of non-zero entries may perform worse than dense vectors. BM25 typically produces 50-200 non-zero entries per document.
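A back-of-envelope sketch makes the trade-off concrete (assuming, illustratively, that each sparse entry costs a 4-byte index plus a 4-byte fp32 weight, and dense fp32 costs 4 bytes per dimension):

```python
def sparse_bytes(n_entries: int) -> int:
    """Approximate storage for a sparse fp32 vector: 4-byte index + 4-byte weight."""
    return n_entries * (4 + 4)

def dense_bytes(dimension: int) -> int:
    """Approximate storage for a dense fp32 vector."""
    return dimension * 4

# A typical BM25 document (~150 entries) vs a dense vector over a 30k-term vocabulary
print(sparse_bytes(150))    # 1200 bytes
print(dense_bytes(30_000))  # 120000 bytes
```

At 50-200 entries the sparse form is roughly two orders of magnitude smaller; the advantage shrinks as the number of non-zero entries grows.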
Data Type Options
```python
# 32-bit float (recommended for BM25)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)

# 16-bit float (2x storage savings)
VectorSchema("sparse", DataType.SPARSE_VECTOR_FP16)
```
Batch Encode for Speed
```python
# Slower: constructing a new encoder for every text discards the cache
for text in texts:
    sparse_vec = BM25EmbeddingFunction(language="en").embed(text)

# Faster: reuse one encoder so repeated terms hit its internal cache
bm25 = BM25EmbeddingFunction(language="en")
vectors = [bm25.embed(text) for text in texts]
```
Index Configuration
Sparse vectors benefit from proper indexing:

```python
from zvec import HnswIndexParam

# Recommended starting point for sparse vectors
index_param = HnswIndexParam(
    ef_construction=100,  # Lower than for dense vectors
    m=8                   # Fewer connections needed
)
```
Cache Embeddings
BM25 results are cached automatically:

```python
# First call: computes the embedding
vec1 = bm25.embed("machine learning")

# Second call: served from the cache (near-instant)
vec2 = bm25.embed("machine learning")
```
Common Patterns
Combining Dense and Sparse
Store both vector types for hybrid search:
```python
schema = CollectionSchema(
    name="hybrid_collection",
    fields=[FieldSchema("id", DataType.INT64)],
    vectors=[
        VectorSchema("dense", DataType.VECTOR_FP32, dimension=768),
        VectorSchema("sparse", DataType.SPARSE_VECTOR_FP32)
    ]
)
```
See the Hybrid Search guide for details.
Document and Query Encoding
```python
# Index time: encode documents
bm25_doc = BM25EmbeddingFunction(encoding_type="document")
for text in documents:
    doc_vector = bm25_doc.embed(text)
    # Store doc_vector

# Query time: encode queries
bm25_query = BM25EmbeddingFunction(encoding_type="query")
query_vector = bm25_query.embed(user_query)
```
Using a separate encoding_type for documents and queries improves ranking quality, because each side gets a term-weighting strategy suited to its typical length.
Next Steps