Overview
BM25EmbeddingFunction provides text-to-sparse-vector embedding using the BM25 (Best Matching 25) algorithm. BM25 is a probabilistic retrieval function used for lexical search and document ranking based on term frequency and inverse document frequency.
Location: python/zvec/extension/bm25_embedding_function.py:24
Installation
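The only third-party runtime dependency is the dashtext package (see Notes below); a typical install:

```shell
# dashtext provides the underlying BM25 encoders (CPU-only, no API key needed).
pip install dashtext
```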
Class Definition
Constructor
Parameters
corpus (Optional[List[str]]): List of documents used to train the BM25 encoder. If provided, a custom encoder is trained on this corpus for better domain-specific accuracy. If None, the built-in encoder is used.

encoding_type (str): Encoding mode for text processing:
- "query": Optimized for search queries (default)
- "document": Optimized for document indexing

language (str): Language for the built-in encoder (only used when corpus is None):
- "zh": Chinese (trained on Chinese Wikipedia)
- "en": English

b (float): Document length normalization parameter for BM25. Range [0, 1]:
- 0: No normalization
- 1: Full normalization

k1 (float): Term frequency saturation parameter for BM25. Higher values give more weight to term frequency. Only used with a custom corpus.

**kwargs: Additional parameters for DashText encoder customization.
Properties
corpus_size
encoding_type
language
Methods
embed()
Parameters:
- input (str): Input text string to embed. Must be non-empty.

Returns:
- SparseVectorType: Dictionary mapping vocabulary term index to BM25 score. Only non-zero scores are included, sorted by index for consistency.

Raises:
- TypeError: If input is not a string
- ValueError: If input is empty or whitespace-only
- RuntimeError: If BM25 encoding fails
The embed() method is cached with an LRU cache (maxsize=10) for performance.

Usage Examples
Option 1: Built-in Encoder
Use pre-trained encoders without providing a corpus:

Chinese (Built-in)
Chinese (Document)
English (Built-in)
Option 2: Custom Corpus
Train on your own corpus for domain-specific accuracy:

Asymmetric Retrieval
Callable Interface
Hybrid Search
Combine with dense embeddings for optimal retrieval.

BM25 Parameters
k1 (Term Frequency Saturation)
- Range: [1.2, 2.0] typically
- Default: 1.2
- Effect:
- Lower values: Less emphasis on term frequency
- Higher values: More emphasis on term frequency
- Use higher k1 for long documents
b (Document Length Normalization)
- Range: [0, 1]
- Default: 0.75
- Effect:
- b = 0: No normalization (document length ignored)
- b = 1: Full normalization (penalize long documents)
- b = 0.75: Balanced (common default)
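The effect of both parameters can be checked directly against the standard BM25 term-score formula (Robertson & Zaragoza, 2009). This self-contained sketch is illustrative and independent of the zvec/DashText implementation:

```python
import math

def bm25_term_score(tf, dl, avgdl, k1=1.2, b=0.75, n_docs=1000, df=50):
    """Score contribution of one term: idf * tf*(k1+1) / (tf + k1*norm)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = 1.0 - b + b * (dl / avgdl)  # document-length normalization
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# Higher k1: repeated terms keep adding weight (slower saturation).
assert bm25_term_score(tf=5, dl=100, avgdl=100, k1=2.0) > \
       bm25_term_score(tf=5, dl=100, avgdl=100, k1=1.2)

# b = 1 penalizes longer-than-average documents; b = 0 ignores length.
assert bm25_term_score(tf=5, dl=400, avgdl=100, b=0.0) > \
       bm25_term_score(tf=5, dl=400, avgdl=100, b=1.0)
```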
Example: Tuning Parameters
Built-in vs Custom Encoder
Built-in Encoder

Advantages:
- No corpus needed
- Works out of the box
- Good generalization
- Pre-trained on Wikipedia

Best for:
- General-purpose search
- Quick prototyping
- When you don't have a corpus

Custom Encoder

Advantages:
- Trained on your own corpus, so scores reflect domain-specific vocabulary and term statistics

Best for:
- Specialized domains with custom vocabularies
Error Handling
Best Practices
Encoding Types: Use the appropriate encoding type:
- encoding_type="query" for short search queries
- encoding_type="document" for longer documents being indexed
Performance Characteristics
- Memory: O(vocabulary_size) for encoder
- Encoding Speed: ~1000-5000 docs/sec
- Output Size: ~50-200 non-zero dimensions per text
- Caching: Results cached (maxsize=10)
- No GPU: Runs on CPU only (DashText limitation)
Use Cases
Keyword Search
Exact term matching and lexical search
Document Ranking
Traditional information retrieval and BM25 scoring
Hybrid Retrieval
Combining with dense embeddings for best results
Domain-Specific
Custom vocabularies for specialized domains
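The Hybrid Retrieval pattern above can be sketched as simple score fusion. This shows one common strategy (min-max normalization plus a weighted sum) and is independent of any particular library; the per-document scores are hypothetical:

```python
# Hypothetical per-document scores from a sparse (BM25) retriever and a
# dense (embedding-similarity) retriever.
candidates = {
    "doc_a": {"sparse": 8.2, "dense": 0.41},
    "doc_b": {"sparse": 3.1, "dense": 0.88},
    "doc_c": {"sparse": 6.5, "dense": 0.72},
}

def fuse(scores, alpha=0.5):
    """Min-max normalize each channel, then take a weighted sum."""
    def normalize(channel):
        vals = {d: s[channel] for d, s in scores.items()}
        lo, hi = min(vals.values()), max(vals.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0
                for d, v in vals.items()}
    sparse, dense = normalize("sparse"), normalize("dense")
    return {d: alpha * sparse[d] + (1 - alpha) * dense[d] for d in scores}

ranked = sorted(fuse(candidates).items(), key=lambda kv: kv[1], reverse=True)
# doc_c scores well on both channels, so it ranks first here.
```

Raising alpha favors exact keyword matches; lowering it favors semantic similarity.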
Notes
- Requires Python 3.10, 3.11, or 3.12
- Requires the dashtext package: pip install dashtext
- No API key or network required (local computation)
- Results are cached (LRU cache, maxsize=10)
- Output is sorted by indices for consistency
- DashText handles Chinese/English text segmentation automatically
See Also
- SparseEmbeddingFunction - Base class documentation
- DefaultLocalSparseEmbedding - SPLADE-based sparse embedding
- QwenSparseEmbedding - API-based sparse embedding
- DefaultLocalDenseEmbedding - Dense embedding for hybrid search
- Embedding Functions Overview
References
- DashText Documentation
- DashText PyPI
- BM25 Algorithm: Robertson & Zaragoza (2009)