Overview
The BM25EmbeddingFunction provides text-to-sparse-vector embedding capabilities using the DashText library with the BM25 algorithm. BM25 (Best Matching 25) is a probabilistic retrieval function used for lexical search and document ranking, scoring documents by term frequency and inverse document frequency.
BM25 generates sparse vectors where each dimension corresponds to a term in the vocabulary, and the value represents the BM25 score for that term.
Use Cases
- Lexical search: Keyword matching and exact term retrieval
- Document ranking: Information retrieval and search engines
- Hybrid search: Combining with dense embeddings for improved accuracy
- Traditional IR: Tasks where exact term matching is important
Installation
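Install the DashText package from PyPI (the `dashtext` package name is taken from the references at the end of this page):

```shell
pip install dashtext
```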
Basic Usage
Built-in Encoder (No Training Required)
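A minimal sketch of the no-training path. The DashText names used here (`SparseVectorEncoder.default()`, `encode_documents`, `encode_queries`) are assumptions drawn from DashText's public documentation; verify them against your installed version:

```python
# Hedged sketch of the built-in (pre-trained) encoder path.
# The dashtext API names below are assumed, not verified against this wrapper;
# the broad except keeps the sketch runnable even if the installed API differs.
try:
    from dashtext import SparseVectorEncoder

    encoder = SparseVectorEncoder.default()  # pre-trained; no corpus needed
    doc_vec = encoder.encode_documents("BM25 is a lexical ranking function")
    query_vec = encoder.encode_queries("what is BM25")
    available = True
except Exception:
    # dashtext not installed (pip install dashtext) or API mismatch
    available = False
```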
DashText provides pre-trained BM25 encoders for Chinese and English.
Custom Corpus Training
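What BM25 training computes from a corpus can be illustrated in plain Python: a vocabulary, per-term document frequencies (for IDF), and the average document length. This is a stand-in sketch of the idea, not DashText's implementation:

```python
import math

# Stand-in illustration of BM25 corpus statistics, not the DashText internals.
corpus = [
    "sparse vectors enable lexical search",
    "dense vectors enable semantic search",
    "hybrid search combines sparse and dense vectors",
]
tokenized = [doc.split() for doc in corpus]

# Vocabulary: every distinct term gets a stable integer ID.
vocab = {t: i for i, t in enumerate(sorted({t for d in tokenized for t in d}))}

# Document frequency and the standard BM25 IDF per term.
n_docs = len(tokenized)
df = {t: sum(t in d for d in tokenized) for t in vocab}
idf = {t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5)) for t in vocab}

# Average document length, used later for length normalization.
avgdl = sum(len(d) for d in tokenized) / n_docs
```

Rare terms ("lexical" appears in one document) end up with higher IDF than ubiquitous ones ("search" appears in all three), which is what makes domain-specific training pay off.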
For domain-specific accuracy, train BM25 on your corpus.
Asymmetric Retrieval
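The idea behind asymmetric query/document encoding can be sketched in plain Python: the document vector carries the saturated, length-normalized term-frequency component, the query vector carries the IDF component, and their dot product reproduces a BM25-style score. This illustrates the concept only, not DashText's internals:

```python
def encode_document(tokens, avgdl, k1=1.2, b=0.75):
    # Document side: saturated, length-normalized term frequency per term.
    dl = len(tokens)
    vec = {}
    for t in set(tokens):
        tf = tokens.count(t)
        vec[t] = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return vec

def encode_query(tokens, idf):
    # Query side: IDF weights only; out-of-vocabulary terms are dropped.
    return {t: idf[t] for t in set(tokens) if t in idf}

idf = {"sparse": 1.2, "search": 0.3, "vectors": 0.8}  # toy IDF table
doc_vec = encode_document("sparse search with sparse vectors".split(), avgdl=5.0)
query_vec = encode_query("sparse search".split(), idf)

# The dot product of the two sides yields a BM25-style relevance score.
score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
```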
Optimize embeddings for query-document matching.
Hybrid Search
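One common fusion strategy is a weighted sum of the dense and sparse scores; the weighting below is an illustrative choice, not something this library mandates:

```python
def hybrid_score(dense_score, sparse_score, alpha=0.5):
    # Weighted fusion: alpha weights semantic (dense) similarity,
    # 1 - alpha weights the lexical (BM25 sparse) match.
    return alpha * dense_score + (1 - alpha) * sparse_score

# Toy candidates with pre-computed similarity scores from both retrievers.
candidates = {
    "doc1": {"dense": 0.82, "sparse": 0.10},
    "doc2": {"dense": 0.55, "sparse": 0.90},
}
ranked = sorted(
    candidates,
    key=lambda d: hybrid_score(
        candidates[d]["dense"], candidates[d]["sparse"], alpha=0.3
    ),
    reverse=True,
)
```

With alpha below 0.5 the lexical match dominates, so the exact-keyword hit ("doc2") outranks the semantically closer document.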
Combine BM25 sparse embeddings with dense embeddings.
Using with Zvec Collections
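A sketch of wiring sparse vectors into a collection. The client calls shown in comments are hypothetical placeholders, since the exact zvec API is not reproduced in this document; the point is only the row shape, one term-id-to-score dictionary per document:

```python
docs = ["sparse vectors enable lexical search", "BM25 ranks by term statistics"]

# Stand-in for BM25EmbeddingFunction output: term-id -> BM25 score dictionaries.
sparse_vectors = [{0: 1.2, 7: 0.4}, {1: 0.9, 7: 0.6}]

rows = [
    {"id": i, "text": text, "sparse_vector": vec}
    for i, (text, vec) in enumerate(zip(docs, sparse_vectors))
]

# Hypothetical insertion call; consult the zvec docs for the real client API:
# client = zvec.Client(...)
# client.collection("articles").insert(rows)
```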
Configuration Parameters
BM25 Parameters (Custom Corpus Only)
- b (default 0.75): Controls how much document length affects scoring
  - b=0: Disable length normalization (favors longer documents)
  - b=1: Full normalization (penalizes longer documents)
  - Typical range: 0.5-0.9
- k1 (default 1.2): Controls term frequency saturation
  - Higher values: More weight to term frequency
  - Lower values: Diminishing returns for repeated terms
  - Typical range: 1.0-2.0
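The effect of both parameters can be checked numerically against the standard BM25 term-frequency factor tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl)):

```python
def tf_factor(tf, dl, avgdl, k1=1.2, b=0.75):
    # Standard BM25 term-frequency component for a single term.
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# b: with b=1 a long document (dl > avgdl) is penalized; with b=0 it is not.
long_b0 = tf_factor(tf=2, dl=200, avgdl=100, b=0.0)
long_b1 = tf_factor(tf=2, dl=200, avgdl=100, b=1.0)

# k1: higher k1 lets repeated terms keep adding weight (slower saturation).
rep_low_k1 = tf_factor(tf=10, dl=100, avgdl=100, k1=0.5)
rep_high_k1 = tf_factor(tf=10, dl=100, avgdl=100, k1=2.0)
```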
Encoding Types
Built-in vs Custom Encoder
| Feature | Built-in Encoder | Custom Encoder |
|---|---|---|
| Setup | No training needed | Requires corpus |
| Languages | Chinese (zh), English (en) | Any |
| Use case | General purpose | Domain-specific |
| Accuracy | Good generalization | Better for specific domain |
| Training | Pre-trained on Wikipedia | Train on your corpus |
| Parameters | N/A | Customize b, k1 |
When to Use Each
Built-in encoder:
- Quick prototyping
- General-purpose search
- No domain-specific terminology
- Limited corpus available
Custom encoder:
- Domain-specific content (medical, legal, technical)
- Large corpus available (1000+ documents)
- Need optimal accuracy for your data
- Specialized vocabulary
Error Handling
Sparse Vector Format
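Because each sparse vector is a plain dictionary from vocabulary term IDs to BM25 scores, retrieval scoring reduces to a dot product over the keys the query and document share:

```python
doc_vec = {3: 1.37, 12: 0.52, 48: 0.91}    # term-id -> BM25 weight
query_vec = {3: 1.20, 48: 0.30, 99: 0.75}  # term 99 is absent from the document

# Only term IDs present in both vectors contribute; missing terms score zero.
score = sum(w * doc_vec.get(tid, 0.0) for tid, w in query_vec.items())
```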
BM25 returns sparse vectors as dictionaries mapping term IDs to scores.
Performance Tips
- Use built-in encoder for faster initialization
- Cache results: BM25 embedding is automatically cached (maxsize=10)
- Batch processing: Process multiple documents at once when possible
- Custom corpus size: Larger corpus (1000+ docs) improves accuracy
- Hybrid search: Combine with dense embeddings for best results
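The caching behavior described above can be reproduced with functools.lru_cache, which this document states is used with maxsize=10; the embed function below is a stand-in, not the real embedding:

```python
from functools import lru_cache

@lru_cache(maxsize=10)
def embed(text):
    # Stand-in for the real BM25 embedding; the cache keys on the input text.
    return tuple(sorted(hash(tok) % 1000 for tok in text.split()))

embed("repeat me")
embed("repeat me")  # identical input: served from the cache, no recomputation
hits = embed.cache_info().hits
```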
Configuration Reference
- corpus: Optional corpus for training a custom encoder. If None, the built-in encoder is used
- Encoding mode: "query" for search queries, "document" for document indexing
- Language: "zh" (Chinese) or "en" (English) for the built-in encoder. Only used when corpus is None
- b: Document length normalization parameter in [0, 1]. Only used with a custom corpus
- k1: Term frequency saturation parameter. Only used with a custom corpus
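The documented parameters can be summarized as a config object. This dataclass is a hypothetical stand-in mirroring the reference above (the field names `encode_type` and the default language are assumptions, not the actual constructor signature):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BM25Config:
    # Hypothetical stand-in for the documented parameters, not the real class.
    corpus: Optional[List[str]] = None  # None -> built-in encoder
    encode_type: str = "document"       # "query" or "document"
    language: str = "zh"                # "zh" or "en"; built-in encoder only
    b: float = 0.75                     # length normalization, in [0, 1]
    k1: float = 1.2                     # term-frequency saturation

    def __post_init__(self):
        if not 0.0 <= self.b <= 1.0:
            raise ValueError("b must be in [0, 1]")
        if self.encode_type not in ("query", "document"):
            raise ValueError('encode_type must be "query" or "document"')

cfg = BM25Config(language="en")
```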
Notes
- Results are cached (LRU cache, maxsize=10) to reduce computation
- No API key or network connectivity required (fully local)
- Output is sorted by indices (vocabulary term IDs) for consistency
- Terms not in vocabulary will have zero scores (not included in output)
- DashText automatically handles Chinese/English text segmentation
- Sparse vectors are memory-efficient (only store non-zero values)
See Also
- DefaultLocalSparseEmbedding - SPLADE-based sparse embedding
- QwenSparseEmbedding - API-based sparse embedding using Qwen
- DefaultLocalDenseEmbedding - Dense embedding for semantic search
References
- DashText Documentation
- DashText PyPI
- BM25 Algorithm: Robertson & Zaragoza (2009)