Basic Text Indexing
Define text fields in your schema:
The index: enable-bm25 directive enables BM25 ranking features for the field.
Text Matching Operators
Contains (Term Matching)
Matches individual terms with linguistic processing:
- Tokenizes the query
- Applies stemming (“searching” → “search”)
- Handles case normalization
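The three steps above can be illustrated with a toy Python sketch. The suffix-stripping stemmer and the names toy_stem, process, and contains are hypothetical stand-ins for illustration; they are not Vespa's linguistics implementation, which is considerably more sophisticated.

```python
import re


def toy_stem(token: str) -> str:
    """Strip a few common English suffixes (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text: str) -> list[str]:
    """Lowercase, tokenize on non-word characters, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens]


def contains(field_text: str, query: str) -> bool:
    """True if every processed query token occurs among the processed field tokens."""
    field_terms = set(process(field_text))
    return all(term in field_terms for term in process(query))


print(process("Searching"))
print(contains("Full-text search in Vespa", "Searching"))
```

Because both the field text and the query pass through the same processing, a query for "Searching" can match a document that only contains "search".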
Matches (Phrase Matching)
Exact phrase matching:
Prefix Matching
Substring Matching
Suffix Matching
BM25 Ranking
Vespa implements the BM25 ranking algorithm, the industry-standard text relevance function. The implementation is in searchlib/src/vespa/searchlib/features/bm25_feature.cpp.
BM25 Formula
From the source code (searchlib/src/vespa/searchlib/features/bm25_feature.cpp:59-78):
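As a rough illustration, here is a minimal Python sketch of the textbook BM25 score. It is not the C++ code from bm25_feature.cpp. The IDF variant used here, log(1 + (N - n + 0.5) / (n + 0.5)), is an assumption, and the function and parameter names are hypothetical.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency (assumed variant; rarer terms score higher)."""
    return math.log(1.0 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term(tf: int, field_len: float, avg_field_len: float,
              total_docs: int, docs_with_term: int,
              k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term for one document field."""
    # Length normalization: fields longer than average are penalized when b > 0.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    return idf(total_docs, docs_with_term) * (tf * (k1 + 1.0)) / (tf + norm)


def bm25(terms: list[dict], field_len: float, avg_field_len: float,
         total_docs: int) -> float:
    """Sum the per-term contributions over all matched query terms."""
    return sum(
        bm25_term(t["tf"], field_len, avg_field_len, total_docs, t["docs_with_term"])
        for t in terms
    )


score = bm25(
    [{"tf": 3, "docs_with_term": 50}, {"tf": 1, "docs_with_term": 400}],
    field_len=120, avg_field_len=100, total_docs=10_000,
)
print(round(score, 3))
```

The score is a sum over matched query terms, so a document that matches more of the query, or matches rarer terms, generally ranks higher.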
Using BM25 in Ranking
BM25 Parameters
Customize k1 and b parameters:
Understanding k1 parameter
k1 controls term frequency saturation:
- Lower values (0.5-1.0): Less emphasis on term frequency
- Default (1.2): Balanced
- Higher values (1.5-2.0): More emphasis on term frequency
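The saturation effect can be seen numerically with a small Python sketch. It isolates the term-frequency part of the textbook BM25 formula, assuming a field of exactly average length so that length normalization drops out; tf_component is a hypothetical helper name.

```python
def tf_component(tf: float, k1: float) -> float:
    """Term-frequency factor of BM25 for a field of average length."""
    return tf * (k1 + 1.0) / (tf + k1)


# Compare how much ten occurrences are worth relative to one occurrence.
for k1 in (0.5, 1.2, 2.0):
    single = tf_component(1, k1)
    many = tf_component(10, k1)
    print(f"k1={k1}: tf=1 -> {single:.3f}, tf=10 -> {many:.3f}")
```

With this factor, a single occurrence always contributes 1.0, and the contribution approaches k1 + 1 as tf grows. A larger k1 therefore leaves more room for repeated occurrences to raise the score.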
Understanding b parameter
b controls document length normalization:
- b = 0: No length normalization
- b = 0.75: Default, balanced normalization
- b = 1.0: Full length normalization
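A short Python sketch of the length-normalization term in the textbook BM25 formula shows what the three settings do. The helper names are hypothetical, and the example field lengths are made up for illustration.

```python
def length_norm(field_len: float, avg_field_len: float, b: float) -> float:
    """Length-normalization factor: 1.0 means no adjustment for field length."""
    return 1.0 - b + b * field_len / avg_field_len


def tf_score(tf: float, field_len: float, avg_field_len: float,
             k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 with length normalization applied."""
    return tf * (k1 + 1.0) / (tf + k1 * length_norm(field_len, avg_field_len, b))


# Same term frequency in a short field and in a long field.
for b in (0.0, 0.75, 1.0):
    short = tf_score(2, field_len=50, avg_field_len=100, b=b)
    long_ = tf_score(2, field_len=400, avg_field_len=100, b=b)
    print(f"b={b}: short field -> {short:.3f}, long field -> {long_:.3f}")
```

With b = 0 the two fields score identically. As b grows toward 1.0, the same term frequency counts for less in the long field and for more in the short one.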
Linguistic Processing
Vespa applies the following linguistic processing (see container-search/src/main/java/com/yahoo/search/yql/YqlParser.java:26-31):
- Tokenization: Split text into tokens
- Normalization: Case folding, Unicode normalization
- Stemming: Reduce words to root form
- Segmentation: Language-specific word segmentation (e.g., Chinese, Japanese)
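The normalization step in the list above can be sketched with Python's standard library. This is an assumption-laden illustration: it uses NFKC normalization and casefold(), which may not match the exact transformations Vespa applies, and it does not attempt stemming or language-specific segmentation. The function name normalize_and_tokenize is hypothetical.

```python
import re
import unicodedata


def normalize_and_tokenize(text: str) -> list[str]:
    """Apply Unicode NFKC normalization and case folding, then split into word tokens."""
    # NFKC maps compatibility characters (e.g. full-width letters) to canonical forms.
    normalized = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lowercase intended for caseless matching.
    folded = normalized.casefold()
    return re.findall(r"\w+", folded)


print(normalize_and_tokenize("ＶＥＳＰＡ Search"))
print(normalize_and_tokenize("Straße"))
```

Whitespace and punctuation splitting like this is not enough for languages written without spaces between words, which is why a separate segmentation step exists.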
Controlling Linguistic Processing
Disable stemming for specific terms:
Language Detection
Vespa can detect document language automatically:
Multi-Field Search
Search across multiple fields:
Fieldsets
Define fieldsets for convenience:
Advanced Text Operators
Fuzzy Search
Tolerate typos and misspellings:
maxEditDistance controls how many character edits are allowed (default: 2).
Regular Expressions
WAND (Weak AND)
Efficiently find documents matching any of many terms:
WeakAnd
Similar to WAND but for boolean queries (see application/src/test/java/com/yahoo/application/ApplicationTest.java):
Text Ranking Features
Vespa provides many text-based ranking features:
Term Frequency Features
Available Features
- bm25(field): BM25 score for field
- fieldMatch(field): Advanced field matching score
- fieldLength(field): Document field length
- fieldTermMatch(field, term_idx): Per-term matching info
- term(idx).significance: Term significance (IDF-based)
- termDistance(field, term1_idx, term2_idx): Distance between terms
Query Annotations
Fine-tune query term behavior:
Text Search Performance
Indexing Performance
Query Performance
Use weakAnd for large result sets
When queries match many documents, weakAnd provides better latency
Stopwords
Configure stopwords to filter common words:
Highlighting
Enable result highlighting:
Best Practices
- Enable BM25 for text relevance: Use index: enable-bm25 on important text fields
- Use appropriate match modes: contains for terms, matches for phrases
- Leverage linguistic processing: Let Vespa handle stemming and normalization
- Combine with filters: Use structured filters to narrow results before text matching
- Monitor field lengths: Very long fields can impact ranking quality
Next Steps
- Learn about Ranking Expressions to combine text signals
- Explore Vector Search for semantic search
- Read about Grouping & Aggregation for result organization