Text Search - Vespa

Vespa provides powerful full-text search capabilities with BM25 ranking, linguistic processing, and advanced text matching operators. The search engine handles tokenization, stemming, and relevance ranking out of the box.

Basic Text Indexing

Define text fields in your schema:

schema article {
  document article {
    field title type string {
      indexing: summary | index
      index: enable-bm25
    }
    
    field body type string {
      indexing: summary | index
      index: enable-bm25
    }
  }
}

The index: enable-bm25 directive enables BM25 ranking features for the field.

Text Matching Operators

Contains (Term Matching)

Matches individual terms with linguistic processing:

select * from sources article where title contains "search"

Vespa automatically:

Tokenizes the query
Applies stemming (“searching” → “search”)
Handles case normalization
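
To try a query like the one above, it can be sent to the Query API over HTTP. The following is a minimal sketch that only builds the request URL and decodes it again; the endpoint http://localhost:8080 and the helper name build_search_url are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(yql, hits=10, endpoint="http://localhost:8080"):
    """Build a GET URL for the /search/ Query API (the endpoint is an assumption)."""
    params = {"yql": yql, "hits": hits}
    # urlencode percent-encodes quotes, spaces and '*' so the YQL survives transport
    return f"{endpoint}/search/?{urlencode(params)}"

yql = 'select * from sources article where title contains "search"'
url = build_search_url(yql)
print(url)

# Round-trip check: decoding the query string gives back the original YQL
decoded = parse_qs(urlparse(url).query)["yql"][0]
print(decoded)
```

The URL can then be fetched with any HTTP client once an application is deployed.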

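Phrase Matching

Match an exact sequence of terms with phrase():

select * from sources article where title contains phrase("search", "engine")

The matches operator is not a phrase operator; it performs regular-expression matching (see Regular Expressions below).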

Prefix Matching

select * from sources article where title contains ({prefix: true}"sear")

Substring Matching

select * from sources article where title contains ({substring: true}"arch")

Suffix Matching

select * from sources article where title contains ({suffix: true}"arch")

BM25 Ranking

Vespa implements BM25, a widely used text relevance function that scores a document from term frequency, inverse document frequency, and field length. The implementation is in searchlib/src/vespa/searchlib/features/bm25_feature.cpp.

BM25 Formula

From the source code (searchlib/src/vespa/searchlib/features/bm25_feature.cpp:59-78):

void Bm25Executor::execute(uint32_t doc_id)
{
    feature_t score = 0;
    for (const auto& term : _terms) {
        if (term.tfmd->has_ranking_data(doc_id)) {
            auto raw_num_occs = term.tfmd->getNumOccs();
            if (raw_num_occs == 0) {
                // Assume 1 occurrence and average field length
                score += term.degraded_score;
            } else {
                feature_t num_occs = raw_num_occs;
                feature_t norm_field_length = ((feature_t) term.tfmd->getFieldLength()) / _avg_field_length;
                feature_t numerator = num_occs * term.idf_mul_k1_plus_one;
                feature_t denominator = num_occs + (_k1_mul_one_minus_b + _k1_mul_b * norm_field_length);
                score += numerator / denominator;
            }
        }
    }
    outputs().set_number(0, score);
}
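
Read as a formula, the loop above appears to compute the following sum over the matched query terms. The notation is mine, and it assumes, going by the member names, that idf_mul_k1_plus_one, _k1_mul_one_minus_b and _k1_mul_b hold the products their names describe. Here f_t is the number of occurrences of term t in the field, |d| the field length, and avgdl the average field length.

```latex
\[
\mathrm{bm25}(d) \;=\; \sum_{t \in q}
  \frac{f_t \cdot \mathrm{idf}(t)\,(k_1 + 1)}
       {f_t + k_1 (1 - b) + k_1\, b \,\dfrac{|d|}{\mathrm{avgdl}}}
\]
```

The degraded-score branch corresponds to the same expression evaluated with f_t = 1 and |d| = avgdl, as the code comment indicates.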

Using BM25 in Ranking

schema article {
  document article {
    field title type string {
      indexing: summary | index
      index: enable-bm25
    }
    field body type string {
      indexing: summary | index
      index: enable-bm25
    }
  }
  
  rank-profile bm25 {
    first-phase {
      expression: bm25(title) + bm25(body)
    }
  }
}

BM25 Parameters

Customize k1 and b parameters:

rank-profile custom-bm25 {
  rank-properties {
    bm25(title).k1: 1.5  # Term frequency saturation (default: 1.2)
    bm25(title).b: 0.8   # Length normalization (default: 0.75)
  }
  first-phase {
    expression: bm25(title)
  }
}

Understanding k1 parameter

k1 controls term frequency saturation:

Lower values (0.5-1.0): Less emphasis on term frequency
Default (1.2): Balanced
Higher values (1.5-2.0): More emphasis on term frequency

Understanding b parameter

b controls document length normalization:

b = 0: No length normalization
b = 0.75: Default, balanced normalization
b = 1.0: Full length normalization
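
The effect of both parameters can be checked numerically. The sketch below mirrors the per-term arithmetic of Bm25Executor::execute shown earlier; the function name bm25_term and the fixed idf of 1.0 are illustrative assumptions, and this is not Vespa's code.

```python
def bm25_term(num_occs, field_length, avg_field_length, idf=1.0, k1=1.2, b=0.75):
    """Per-term BM25 contribution, following the structure of the C++ loop."""
    norm_field_length = field_length / avg_field_length
    numerator = num_occs * idf * (k1 + 1)
    denominator = num_occs + k1 * (1 - b) + k1 * b * norm_field_length
    return numerator / denominator

# k1: how quickly repeated occurrences stop adding score (average-length field)
for k1 in (0.5, 1.2, 2.0):
    scores = [round(bm25_term(tf, 100, 100, k1=k1), 3) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {scores}")

# b: how strongly a field longer than average is penalised
for b in (0.0, 0.75, 1.0):
    short = bm25_term(3, 50, 100, b=b)
    long_ = bm25_term(3, 500, 100, b=b)
    print(f"b={b}: short field {short:.3f}, long field {long_:.3f}")
```

With b = 0 the two field lengths score identically, while with b = 1.0 the long field is penalised most. For a field of average length, the score approaches idf * (k1 + 1) as the occurrence count grows.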

Linguistic Processing

Vespa applies the following linguistic processing steps (the YQL parser references them in container-search/src/main/java/com/yahoo/search/yql/YqlParser.java:26-31):

Tokenization: Split text into tokens
Normalization: Case folding, Unicode normalization
Stemming: Reduce words to root form
Segmentation: Language-specific word segmentation (e.g., Chinese, Japanese)
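
The first three steps can be sketched as a toy pipeline. This is only an illustration of the idea: the suffix-stripping rule stands in for a real stemmer, all function names are hypothetical, and none of it reflects how Vespa's linguistics module is implemented.

```python
import re
import unicodedata

def normalize(text):
    """Unicode-normalize (NFKC) and case-fold the input."""
    return unicodedata.normalize("NFKC", text).casefold()

def tokenize(text):
    """Split on non-word characters; real tokenizers are language-aware."""
    return re.findall(r"\w+", text)

def toy_stem(token):
    """Strip a couple of English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    return [toy_stem(tok) for tok in tokenize(normalize(text))]

print(analyze("Searching ENGINES"))
```

Because the same processing is applied to both document text and query terms, a query for "searching" and a document containing "search" end up with the same indexed form.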

Controlling Linguistic Processing

Disable stemming for specific terms:

select * from sources article 
where title contains ({stem: false}"Vespa")

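Language Detection

When a document does not specify a language, Vespa detects it from the text being indexed. To set the language explicitly instead, assign it from a field with set_language:

field language type string {
  indexing: set_language
}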

Multi-Field Search

Search across multiple fields:

select * from sources article 
where 
  title contains "vespa" 
  or body contains "vespa"

Fieldsets

Define fieldsets for convenience:

fieldset default {
  fields: title, body
}

Then query:

select * from sources article where default contains "vespa"

Advanced Text Operators

Fuzzy Search

Tolerate typos and misspellings:

select * from sources article 
where title contains ({maxEditDistance: 2}fuzzy("serch"))

maxEditDistance controls how many character edits are allowed (default: 2).
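
To make "character edits" concrete, here is a small sketch that computes the distance between the misspelled query term and the intended word. It assumes edit distance means Levenshtein distance (insertions, deletions and substitutions), and the function is an illustration rather than Vespa's matcher.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("serch", "search"))
```

"serch" is one insertion away from "search", so it falls within a maxEditDistance of 1 as well as the default of 2.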

Regular Expressions

select * from sources article 
where title matches "search.*engine"

WAND (Weak AND)

Efficiently find documents matching any of many terms:

select * from sources article 
where {targetHits: 100}wand(title, {"vespa": 1, "search": 2, "engine": 1})

The numbers are term weights. WAND efficiently finds the top-k documents without evaluating all matches.
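
The result WAND aims for can be described with a brute-force sketch: score every document by the weighted sum of its matching terms and keep the best k. This shows only the target result, not the algorithm, since the point of WAND is to avoid this exhaustive scoring by skipping documents whose best possible score cannot reach the current top k. The document-side weights and all names below are assumptions for illustration.

```python
import heapq

def top_k_by_weighted_match(query_weights, docs, k):
    """Exhaustively score docs by sum(query weight * doc weight) and keep the top k."""
    scored = []
    for doc_id, doc_weights in docs.items():
        score = sum(w * doc_weights[t]
                    for t, w in query_weights.items() if t in doc_weights)
        if score > 0:  # documents matching no query term are not hits
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

query = {"vespa": 1, "search": 2, "engine": 1}
docs = {
    "a": {"vespa": 1, "search": 1},
    "b": {"engine": 1},
    "c": {"cooking": 1},
}
print(top_k_by_weighted_match(query, docs, k=2))
```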

WeakAnd

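Like WAND, weakAnd retrieves the best matches without fully evaluating every document, but it takes ordinary query clauses (which may span several fields) instead of a weighted token set on one field: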

select * from sources article 
where weakAnd(title contains "vespa", body contains "search") 
limit 10

From test case (application/src/test/java/com/yahoo/application/ApplicationTest.java):

equals("select * from sources * where weakAnd(substring contains \"foobar\") limit 2 timeout 20000000", 
       result.getQuery().yqlRepresentation(true));

Text Ranking Features

Vespa provides many text-based ranking features:

Term Frequency Features

rank-profile text-features {
  first-phase {
    expression: (
      fieldMatch(title) +
      fieldTermMatch(body, 0).occurrences +
      term(0).significance
    )
  }
}

Available Features

bm25(field): BM25 score for field
fieldMatch(field): Advanced field matching score
fieldLength(field): Document field length
fieldTermMatch(field, term_idx): Per-term matching info
term(idx).significance: Term significance (IDF-based)
termDistance(field, term1_idx, term2_idx): Distance between terms

Query Annotations

Fine-tune query term behavior:

select * from sources article 
where title contains ({weight: 200}"important")

Text Search Performance

Indexing Performance

Use appropriate field types: Use string for text fields that need full indexing
Enable BM25 selectively: Only enable BM25 on fields that need advanced ranking
Tune memory settings: Increase memory for large text corpora

Query Performance

Use weakAnd for large result sets: When queries match many documents, weakAnd provides better latency
Limit field searches: Query specific fields instead of all fields
Use prefix/substring sparingly: These operators are slower than exact matching
Set appropriate query timeout: Prevent slow queries from consuming resources

Stopwords

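Vespa schemas have no per-field stopword directive. Common words are usually handled at query time instead: weakAnd can skip documents that match only low-significance terms, and a custom Searcher can drop stopwords from the query before it is executed.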

Highlighting

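Enable highlighting of matched query terms on a field in the schema:

field body type string {
  indexing: summary | index
  bolding: on
}

Matched terms are then marked up in the body summary field of ordinary query results:

select * from sources article 
where body contains "vespa"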

Best Practices

Enable BM25 for text relevance: Use index: enable-bm25 on important text fields
Use appropriate match modes: contains for terms, phrase() for phrases, matches for regular expressions
Leverage linguistic processing: Let Vespa handle stemming and normalization
Combine with filters: Use structured filters to narrow results before text matching
Monitor field lengths: Very long fields can impact ranking quality

Purely prefix-based queries (e.g., "a*") can be expensive. Consider minimum prefix lengths or use suggest/autocomplete features.

Next Steps

Learn about Ranking Expressions to combine text signals
Explore Vector Search for semantic search
Read about Grouping & Aggregation for result organization
