Vespa provides built-in embedder components that transform text into vector representations for semantic search, similarity matching, and retrieval tasks.

Overview

Embedders implement the Embedder interface and can be used during:
  • Document processing - Embed text fields when indexing documents
  • Query processing - Embed query text for semantic search
  • Custom processing - Use embedders in custom components
All embedders support:
  • Automatic tokenization and text preprocessing
  • Caching of embedding results
  • Configurable model parameters
  • ONNX model inference

Available Embedders

  • BertBaseEmbedder - BERT-based models with WordPiece tokenization
  • HuggingFaceEmbedder - Generic Hugging Face transformer models
  • ColBertEmbedder - Multi-vector token-level embeddings
  • SpladeEmbedder - Sparse learned embeddings

BertBaseEmbedder

The BertBaseEmbedder supports BERT and BERT-compatible models (DistilBERT, RoBERTa, etc.).

Configuration

<container id="default" version="1.0">
  <component id="myBertEmbedder" class="ai.vespa.embedding.BertBaseEmbedder" bundle="model-integration">
    <config name="embedding.bert-base-embedder">
      <tokenizerVocab>
        <model>models/vocab.txt</model>
      </tokenizerVocab>
      <transformerModel>
        <model>models/bert_model.onnx</model>
      </transformerModel>
      <transformerMaxTokens>384</transformerMaxTokens>
      <transformerInputIds>input_ids</transformerInputIds>
      <transformerAttentionMask>attention_mask</transformerAttentionMask>
      <transformerTokenTypeIds>token_type_ids</transformerTokenTypeIds>
      <transformerOutput>output_0</transformerOutput>
      <poolingStrategy>mean</poolingStrategy>
    </config>
  </component>
</container>

Model Requirements

BERT-compatible models must have three inputs:
// From: model-integration/src/main/java/ai/vespa/embedding/BertBaseEmbedder.java:23-30
/**
 * A BERT Base compatible embedder. This embedder uses a WordPiece embedder to
 * produce a token sequence that is then input to a transformer model. A BERT base
 * compatible transformer model must have three inputs:
 *
 *  - A token sequence (input_ids)
 *  - An attention mask (attention_mask)
 *  - Token types for cross encoding (token_type_ids)
 */

Pooling Strategies

Configure how per-token embeddings are pooled into a single sentence embedding. The mean strategy averages all token embeddings and is recommended for most models:
<poolingStrategy>mean</poolingStrategy>

Schema Integration

schema doc {
    document doc {
        field text type string {}
    }
    
    field embedding type tensor<float>(x[384]) {
        indexing: input text | embed myBertEmbedder | attribute
        attribute {
            distance-metric: angular
        }
    }
    
    rank-profile semantic {
        inputs {
            query(q) tensor<float>(x[384])
        }
        first-phase {
            expression: closeness(field, embedding)
        }
    }
}
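With the schema above deployed, the query text can be embedded at query time with the same embedder component. A minimal sketch of a Query API request body, assuming the embedder id, field names, and rank profile from the examples above (POST this JSON to the container's /search/ endpoint):

```python
import json

# Query request body for the Vespa Query API. The embedder id
# "myBertEmbedder", the "embedding" field, and the "semantic" rank
# profile are assumptions matching the configuration and schema above.
payload = {
    "yql": "select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)",
    "input.query(q)": "embed(myBertEmbedder, @text)",
    "text": "how do embedders work?",
    "ranking": "semantic",
}
body = json.dumps(payload)
print(body)
```

The embed(...) expression runs the configured embedder on the text parameter at query time, so documents and queries are embedded by the same model.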

HuggingFaceEmbedder

The HuggingFaceEmbedder supports any Hugging Face model exported to ONNX format.

Configuration

<component id="hf" class="ai.vespa.embedding.HuggingFaceEmbedder" bundle="model-integration">
  <config name="embedding.hugging-face-embedder">
    <tokenizerPath>
      <model>models/tokenizer.json</model>
    </tokenizerPath>
    <transformerModel>
      <model>models/model.onnx</model>
    </transformerModel>
    <transformerMaxTokens>512</transformerMaxTokens>
    <transformerInputIds>input_ids</transformerInputIds>
    <transformerAttentionMask>attention_mask</transformerAttentionMask>
    <transformerOutput>last_hidden_state</transformerOutput>
    <normalize>true</normalize>
    <poolingStrategy>mean</poolingStrategy>
  </config>
</component>

Model Inputs

The embedder automatically detects the number of inputs your model requires:
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:54-73
static ModelAnalysis analyze(OnnxEvaluator evaluator, HuggingFaceEmbedderConfig config) {
    Map<String, TensorType> inputs = evaluator.getInputInfo();
    int numInputs = inputs.size();
    String inputIdsName = config.transformerInputIds();
    String attentionMaskName = "";
    String tokenTypeIdsName = "";
    validateName(inputs, inputIdsName, "input");
    // some new models have only 1 input
    if (numInputs > 1) {
        attentionMaskName = config.transformerAttentionMask();
        validateName(inputs, attentionMaskName, "input");
        // newer models have only 2 inputs (they do not use token type IDs)
        if (numInputs > 2) {
            tokenTypeIdsName = config.transformerTokenTypeIds();
            validateName(inputs, tokenTypeIdsName, "input");
        }
    }
    // ...
}

Normalization

Enable L2 normalization for cosine similarity:
<normalize>true</normalize>
This normalizes embeddings to unit length, making cosine similarity equivalent to dot product.
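A small numpy illustration (not Vespa code) of why this equivalence holds:

```python
import numpy as np

# After L2 normalization, the dot product of two embeddings equals
# their cosine similarity, since each norm in the denominator is 1.
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

print(cosine, dot)
```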

Query and Document Instructions

Some models require different prompts for queries vs documents:
<prependQuery>query: </prependQuery>
<prependDocument>passage: </prependDocument>
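What these settings do is simple: the configured string is prepended to the text before tokenization. The "query: " / "passage: " prompts shown follow the convention used by E5-style models; check your model card for the prompts it expects. A toy illustration (not Vespa's implementation):

```python
# Toy sketch of prependQuery / prependDocument: the configured prefix
# is prepended to the text before it is tokenized and embedded.
def with_prefix(text: str, is_query: bool) -> str:
    prefix = "query: " if is_query else "passage: "
    return prefix + text

print(with_prefix("what is vespa", True))
print(with_prefix("Vespa is a search engine.", False))
```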

Binary Quantization

Reduce memory usage by binarizing embeddings: each dimension is reduced to a single bit, and 8 bits are packed into each int8 value, so a 512-dimensional model output fits in a 64-element int8 tensor:
field embedding type tensor<int8>(x[64]) {
    indexing: input text | embed hf | attribute
}
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:216-234
private Tensor binaryQuantization(HuggingFaceEmbedder.HFEmbeddingResult embeddingResult, TensorType targetType) {
    long outputDimensions = embeddingResult.output().shape()[2];
    long targetDimensions = targetType.dimensions().get(0).size().get();
    //🪆 flexibility - packing only the first 8*targetDimension float values from the model output
    long targetUnpackagedDimensions = 8 * targetDimensions;
    if (targetUnpackagedDimensions > outputDimensions) {
        throw new IllegalArgumentException("Cannot pack " + outputDimensions + " into " + targetDimensions + " int8's");
    }
    // pool and normalize using float version before binary quantization
    TensorType poolingType = new TensorType.Builder(TensorType.Value.FLOAT).
                                     indexed(targetType.indexedSubtype().dimensions().get(0).name(), targetUnpackagedDimensions)
                                     .build();
    Tensor result = analysis.poolingStrategy().toSentenceEmbedding(poolingType, embeddingResult.output(), embeddingResult.attentionMask());
    result = normalize ? EmbeddingNormalizer.normalize(result, poolingType) : result;
    Tensor packedResult = Tensors.packBits(result);
    return packedResult;
}
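The bit-packing idea behind Tensors.packBits can be sketched in numpy (an illustration, not the exact Vespa implementation):

```python
import numpy as np

# 512 float dimensions reduce to 64 int8 values: one bit per dimension,
# set when the (pooled, normalized) value is positive.
embedding = np.random.default_rng(0).standard_normal(512).astype(np.float32)

bits = (embedding > 0).astype(np.uint8)
packed = np.packbits(bits).astype(np.int8)  # 512 bits -> 64 bytes

print(packed.shape)
```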

ColBertEmbedder

ColBERT produces multiple vectors per text (one per token), enabling fine-grained similarity matching.

Configuration

<component id="colbert" class="ai.vespa.embedding.ColBertEmbedder" bundle="model-integration">
  <config name="embedding.col-bert-embedder">
    <tokenizerPath>
      <model>models/tokenizer.json</model>
    </tokenizerPath>
    <transformerModel>
      <model>models/colbert.onnx</model>
    </transformerModel>
    <transformerMaxTokens>512</transformerMaxTokens>
    <maxQueryTokens>32</maxQueryTokens>
    <maxDocumentTokens>256</maxDocumentTokens>
    <queryTokenId>1</queryTokenId>
    <documentTokenId>2</documentTokenId>
  </config>
</component>

Multi-Vector Schema

ColBERT requires a mixed tensor type:
schema doc {
    document doc {
        field text type string {}
    }
    
    field colbert type tensor<float>(token{}, x[128]) {
        indexing: input text | embed colbert | attribute
    }
    
    rank-profile colbert {
        inputs {
            query(qt) tensor<float>(qt{}, x[128])
        }
        first-phase {
            expression: sum(
                reduce(
                    sum(
                        query(qt) * attribute(colbert), x
                    ),
                    max, token
                ),
                qt
            )
        }
    }
}
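The ranking expression above is the MaxSim operator. The same computation in plain numpy, with illustrative shapes (4 query tokens, 20 document tokens, 128-dimensional vectors):

```python
import numpy as np

# MaxSim: for each query token, take the maximum dot product over all
# document token vectors, then sum those maxima over the query tokens.
rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((4, 128))   # the qt{} dimension
doc_tokens = rng.standard_normal((20, 128))    # the token{} dimension

sims = query_tokens @ doc_tokens.T   # sum over x: (4, 20) dot products
score = sims.max(axis=1).sum()       # max over token, then sum over qt

print(score)
```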

Token Filtering

ColBERT automatically filters punctuation tokens for documents:
// From: model-integration/src/main/java/ai/vespa/embedding/ColBertEmbedder.java:45-105
private static final String PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

protected TransformerInput buildTransformerInput(List<Long> tokens, int maxTokens, boolean isQuery) {
    if (!isQuery) {
        tokens = tokens.stream().filter(token -> !skipTokens.contains(token)).toList();
    }
    // ...
}

SpladeEmbedder

SPLADE creates sparse embeddings using learned term importance weights.

Configuration

<component id="splade" class="ai.vespa.embedding.SpladeEmbedder" bundle="model-integration">
  <config name="embedding.splade-embedder">
    <tokenizerPath>
      <model>models/tokenizer.json</model>
    </tokenizerPath>
    <transformerModel>
      <model>models/splade.onnx</model>
    </transformerModel>
    <termScoreThreshold>0.0</termScoreThreshold>
  </config>
</component>

Sparse Tensor Output

SPLADE produces a mapped tensor with vocabulary terms as labels:
field splade type tensor<float>(term{}) {
    indexing: input text | embed splade | attribute
    attribute: fast-search
}

Custom Reduction

SPLADE uses optimized reduction for performance:
// From: model-integration/src/main/java/ai/vespa/embedding/SpladeEmbedder.java:177-222
public Tensor sparsifyCustomReduce(IndexedTensor modelOutput, TensorType tensorType) {
    var builder = Tensor.Builder.of(tensorType);
    long[] shape = modelOutput.shape();
    int sequenceLength = (int) shape[1];
    int vocabSize = (int) shape[2];

    String dimension = tensorType.dimensions().get(0).name();
    long [] tokens = new long[1];
    DirectIndexedAddress directAddress = modelOutput.directAddress();
    directAddress.setIndex(0,0);
    for (int v = 0; v < vocabSize; v++) {
        double maxValue = 0.0d;
        directAddress.setIndex(2, v);
        long increment = directAddress.getStride(1);
        long directIndex = directAddress.getDirectIndex();
        for (int s = 0; s < sequenceLength; s++) {
            double value = modelOutput.get(directIndex + s * increment);
            if (value > maxValue) {
                maxValue = value;
            }
        }
        double logOfRelu = Math.log(1 + maxValue);
        if (logOfRelu > termScoreThreshold) {
            tokens[0] = v;
            String term = tokenizer.decode(tokens);
            builder.cell()
                    .label(dimension, term)
                    .value(logOfRelu);
        }
    }
    return builder.build();
}
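The reduction above, sketched in numpy: take the maximum activation over the sequence axis for each vocabulary entry, apply log(1 + relu(.)), and keep only terms above the threshold. Shapes and the tiny "vocabulary" are illustrative:

```python
import numpy as np

# Mirrors the loop in sparsifyCustomReduce: the Java code starts maxValue
# at 0.0, which makes the relu implicit before log(1 + max).
rng = np.random.default_rng(0)
model_output = rng.standard_normal((1, 5, 10))  # (batch, sequence, vocab)
term_score_threshold = 0.0

max_over_seq = model_output[0].max(axis=0)          # per-vocab-entry max
scores = np.log1p(np.maximum(max_over_seq, 0.0))    # log(1 + relu(max))
sparse = {v: s for v, s in enumerate(scores) if s > term_score_threshold}

print(sorted(sparse))
```

In the real embedder each kept vocabulary index is decoded back to its term string, which becomes the mapped-dimension label in the output tensor.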

Exporting Models to ONNX

From Hugging Face

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"

# Export model to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id, export=True
)
model.save_pretrained("exported_model")

# Save tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("exported_model")

From PyTorch

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")

# Create dummy input
dummy_input = tokenizer("sample text", return_tensors="pt")

# Export to ONNX
torch.onnx.export(
    model,
    (dummy_input["input_ids"], 
     dummy_input["attention_mask"],
     dummy_input["token_type_ids"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"}
    },
    opset_version=14
)
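After exporting, it is worth checking that the ONNX graph's input and output names match the transformerInputIds, transformerAttentionMask, and transformerOutput settings in the embedder config. A small helper sketch (the onnxruntime call is shown as a comment since it needs the exported file):

```python
def io_names(session):
    """Return the input and output tensor names of an ONNX Runtime session."""
    return ([i.name for i in session.get_inputs()],
            [o.name for o in session.get_outputs()])

# Usage (requires the onnxruntime package and the exported model):
#   import onnxruntime as ort
#   inputs, outputs = io_names(ort.InferenceSession("model.onnx"))
#   print(inputs, outputs)
```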

Performance Tuning

Caching

Embedders automatically cache results per request:
// From: model-integration/src/main/java/ai/vespa/embedding/HuggingFaceEmbedder.java:180-183
private HuggingFaceEmbedder.HFEmbeddingResult lookupOrEvaluate(Context context, String text) {
    var key = new HFEmbedderCacheKey(context.getEmbedderId(), text);
    return context.computeCachedValueIfAbsent(key, () -> evaluate(context, text));
}

Thread Configuration

Tune ONNX Runtime threading through the onnx-evaluator config (defined in onnx-evaluator.def); a negative intraOpThreads value is interpreted as that fraction of the available CPUs:
<config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
  <executionMode>sequential</executionMode>
  <interOpThreads>1</interOpThreads>
  <intraOpThreads>-4</intraOpThreads>  <!-- CPUs / 4 -->
</config>

GPU Acceleration

Enable GPU inference:
<gpuDevice>0</gpuDevice>  <!-- Use first GPU, -1 for CPU -->

Next Steps

  • ONNX Models - Learn about ONNX model deployment
  • Semantic Search - Build semantic search with embeddings
  • RAG Applications - Combine embeddings with generation
  • Performance - Optimize embedding performance
