Overview
Embedders implement the Embedder interface and can be used during:
- Document processing - Embed text fields when indexing documents
- Query processing - Embed query text for semantic search
- Custom processing - Use embedders in custom components

All embedders provide:
- Automatic tokenization and text preprocessing
- Caching of embedding results
- Configurable model parameters
- ONNX model inference
Available Embedders
- BertBaseEmbedder - BERT-based models with WordPiece tokenization
- HuggingFaceEmbedder - Generic Hugging Face transformer models
- ColBertEmbedder - Multi-vector token-level embeddings
- SpladeEmbedder - Sparse learned embeddings
BertBaseEmbedder
The BertBaseEmbedder supports BERT and BERT-compatible models (DistilBERT, RoBERTa, etc.).
Configuration
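A minimal configuration sketch, assuming a Vespa-style services.xml; the component id, the bert-embedder type name, and the model paths are illustrative, not taken from this document:

```xml
<container version="1.0">
    <!-- Hypothetical embedder component; paths point to an exported ONNX model
         and its WordPiece vocabulary -->
    <component id="myBert" type="bert-embedder">
        <transformer-model path="models/model.onnx"/>
        <tokenizer-vocab path="models/vocab.txt"/>
        <!-- Cap input length; longer texts are truncated before inference -->
        <max-tokens>384</max-tokens>
    </component>
</container>
```

Fields and queries can then reference the embedder by its id.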
Model Requirements
BERT-compatible models must have three inputs: token ids, an attention mask, and token type ids.

Pooling Strategies
Configure how token embeddings are pooled into a sentence embedding:
- Mean Pooling - Average all token embeddings (recommended for most cases)
- CLS Pooling - Use only the first ([CLS]) token's embedding
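The pooling choice is typically a single component option; a sketch with an assumed pooling-strategy element (Vespa-style naming; id and paths are placeholders):

```xml
<component id="myBert" type="bert-embedder">
    <transformer-model path="models/model.onnx"/>
    <tokenizer-vocab path="models/vocab.txt"/>
    <!-- "mean" averages all token embeddings; "cls" uses only the first token -->
    <pooling-strategy>mean</pooling-strategy>
</component>
```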
Schema Integration
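A sketch of how an embedder can be wired into a schema so text fields are embedded at indexing time; the schema name, the embedder id myBert, and the 384-dimensional tensor are assumptions for illustration:

```
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    # Synthetic field: embeds the text field using the embedder configured as "myBert"
    field embedding type tensor<float>(x[384]) {
        indexing: input text | embed myBert | attribute | index
        attribute {
            distance-metric: angular
        }
    }
}
```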
HuggingFaceEmbedder
The HuggingFaceEmbedder supports any Hugging Face model exported to ONNX format.
Configuration
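A configuration sketch, again assuming a Vespa-style services.xml; the model URLs are placeholders, and the normalize and prepend options shown correspond to the normalization and instruction-prefix features described below:

```xml
<component id="myEmbedder" type="hugging-face-embedder">
    <transformer-model url="https://huggingface.co/my-org/my-model/resolve/main/model.onnx"/>
    <tokenizer-model url="https://huggingface.co/my-org/my-model/resolve/main/tokenizer.json"/>
    <!-- L2-normalize outputs so dot products equal cosine similarity -->
    <normalize>true</normalize>
    <!-- Different instruction prefixes for query vs document embedding -->
    <prepend>
        <query>query: </query>
        <document>passage: </document>
    </prepend>
</component>
```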
Model Inputs
The embedder automatically detects the number of inputs your model requires.

Normalization
Enable L2 normalization for cosine similarity.

Query and Document Instructions
Some models require different prompts for queries vs documents.

Binary Quantization
Reduce memory usage with int8 quantization.

ColBertEmbedder
ColBERT produces multiple vectors per text (one per token), enabling fine-grained similarity matching.

Configuration
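A configuration sketch (the colbert-embedder type and token-limit options follow Vespa-style naming; the model URLs are placeholders):

```xml
<component id="colbert" type="colbert-embedder">
    <transformer-model url="https://huggingface.co/my-org/my-colbert/resolve/main/model.onnx"/>
    <tokenizer-model url="https://huggingface.co/my-org/my-colbert/resolve/main/tokenizer.json"/>
    <!-- Caps on the number of token vectors produced per query and per document -->
    <max-query-tokens>32</max-query-tokens>
    <max-document-tokens>256</max-document-tokens>
</component>
```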
Multi-Vector Schema
ColBERT requires a mixed tensor type that combines a mapped token dimension with an indexed vector dimension, for example tensor<float>(token{}, x[128]).

Token Filtering
ColBERT automatically filters punctuation tokens for documents.

SpladeEmbedder
SPLADE creates sparse embeddings using learned term importance weights.

Configuration
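A configuration sketch (the splade-embedder type and the term-score-threshold option follow Vespa-style naming and are assumptions here; the paths are placeholders):

```xml
<component id="splade" type="splade-embedder">
    <transformer-model path="models/splade.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
    <!-- Drop vocabulary terms whose learned weight falls below this threshold,
         keeping the sparse tensor small -->
    <term-score-threshold>0.5</term-score-threshold>
</component>
```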
Sparse Tensor Output
SPLADE produces a mapped tensor with vocabulary terms as labels, for example tensor<float>(token{}), where each label is a vocabulary term and each value its learned importance weight.

Custom Reduction
SPLADE uses an optimized reduction over the transformer output for performance.

Exporting Models to ONNX
From Hugging Face
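One common route is the Optimum CLI, which exports a Hugging Face model together with its tokenizer files; the model name and output directory below are placeholders:

```
optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 model-output/
```

The output directory will contain model.onnx plus tokenizer files that the embedder configuration can point at.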
From PyTorch
Performance Tuning
Caching
Embedders automatically cache results per request, so repeated text within one request is embedded only once.

Thread Configuration
Configure ONNX Runtime threads in onnx-evaluator.def: fewer intra-op threads per inference leaves more capacity for concurrent requests, while more threads lowers single-request latency.
GPU Acceleration
Enable GPU inference by configuring the ONNX runtime to use a GPU execution provider.

Next Steps
- ONNX Models - Learn about ONNX model deployment
- Semantic Search - Build semantic search with embeddings
- RAG Applications - Combine embeddings with generation
- Performance - Optimize embedding performance