Supported Model Formats
Vespa supports multiple model formats and frameworks:
ONNX Models
Deploy ONNX models for inference in ranking and stateless evaluation
LightGBM & XGBoost
Native support for gradient boosting models
TensorFlow
Convert TensorFlow models to ONNX or ranking expressions
PyTorch
Export PyTorch models to ONNX for deployment
Model Integration Components
Vespa provides several built-in components for model integration:
Embedders
Embedders transform text into vector representations for semantic search and retrieval:
- BertBaseEmbedder - BERT-based text embeddings
- HuggingFaceEmbedder - Generic Hugging Face transformer models
- ColBertEmbedder - Multi-vector representations for token-level matching
- SpladeEmbedder - Sparse learned embeddings
See the Embeddings page for detailed configuration and examples.
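As a hedged sketch, an embedder is declared as a component in services.xml; the component id and model paths below are illustrative placeholders.

```xml
<container id="default" version="1.0">
    <!-- Illustrative embedder component; id and model paths are placeholders -->
    <component id="my-embedder" type="hugging-face-embedder">
        <transformer-model path="models/e5-small-v2.onnx"/>
        <tokenizer-model path="models/tokenizer.json"/>
    </component>
</container>
```

The component id (`my-embedder` here) is what you reference when embedding text at indexing or query time.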
Model Evaluation
Vespa supports two modes of model evaluation:
Stateless Model Evaluation
Models are evaluated in container nodes using the
ModelsEvaluator API. This is suitable for:
- Pre-processing and feature generation
- Stateless inference tasks
- REST API endpoints for model serving
Content Node Ranking
Models are evaluated during document ranking on content nodes. This is optimal for:
- First-phase and second-phase ranking
- Per-document model inference
- Low-latency ranking with model scoring
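To make the ranking case concrete, here is a sketch of a rank-profile that scores documents with an ONNX model in the second phase; the profile name, features, and model name are assumptions for illustration.

```
rank-profile my_profile {
    first-phase {
        # cheap lexical score over all matched documents
        expression: bm25(title) + bm25(body)
    }
    second-phase {
        # re-score only the top candidates with the ONNX model
        rerank-count: 100
        expression: sum(onnx(my_model))
    }
}
```

Restricting the model to the second phase keeps per-query inference cost bounded by `rerank-count` rather than the full match set.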
Learn more about the differences in Stateless Model Evaluation.
Model Deployment Workflow
Train and Export Your Model
Train your model using your preferred framework (PyTorch, TensorFlow, etc.) and export it to ONNX format.
Model Configuration
Vespa models are configured using the onnx-model element in schema files:
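A sketch of the onnx-model element follows; the model name, file path, and input/output bindings are illustrative placeholders.

```
schema doc {
    onnx-model my_model {
        # path relative to the application package
        file: models/my_model.onnx
        # bind ONNX graph inputs to schema attributes or functions
        input "input_ids": attribute(tokens)
        # expose an ONNX graph output under a name usable in ranking
        output "output_0": score
    }
}
```

The declared model can then be referenced in ranking expressions, e.g. `onnx(my_model).score`.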
Runtime Options
Configure model execution settings in services.xml:
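The following is a hedged sketch of per-model runtime settings under the model-evaluation element; the model name and thread counts are placeholders.

```xml
<container id="default" version="1.0">
    <model-evaluation>
        <onnx>
            <models>
                <!-- name must match the deployed model; values are examples -->
                <model name="my_model">
                    <intraop-threads>4</intraop-threads>
                    <interop-threads>1</interop-threads>
                    <execution-mode>parallel</execution-mode>
                </model>
            </models>
        </onnx>
    </model-evaluation>
</container>
```

The execution-mode values correspond to the sequential and parallel modes described under Performance Considerations below.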
Model Types by Use Case
Embedding Models
For semantic search and vector similarity:
- BERT, RoBERTa, DistilBERT
- Sentence Transformers
- E5, BGE, GTE models
- ColBERT for multi-vector search
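At query time, an embedding model can be invoked with the embed function in a query parameter; the field name, embedder id, and rank profile below are illustrative assumptions.

```
yql=select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)
input.query(q)=embed(my-embedder, "how do embedders work")
ranking=semantic
```

This embeds the query text with the configured embedder component and runs an approximate nearest-neighbor search over the document embedding field.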
Ranking Models
For learning-to-rank and document scoring:
- Cross-encoders (BERT-based rerankers)
- LightGBM, XGBoost
- Custom neural ranking models
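Gradient boosting models are referenced directly in ranking expressions via dedicated rank features; the profile and file names below are placeholders.

```
rank-profile gbdt_profile {
    first-phase {
        # evaluates a LightGBM model exported as JSON,
        # stored in the application package
        expression: lightgbm("my_model.json")
    }
}
```

An XGBoost model is referenced the same way with the xgboost rank feature, e.g. `xgboost("my_model.json")`.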
Generation Models
For text generation and RAG applications:
- T5, BART for sequence-to-sequence
- GPT models for completion
- Integration with LLM APIs
See RAG Applications for examples of combining retrieval and generation.
Performance Considerations
Model Size
- Small models (< 100MB): Can be evaluated on all nodes
- Medium models (100MB - 1GB): Consider stateless evaluation
- Large models (> 1GB): Use external model servers or GPU acceleration
Execution Modes
- sequential: Single-threaded execution (default)
- parallel: Multi-threaded for batch processing
Next Steps
Text Embeddings
Configure embedder components for semantic search
ONNX Models
Deploy and configure ONNX models
Model Evaluation
Choose between stateless and ranking evaluation
RAG Applications
Build retrieval-augmented generation systems