Vespa provides comprehensive support for integrating machine learning models into your search and recommendation applications. You can deploy models for embeddings, ranking, text generation, and custom inference tasks.

Supported Model Formats

Vespa supports multiple model formats and frameworks:

ONNX Models

Deploy ONNX models for inference in ranking and stateless evaluation

LightGBM & XGBoost

Native support for gradient boosting models
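As a sketch, a trained gradient boosting model is referenced directly from a ranking expression (the file name and profile name below are illustrative; the model file itself is placed in the application package's models directory):

```
rank-profile boosted {
    first-phase {
        expression: lightgbm("my_model.json")
    }
}
```

The equivalent for XGBoost uses the xgboost() ranking feature with the exported model file.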

TensorFlow

Convert TensorFlow models to ONNX or ranking expressions

PyTorch

Export PyTorch models to ONNX for deployment

Model Integration Components

Vespa provides several built-in components for model integration:

Embedders

Embedders transform text into vector representations for semantic search and retrieval:
  • BertBaseEmbedder - BERT-based text embeddings
  • HuggingFaceEmbedder - Generic Hugging Face transformer models
  • ColBertEmbedder - Multi-vector representations for token-level matching
  • SpladeEmbedder - Sparse learned embeddings
See the Embeddings page for detailed configuration and examples.
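As a sketch, embedders are declared as components in services.xml; the component id and model file paths below are illustrative:

```xml
<container id="default" version="1.0">
  <component id="my-embedder" type="hugging-face-embedder">
    <transformer-model path="models/e5-small-v2.onnx"/>
    <tokenizer-model path="models/tokenizer.json"/>
  </component>
</container>
```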

Model Evaluation

Vespa supports two modes of model evaluation: stateless evaluation on container nodes and ranking evaluation on content nodes.

Stateless Evaluation

Models are evaluated on container nodes using the ModelsEvaluator API. This is suitable for:
  • Pre-processing and feature generation
  • Stateless inference tasks
  • REST API endpoints for model serving
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java
public FunctionEvaluator evaluatorOf(String modelName, String ... names) {
    return requireModel(modelName).evaluatorOf(names);
}
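Models evaluated statelessly are also exposed over HTTP on the container. A minimal sketch of building a request URL for Vespa's model-evaluation REST API (the endpoint, model name, and tensor literal below are illustrative):

```python
from urllib.parse import quote, urlencode

def model_eval_url(endpoint: str, model: str, inputs: dict) -> str:
    """Build a request URL for Vespa's stateless model-evaluation REST API.

    Model inputs are passed as query parameters named after the ONNX
    input names; values use Vespa's tensor literal form.
    """
    query = urlencode(inputs)
    return f"{endpoint}/model-evaluation/v1/{quote(model)}/eval?{query}"

url = model_eval_url(
    "http://localhost:8080",
    "my_model",
    {"input": "tensor<float>(d0[2]):[0.1, 0.2]"},
)
print(url)
```

GET /model-evaluation/v1/ on the container lists the models available for evaluation.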

Ranking Evaluation

Models are evaluated during document ranking on content nodes. This is optimal for:
  • First-phase and second-phase ranking
  • Per-document model inference
  • Low-latency ranking with model scoring
Learn more about the differences in Stateless Model Evaluation.
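A common pattern combines both ranking phases: a cheap first-phase expression narrows the candidate set, and the model reranks only the best hits. A sketch (the model name, rerank count, and field are illustrative):

```
rank-profile rerank_with_model {
    first-phase {
        expression: bm25(text)
    }
    second-phase {
        rerank-count: 100
        expression: onnx(my_model).output
    }
}
```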

Model Deployment Workflow

1. Export or Train Your Model

Train your model using your preferred framework (PyTorch, TensorFlow, etc.) and export to ONNX format.
import torch.onnx

# Export PyTorch model to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}}
)
2. Add Model to Application Package

Place the model file in your application package:
my-app/
├── services.xml
├── schemas/
└── models/
    └── my_model.onnx
3. Configure Model in Schema

Reference the model in your schema file:
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    
    onnx-model my_model {
        file: models/my_model.onnx
        input input_ids: input_tokens
        output output: last_hidden_state
    }
    
    rank-profile with_model {
        first-phase {
            expression: onnx(my_model).output
        }
    }
}
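Note that the input mapping above refers to a ranking expression by name: input_tokens must resolve to something the rank profile can compute, such as a function or an attribute. A sketch of one way to provide it (the attribute name is illustrative):

```
rank-profile with_model {
    function input_tokens() {
        expression: attribute(tokens)
    }
    first-phase {
        expression: onnx(my_model).output
    }
}
```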
4. Deploy and Query

Deploy your application and start using the model in queries:
vespa deploy
vespa query 'yql=select * from doc where ...' \
  'ranking=with_model'
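The same query can be issued over HTTP against the container's query API. A sketch of building such a request URL (the endpoint and YQL below are illustrative):

```python
from urllib.parse import urlencode

def search_url(endpoint: str, yql: str, rank_profile: str) -> str:
    """Build a query URL for Vespa's search API; the `ranking` parameter
    selects the rank profile so the deployed model scores each hit."""
    params = urlencode({"yql": yql, "ranking": rank_profile})
    return f"{endpoint}/search/?{params}"

url = search_url(
    "http://localhost:8080",
    "select * from doc where true",
    "with_model",
)
print(url)
```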

Model Configuration

ONNX models are configured using the onnx-model element in schema files:
onnx-model model_name {
    file: models/model.onnx
    
    # Map ONNX input names to Vespa expressions
    input input_ids: tokenInputIds
    input attention_mask: tokenAttentionMask
    
    # Map ONNX output names
    output embeddings: last_hidden_state
}

Runtime Options

Configure model execution settings in services.xml:
<container id="default" version="1.0">
  <component id="ai.vespa.modelintegration.evaluator.OnnxRuntime" 
             bundle="model-integration">
    <config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
      <execution_mode>sequential</execution_mode>
      <inter_op_threads>1</inter_op_threads>
      <!-- a negative value is interpreted as CPU count divided by its absolute value -->
      <intra_op_threads>-4</intra_op_threads>
      <gpu_device>0</gpu_device>
    </config>
  </component>
</container>

Model Types by Use Case

Embedding Models

For semantic search and vector similarity:
  • BERT, RoBERTa, DistilBERT
  • Sentence Transformers
  • E5, BGE, GTE models
  • ColBERT for multi-vector search

Ranking Models

For learning-to-rank and document scoring:
  • Cross-encoders (BERT-based rerankers)
  • LightGBM, XGBoost
  • Custom neural ranking models

Generation Models

For text generation and RAG applications:
  • T5, BART for sequence-to-sequence
  • GPT models for completion
  • Integration with LLM APIs
See RAG Applications for examples of combining retrieval and generation.

Performance Considerations

Model Size

  • Small models (< 100MB): Can be evaluated on all nodes
  • Medium models (100MB - 1GB): Consider stateless evaluation
  • Large models (> 1GB): Use external model servers or GPU acceleration
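The thresholds above can be read as a simple decision rule. A sketch mirroring the bullets (the cutoffs are rules of thumb, not hard limits):

```python
def deployment_hint(model_size_mb: float) -> str:
    """Map a model's size to the rough deployment guidance above."""
    if model_size_mb < 100:
        return "evaluate on all nodes"
    if model_size_mb <= 1024:
        return "consider stateless evaluation"
    return "use an external model server or GPU acceleration"

print(deployment_hint(50))
```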

Execution Modes

// From: model-integration/src/main/java/ai/vespa/modelintegration/evaluator/OnnxEvaluator.java
public interface OnnxEvaluator extends AutoCloseable {
    Tensor evaluate(Map<String, Tensor> inputs, String output);
    Map<String, Tensor> evaluate(Map<String, Tensor> inputs);
    
    Map<String, TensorType> getInputInfo();
    Map<String, TensorType> getOutputInfo();
}
Configure execution mode based on your workload:
  • sequential: Single-threaded execution (default)
  • parallel: Multi-threaded for batch processing

Next Steps

Text Embeddings

Configure embedder components for semantic search

ONNX Models

Deploy and configure ONNX models

Model Evaluation

Choose between stateless and ranking evaluation

RAG Applications

Build retrieval-augmented generation systems
