Vespa supports two distinct modes for evaluating machine learning models: stateless evaluation in container nodes and ranking evaluation on content nodes. Understanding when to use each approach is crucial for optimal performance.

Overview

Vespa provides flexibility in where and how models are evaluated:

  • Stateless Evaluation: models run in container nodes, independent of document ranking
  • Ranking Evaluation: models run on content nodes during document ranking

Stateless Model Evaluation

Stateless model evaluation runs in container nodes using the ModelsEvaluator component. This is ideal for:
  • Pre-processing and feature generation
  • Query embedding generation
  • Stateless prediction endpoints
  • Model serving via REST API
  • Batch inference requests

Architecture

┌─────────────────────┐
│   Client Request    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Container Node     │
│  ┌───────────────┐  │
│  │ModelsEvaluator│  │  Stateless evaluation
│  │  ONNX Model   │  │  (no document access)
│  └───────────────┘  │
└─────────────────────┘
           │
           ▼
     JSON Response

Configuration

Enable stateless model evaluation in services.xml:
<services version="1.0">
  <container id="default" version="1.0">
    <!-- Enable model evaluation -->
    <model-evaluation />
    
    <!-- Models are loaded from rank profiles -->
    <nodes>
      <node hostalias="node1" />
    </nodes>
  </container>
  
  <content id="content" version="1.0">
    <documents>
      <document type="doc" mode="index" />
    </documents>
  </content>
</services>

ModelsEvaluator API

The ModelsEvaluator provides the core stateless evaluation API:
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java:17-86
/**
 * Evaluates machine-learned models added to Vespa applications and available as config form.
 * Usage:
 * <code>Tensor result = evaluator.bind("foo", value).bind("bar", value).evaluate()</code>
 *
 * @author bratseth
 */
@Beta
public class ModelsEvaluator extends AbstractComponent {

    private final Map<String, Model> models;

    @Inject
    public ModelsEvaluator(RankProfilesConfig config,
                           RankingConstantsConfig constantsConfig,
                           RankingExpressionsConfig expressionsConfig,
                           OnnxModelsConfig onnxModelsConfig,
                           FileAcquirer fileAcquirer,
                           OnnxRuntime onnx) {
        this(new RankProfilesConfigImporter(fileAcquirer, onnx), config, constantsConfig, expressionsConfig, onnxModelsConfig);
    }

    /** Returns the models of this as an immutable map */
    public Map<String, Model> models() { return models; }

    /**
     * Returns a function which can be used to evaluate the given function in the given model
     *
     * @param modelName the name of the model
     * @param names the 0-2 name components identifying the output to compute
     * @throws IllegalArgumentException if the function or model is not present
     */
    public FunctionEvaluator evaluatorOf(String modelName, String ... names) {
        return requireModel(modelName).evaluatorOf(names);
    }
}

REST API

Access models via the stateless REST API:
// From: model-evaluation/src/main/java/ai/vespa/models/handler/ModelsEvaluationHandler.java:38-41
public static final String API_ROOT = "model-evaluation";
public static final String VERSION_V1 = "v1";
public static final String EVALUATE = "eval";

List Available Models

curl http://localhost:8080/model-evaluation/v1/
Response:
{
  "text_classifier": "http://localhost:8080/model-evaluation/v1/text_classifier",
  "embedder": "http://localhost:8080/model-evaluation/v1/embedder"
}

Get Model Information

curl http://localhost:8080/model-evaluation/v1/text_classifier
Response:
{
  "model": "text_classifier",
  "functions": [
    {
      "function": "default.output",
      "info": "http://localhost:8080/model-evaluation/v1/text_classifier/default.output",
      "eval": "http://localhost:8080/model-evaluation/v1/text_classifier/default.output/eval",
      "arguments": [
        {"name": "input", "type": "tensor<float>(d0[10])"}
      ]
    }
  ]
}

Evaluate Model

curl -X POST 'http://localhost:8080/model-evaluation/v1/text_classifier/eval' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'input=tensor<float>(d0[10]):[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]'
Response:
{
  "cells": [
    {"address": {"d0": "0"}, "value": 0.85},
    {"address": {"d0": "1"}, "value": 0.15}
  ]
}
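When scripting against this endpoint, the cell-format response above can be flattened into an ordinary list of scores; a small helper along these lines (the function name is our own) sketches this:

```python
def cells_to_list(response_json):
    """Flatten a cell-format tensor response, as returned by the
    /eval endpoint above, into a list ordered by the d0 index."""
    cells = sorted(response_json["cells"], key=lambda c: int(c["address"]["d0"]))
    return [c["value"] for c in cells]

scores = cells_to_list({
    "cells": [
        {"address": {"d0": "0"}, "value": 0.85},
        {"address": {"d0": "1"}, "value": 0.15},
    ]
})
# scores is now [0.85, 0.15]
```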

Evaluation Handler

The REST handler processes evaluation requests:
// From: model-evaluation/src/main/java/ai/vespa/models/handler/ModelsEvaluationHandler.java:90-115
private HttpResponse evaluateModel(HttpRequest request, Model model, String[] function)  {
    FunctionEvaluator evaluator = model.evaluatorOf(function);

    property(request, missingValueKey).ifPresent(missingValue -> evaluator.setMissingValue(Tensor.from(missingValue)));

    for (Map.Entry<String, TensorType> argument : evaluator.function().argumentTypes().entrySet()) {
        Optional<String> value = property(request, argument.getKey());
        if (value.isPresent()) {
            try {
                evaluator.bind(argument.getKey(), Tensor.from(argument.getValue(), value.get()));
            } catch (IllegalArgumentException e) {
                evaluator.bind(argument.getKey(), value.get());  // since we don't yet support tensors with string values
            }
        }
    }
    Tensor result = evaluator.evaluate();
    return switch (property(request, "format.tensors").orElse("short").toLowerCase(java.util.Locale.ROOT)) {
        case "short"        -> new Response(200, JsonFormat.encode(result, true,  false));
        case "long"         -> new Response(200, JsonFormat.encode(result, false, false));
        case "short-value"  -> new Response(200, JsonFormat.encode(result, true,  true));
        case "long-value"   -> new Response(200, JsonFormat.encode(result, false, true));
        case "string"       -> new Response(200, result.toString(true, true).getBytes(StandardCharsets.UTF_8));
        default             -> new ErrorResponse(400, "Unknown tensor format");
    };
}
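As the handler shows, each declared argument is bound from a same-named request property holding a tensor literal. On the client side, those literals can be built with a helper like this (a sketch; the function name is our own):

```python
def tensor_literal(values, dim="d0"):
    """Format a flat list of numbers as a dense Vespa tensor literal,
    suitable as the value of an argument property on the /eval endpoint."""
    cells = ",".join(repr(float(v)) for v in values)
    return f"tensor<float>({dim}[{len(values)}]):[{cells}]"

# Matches the 'input' argument of the earlier curl example:
payload = {"input": tensor_literal([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])}
```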

Use Cases

Generate query embeddings before search:
# Generate embedding for query
curl -X POST 'http://localhost:8080/model-evaluation/v1/embedder/eval' \
  -d 'text=machine learning tutorial'

# Use embedding in search query
vespa query 'yql=select * from doc where ...' \
  'input.query(q_embedding)=tensor<float>(x[384]):[0.1,0.2,...]'
Transform features before ranking:
import requests

# Preprocess features
response = requests.post(
    'http://localhost:8080/model-evaluation/v1/preprocessor/eval',
    data={'raw_features': 'tensor<float>(d0[5]):[1,2,3,4,5]'}
)

processed = response.json()  # tensor in the cell format shown above

# Render the cells as the tensor literal expected by input.query
values = ','.join(str(c['value']) for c in processed['cells'])
tensor_arg = f"tensor<float>(d0[{len(processed['cells'])}]):[{values}]"

# Use in query
query_params = {
    'yql': 'select * from doc where ...',
    'input.query(features)': tensor_arg
}
Process multiple inputs efficiently:
import requests
import concurrent.futures

def evaluate_item(item):
    return requests.post(
        'http://localhost:8080/model-evaluation/v1/classifier/eval',
        data={'input': item}
    ).json()

# 100 valid 10-dimensional input tensors (each filled with the item index)
items = [f"tensor<float>(d0[10]):[{','.join(str(i) for _ in range(10))}]" for i in range(100)]

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(evaluate_item, items))

Ranking Model Evaluation

Ranking evaluation runs models on content nodes during document ranking. This is optimal for:
  • First-phase and second-phase ranking
  • Per-document model inference
  • Accessing document attributes and features
  • Low-latency ranking with model scoring

Architecture

┌─────────────────────┐
│  Container Node     │
│  (Query Processing) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Content Node      │
│  ┌───────────────┐  │
│  │  Documents    │  │  Ranking evaluation
│  │  + Model      │  │  (per-document scoring)
│  └───────────────┘  │
└─────────────────────┘
           │
           ▼
    Ranked Results

Configuration

Define models in schemas for ranking:
schema product {
    document product {
        field title type string {}
        field price type float {}
    }
    
    onnx-model ranker {
        file: models/ranker.onnx
        input features: rankingFeatures
    }
    
    rank-profile ml_ranking {
        function rankingFeatures() {
            expression: tensor<float>(d0[10]):[
                attribute(price),
                fieldMatch(title).completeness,
                attribute(popularity),
                freshness(timestamp),
                query(user_score),
                ...
            ]
        }
        
        first-phase {
            expression: bm25(title)
        }
        
        second-phase {
            expression: onnx(ranker).output
            rerank-count: 100
        }
    }
}

Use Cases

Use BERT-based cross-encoders for reranking:
onnx-model cross_encoder {
    file: models/cross_encoder.onnx
    input input_ids: inputTokens
    input attention_mask: inputMask
}

rank-profile rerank {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    
    second-phase {
        expression: onnx(cross_encoder).logits{d0:0,d1:0}
        rerank-count: 100
    }
}
Combine multiple features in ranking models:
rank-profile ml_features {
    function features() {
        expression: tensor<float>(d0[20]):[
            bm25(title),
            bm25(body),
            attribute(pagerank),
            fieldMatch(title).proximity,
            query(user_affinity),
            ...
        ]
    }
    
    first-phase {
        expression: onnx(ranking_model).score
    }
}

Comparison

Aspect            Stateless Evaluation         Ranking Evaluation
Location          Container nodes              Content nodes
Document Access   No                           Yes (attributes, features)
Use Case          Preprocessing, embeddings    Document ranking
API               REST, Java API               Rank profiles
Latency           Independent of corpus        Per-document evaluation
Scalability       Scale containers             Scale content nodes
Caching           Application-level            Per-query

When to Use Each

Use Stateless Evaluation When:

  • You need to generate embeddings or features before search
  • Model inputs don’t depend on document content
  • You want to expose model predictions via REST API
  • Processing is independent of corpus size
  • You need batch prediction capabilities

Use Ranking Evaluation When:

  • The model needs access to document attributes
  • You are scoring documents during search
  • You are implementing learning-to-rank
  • You use cross-encoders for reranking
  • You combine model scores with other ranking features

Hybrid Approaches

Combine both approaches for optimal performance:
schema hybrid {
    document hybrid {
        field text type string {}
        field embedding type tensor<float>(x[384]) {
            indexing: input text | embed | attribute
            attribute {
                distance-metric: angular
            }
        }
    }
    
    onnx-model reranker {
        file: models/reranker.onnx
    }
    
    rank-profile hybrid_search {
        inputs {
            query(q_embedding) tensor<float>(x[384])
        }
        
        # First: vector similarity (uses stateless-generated embeddings)
        first-phase {
            expression: closeness(field, embedding)
        }
        
        # Second: cross-encoder reranking (uses ranking evaluation)
        second-phase {
            expression: onnx(reranker).score
            rerank-count: 100
        }
    }
}
Query workflow:
import requests

# 1. Generate query embedding (stateless)
response = requests.post(
    'http://localhost:8080/model-evaluation/v1/embedder/eval',
    data={'text': 'search query'}
)
query_embedding = response.json()

# 2. Search with embedding and reranking (ranking evaluation)
results = requests.post(
    'http://localhost:8080/search/',
    json={
        'yql': 'select * from hybrid where ...',
        'ranking': 'hybrid_search',
        'input.query(q_embedding)': query_embedding
    }
)
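Note that the embedder's response arrives as tensor JSON rather than the literal string form of a query tensor; assuming the cell format shown earlier on this page, a conversion helper (illustrative, our own name) could look like:

```python
def to_query_tensor(cells_json, dim="x"):
    """Render a cell-format tensor response as a tensor literal string
    for an input.query(...) parameter, ordered by the dim index."""
    cells = sorted(cells_json["cells"], key=lambda c: int(c["address"][dim]))
    values = ",".join(repr(c["value"]) for c in cells)
    return f"tensor<float>({dim}[{len(cells)}]):[{values}]"
```

The `query_embedding` value above would then be passed as `to_query_tensor(response.json())`.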

Performance Considerations

Stateless Evaluation

  • Caching: Implement application-level caching for repeated inputs
  • Batching: Process multiple requests together when possible
  • Container Scaling: Add container nodes to handle more traffic

Ranking Evaluation

  • Rerank Count: Limit second-phase evaluation with rerank-count
  • Content Scaling: Add content nodes to distribute ranking load
  • Model Size: Keep models small for per-document evaluation
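Since rerank-count applies per content node, second-phase model cost grows with both that setting and the node count; a back-of-envelope bound (illustrative arithmetic):

```python
def max_second_phase_evals(rerank_count, content_nodes):
    """Upper bound on second-phase model evaluations per query:
    each content node reranks up to rerank-count of its local hits."""
    return rerank_count * content_nodes

# e.g. rerank-count 100 on 10 content nodes: at most 1000 evaluations per query
```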

Next Steps

  • ONNX Models: deploy ONNX models in Vespa
  • Embeddings: configure embedding models
  • RAG Applications: build retrieval-augmented generation
  • Ranking: advanced ranking strategies
