Vespa provides native support for ONNX (Open Neural Network Exchange) models, enabling you to deploy machine learning models from PyTorch, TensorFlow, scikit-learn, and other frameworks.

Overview

ONNX models can be used for:
  • Ranking - Score documents during search
  • Embeddings - Generate vector representations
  • Feature extraction - Transform data for downstream tasks
  • Stateless inference - Serve predictions via REST API
Vespa evaluates ONNX models using ONNX Runtime, providing high-performance inference on CPU and GPU.

Adding ONNX Models

1. Export Your Model to ONNX

Convert your trained model to ONNX format:
import torch
import torch.onnx

# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "my_model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    },
    opset_version=14
)

2. Add Model to Application Package

Place the ONNX file in your application’s models/ directory:
my-app/
├── services.xml
├── schemas/
│   └── doc.sd
└── models/
    └── my_model.onnx

3. Declare Model in Schema

Reference the model in your schema file:
schema doc {
    onnx-model my_model {
        file: models/my_model.onnx
        input input: my_input_expression
        output output: my_output
    }
}

4. Use Model in Ranking

Reference the model in rank profiles:
rank-profile with_onnx {
    function my_input_expression() {
        expression: tensor<float>(d0[10]):[1,2,3,4,5,6,7,8,9,10]
    }
    
    first-phase {
        expression: onnx(my_model).output
    }
}
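
The constant tensor above is only a placeholder; in practice the model input is often supplied at query time. A sketch of that pattern (the query tensor name user_features is an assumption, not from this guide):

```
rank-profile with_onnx_query {
    # Declare a query-time tensor input
    inputs {
        query(user_features) tensor<float>(d0[10])
    }

    # The function mapped to the ONNX input simply forwards the query tensor
    function my_input_expression() {
        expression: query(user_features)
    }

    first-phase {
        expression: onnx(my_model).output
    }
}
```

The tensor is then passed in the request as input.query(user_features).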

Model Configuration

Basic Declaration

Declare an ONNX model in your schema:
onnx-model classifier {
    file: models/classifier.onnx
}

Input Mapping

Map ONNX input names to Vespa expressions:
onnx-model scorer {
    file: models/scorer.onnx
    
    # Map ONNX inputs to Vespa features
    input input_ids: tokenSequence
    input attention_mask: tokenMask
    input segment_ids: tokenTypes
}
// From: config-model/src/main/java/com/yahoo/schema/OnnxModel.java:57-84
private String validateInputSource(String source) {
    var optRef = Reference.simple(source);
    if (optRef.isPresent()) {
        Reference ref = optRef.get();
        // input can be one of:
        // attribute(foo), query(foo), constant(foo)
        if (FeatureNames.isSimpleFeature(ref)) {
            return ref.toString();
        }
        // or a function (evaluated by backend)
        if (ref.isSimpleRankingExpressionWrapper()) {
            var arg = ref.simpleArgument();
            if (arg.isPresent()) {
                return ref.toString();
            }
        }
    } else {
        // otherwise it must be an identifier
        Reference ref = Reference.fromIdentifier(source);
        return ref.toString();
    }
    // invalid input source
    throw new IllegalArgumentException("invalid input for ONNX model " + getName() + ": " + source);
}
Valid input sources:
  • attribute(field_name) - Document attribute
  • query(param_name) - Query parameter
  • constant(const_name) - Ranking constant
  • Function names defined in the rank profile
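Concretely, each kind of source looks like this in a model declaration (a sketch with placeholder field and constant names):

```
onnx-model example {
    file: models/example.onnx
    input doc_vec: attribute(embedding)       # document attribute
    input query_vec: query(q_embedding)       # query parameter
    input weights: constant(model_weights)    # ranking constant
    input features: featureVector             # function in the rank profile
}
```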

Output Mapping

Map ONNX output names to Vespa identifiers:
onnx-model encoder {
    file: models/encoder.onnx
    output embeddings: last_hidden_state
    output pooled: pooler_output
}
Reference outputs in ranking:
rank-profile semantic {
    first-phase {
        expression: onnx(encoder).embeddings
    }
}

ONNX Runtime Configuration

Configure ONNX Runtime execution in services.xml:
<container id="default" version="1.0">
  <component id="ai.vespa.modelintegration.evaluator.OnnxRuntime" 
             bundle="model-integration">
    <config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
      <!-- Execution mode: sequential or parallel -->
      <executionMode>sequential</executionMode>
      
      <!-- Number of threads for parallel execution -->
      <interOpThreads>1</interOpThreads>
      
      <!-- Intra-op threads: -4 means CPUs/4, 0 means CPUs, >0 is explicit count -->
      <intraOpThreads>-4</intraOpThreads>
      
      <!-- GPU device: 0+ for GPU device ID, -1 for CPU -->
      <gpuDevice>-1</gpuDevice>
    </config>
  </component>
</container>

Execution Modes

Sequential mode evaluates the model graph single-threaded, which typically gives the lowest per-query latency; parallel mode lets ONNX Runtime run independent graph branches concurrently using interOpThreads threads:
<executionMode>sequential</executionMode>
<interOpThreads>1</interOpThreads>

GPU Acceleration

Enable GPU inference with CUDA:
<gpuDevice>0</gpuDevice>  <!-- Use first GPU -->
GPU support requires ONNX Runtime with CUDA provider. Ensure your deployment environment has compatible CUDA drivers.

Using ONNX Models

In Ranking Expressions

Reference ONNX models in rank profiles:
schema product {
    document product {
        field title type string {}
        field price type float {}
        field category type string {}
    }
    
    onnx-model ranker {
        file: models/ranker.onnx
        input features: featureVector
    }
    
    rank-profile ml_ranking {
        function featureVector() {
            expression: tensor<float>(d0[5]):[
                attribute(price),
                query(user_score),
                fieldMatch(title).completeness,
                attribute(popularity),
                freshness(timestamp)
            ]
        }
        
        first-phase {
            expression: onnx(ranker).output
        }
    }
}

With Multiple Outputs

Access specific model outputs:
onnx-model multi_output {
    file: models/multi.onnx
    output scores: output_scores
    output embeddings: output_embeddings
}

rank-profile combined {
    first-phase {
        expression: onnx(multi_output).scores
    }
    
    second-phase {
        expression: sum(onnx(multi_output).embeddings * query(q_vec))
    }
}

Stateless Evaluation API

Use the ModelsEvaluator API for stateless inference:
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java:17-24
/**
 * Evaluates machine-learned models added to Vespa applications and available as config form.
 * Usage:
 * <code>Tensor result = evaluator.bind("foo", value).bind("bar", value).evaluate()</code>
 *
 * @author bratseth
 */
public class ModelsEvaluator extends AbstractComponent {
    public FunctionEvaluator evaluatorOf(String modelName, String ... names) {
        return requireModel(modelName).evaluatorOf(names);
    }
}
Access via REST API:
curl 'http://localhost:8080/model-evaluation/v1/my_model/eval' \
  -d '{"input": [1.0, 2.0, 3.0, 4.0, 5.0]}'
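
The same call can be made from Python using only the standard library (the host, model name, and payload are placeholders matching the curl example above):

```python
import json
from urllib import request

def eval_request(base_url: str, model_name: str, inputs: dict) -> request.Request:
    """Build a POST request against the stateless /model-evaluation/v1 endpoint."""
    url = f"{base_url}/model-evaluation/v1/{model_name}/eval"
    body = json.dumps(inputs).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = eval_request("http://localhost:8080", "my_model",
                   {"input": [1.0, 2.0, 3.0, 4.0, 5.0]})
# request.urlopen(req) would send it against a running container
print(req.full_url)
```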

Model Optimization

Model Quantization

Reduce model size and improve performance with quantization:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QUInt8
)
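
Dynamic quantization stores weights as 8-bit integers, so a float32 model typically shrinks to roughly a quarter of its original size on disk. A small stdlib helper to confirm the effect (file paths are placeholders):

```python
import os

def quantization_savings(original_path: str, quantized_path: str) -> float:
    """Return the fraction of on-disk size saved by quantization."""
    before = os.path.getsize(original_path)
    after = os.path.getsize(quantized_path)
    return 1 - after / before
```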

Model Simplification

Simplify ONNX graphs:
import onnx
from onnxsim import simplify

# Load and simplify model
model = onnx.load("model.onnx")
model_simplified, check = simplify(model)
assert check, "Simplified model is invalid"

onnx.save(model_simplified, "model_simplified.onnx")

Dynamic Shapes

Support variable batch sizes and sequence lengths:
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'output': {0: 'batch', 1: 'sequence'}
    }
)

OnnxEvaluator Interface

The core evaluation interface:
// From: model-integration/src/main/java/ai/vespa/modelintegration/evaluator/OnnxEvaluator.java:10-29
/**
 * Evaluator for ONNX models.
 *
 * @author bjorncs
 */
public interface OnnxEvaluator extends AutoCloseable {

    record IdAndType(String id, TensorType type) { }

    Tensor evaluate(Map<String, Tensor> inputs, String output);
    Map<String, Tensor> evaluate(Map<String, Tensor> inputs);

    Map<String, OnnxEvaluator.IdAndType> getInputs();
    Map<String, OnnxEvaluator.IdAndType> getOutputs();
    Map<String, TensorType> getInputInfo();
    Map<String, TensorType> getOutputInfo();

    @Override void close();
}

Common Model Types

Classification Models

onnx-model classifier {
    file: models/classifier.onnx
    input features: featureVector
    output logits: output
}

rank-profile classify {
    function featureVector() {
        expression: tensor<float>(d0[100]):[...]
    }
    
    first-phase {
        expression: onnx(classifier).logits
    }
}

Reranking Models

onnx-model cross_encoder {
    file: models/cross_encoder.onnx
    input input_ids: inputSequence
    input attention_mask: inputMask
}

rank-profile rerank {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    
    second-phase {
        # Read the single relevance score at cell (d0:0, d1:0) of the logits output
        expression: onnx(cross_encoder).logits{d0:0,d1:0}
        rerank-count: 100
    }
}

Embedding Models

See the Embeddings page for embedding-specific models.

Troubleshooting

Model Validation

Vespa validates models at deployment:
vespa deploy
# Check for errors like:
# "Model does not contain required input: 'input_ids'"
# "Model contains: input_tokens, attention_scores"

Inspect Model Inputs/Outputs

Use onnx Python package:
import onnx

model = onnx.load("model.onnx")

print("Inputs:")
for input in model.graph.input:
    print(f"  {input.name}: {input.type}")

print("Outputs:")
for output in model.graph.output:
    print(f"  {output.name}: {output.type}")

Performance Issues

  • Reduce model size with quantization (int8/uint8)
  • Use dynamic batching to improve throughput
  • Enable GPU acceleration for large models
  • Tune the intra-op thread count
  • Limit the number of concurrent evaluations
  • Monitor model size against available RAM
If results are incorrect rather than slow:
  • Verify input tensor shapes and types
  • Check input/output name mappings
  • Validate that preprocessing matches training
  • Test the model with onnxruntime directly

Examples

TensorFlow to ONNX

import tensorflow as tf
import tf2onnx

# Load TensorFlow model
model = tf.keras.models.load_model('model.h5')

# Convert to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
output_path = "model.onnx"

model_proto, _ = tf2onnx.convert.from_keras(
    model, 
    input_signature=spec,
    opset=14,
    output_path=output_path
)

scikit-learn to ONNX

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

# Train sklearn model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Convert to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

Next Steps

  • Embeddings - Use ONNX models for text embeddings
  • Model Evaluation - Stateless vs ranking evaluation
  • RAG Applications - Combine models with retrieval
  • Performance Tuning - Optimize model inference
