Vespa provides native support for ONNX (Open Neural Network Exchange) models, enabling you to deploy machine learning models from PyTorch, TensorFlow, scikit-learn, and other frameworks.
Overview
ONNX models can be used for:

- Ranking - Score documents during search
- Embeddings - Generate vector representations
- Feature extraction - Transform data for downstream tasks
- Stateless inference - Serve predictions via REST API
Vespa evaluates ONNX models using ONNX Runtime, providing high-performance inference on CPU and GPU.
Adding ONNX Models
Export Your Model to ONNX
Convert your trained model to ONNX format:

```python
import torch
import torch.onnx

# Load your PyTorch model
model = MyModel()
model.load_state_dict(torch.load('model.pt'))
model.eval()

# Create dummy input matching the model's expected shape
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "my_model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    },
    opset_version=14
)
```
Add Model to Application Package
Place the ONNX file in your application’s models/ directory:

```
my-app/
├── services.xml
├── schemas/
│   └── doc.sd
└── models/
    └── my_model.onnx
```
Declare Model in Schema
Reference the model in your schema file:

```
schema doc {
    onnx-model my_model {
        file: models/my_model.onnx
        input input: my_input_expression
        output output: my_output
    }
}
```
Use Model in Ranking
Reference the model in rank profiles (the output is addressed by the Vespa-side name given in the schema's output mapping):

```
rank-profile with_onnx {
    function my_input_expression() {
        expression: tensor<float>(d0[10]):[1,2,3,4,5,6,7,8,9,10]
    }
    first-phase {
        expression: onnx(my_model).my_output
    }
}
```
Model Configuration
Basic Declaration
Declare an ONNX model in your schema:

```
onnx-model classifier {
    file: models/classifier.onnx
}
```
Input Mapping

Map ONNX input names to Vespa expressions:

```
onnx-model scorer {
    file: models/scorer.onnx
    # Map ONNX inputs to Vespa features
    input input_ids: tokenSequence
    input attention_mask: tokenMask
    input segment_ids: tokenTypes
}
```
```java
// From: config-model/src/main/java/com/yahoo/schema/OnnxModel.java:57-84
private String validateInputSource(String source) {
    var optRef = Reference.simple(source);
    if (optRef.isPresent()) {
        Reference ref = optRef.get();
        // input can be one of:
        // attribute(foo), query(foo), constant(foo)
        if (FeatureNames.isSimpleFeature(ref)) {
            return ref.toString();
        }
        // or a function (evaluated by backend)
        if (ref.isSimpleRankingExpressionWrapper()) {
            var arg = ref.simpleArgument();
            if (arg.isPresent()) {
                return ref.toString();
            }
        }
    } else {
        // otherwise it must be an identifier
        Reference ref = Reference.fromIdentifier(source);
        return ref.toString();
    }
    // invalid input source
    throw new IllegalArgumentException("invalid input for ONNX model " + getName() + ": " + source);
}
```
Valid input sources:

- attribute(field_name) - Document attribute
- query(param_name) - Query parameter
- constant(const_name) - Ranking constant
- Function names defined in the rank profile
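The rules above can be sketched as a small classifier. This is a hedged Python approximation for illustration only, not Vespa's actual implementation (which is the quoted `validateInputSource` Java method):

```python
import re

# Simple-feature wrappers accepted as ONNX input sources (from the list above)
SIMPLE_FEATURES = {"attribute", "query", "constant"}

def classify_input_source(source: str) -> str:
    # Hypothetical helper mirroring the validation rules listed above.
    m = re.fullmatch(r"([A-Za-z_]\w*)\(([A-Za-z_][\w.]*)\)", source)
    if m:
        if m.group(1) in SIMPLE_FEATURES:
            return "simple feature"      # attribute(foo), query(foo), constant(foo)
        return "function wrapper"        # a wrapped ranking-expression function
    if re.fullmatch(r"[A-Za-z_]\w*", source):
        return "identifier"              # a function name in the rank profile
    raise ValueError(f"invalid input source: {source}")
```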
Output Mapping
Map ONNX output names to Vespa identifiers (ONNX name first, Vespa name second):

```
onnx-model encoder {
    file: models/encoder.onnx
    output last_hidden_state: embeddings
    output pooler_output: pooled
}
```
Reference outputs in ranking:

```
rank-profile semantic {
    first-phase {
        expression: onnx(encoder).embeddings
    }
}
```
ONNX Runtime Configuration
Configure ONNX Runtime execution in services.xml:

```xml
<container id="default" version="1.0">
  <component id="ai.vespa.modelintegration.evaluator.OnnxRuntime"
             bundle="model-integration">
    <config name="ai.vespa.modelintegration.evaluator.onnx-evaluator">
      <!-- Execution mode: sequential or parallel -->
      <executionMode>sequential</executionMode>
      <!-- Number of threads for parallel execution -->
      <interOpThreads>1</interOpThreads>
      <!-- Intra-op threads: -4 means CPUs/4, 0 means CPUs, >0 is explicit count -->
      <intraOpThreads>-4</intraOpThreads>
      <!-- GPU device: 0+ for GPU device ID, -1 for CPU -->
      <gpuDevice>-1</gpuDevice>
    </config>
  </component>
</container>
```
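The intraOpThreads comment encodes a small rule. Here is a sketch of how such a setting could resolve to a thread count, based on my reading of that comment rather than Vespa's actual code:

```python
def resolve_intra_op_threads(setting: int, cpus: int) -> int:
    # Per the config comment: >0 is an explicit count, 0 means all CPUs,
    # and a negative value -n means CPUs divided by n (at least 1 thread).
    if setting > 0:
        return setting
    if setting == 0:
        return cpus
    return max(1, cpus // -setting)
```

With the default of -4 on a 16-CPU host, this yields 4 intra-op threads.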
Execution Modes
Sequential (Default)

Single-threaded execution, best for low-latency inference:

```xml
<executionMode>sequential</executionMode>
<interOpThreads>1</interOpThreads>
```

Parallel

Multi-threaded execution for batch processing:

```xml
<executionMode>parallel</executionMode>
<interOpThreads>4</interOpThreads>
<intraOpThreads>2</intraOpThreads>
```
GPU Acceleration
Enable GPU inference with CUDA:

```xml
<gpuDevice>0</gpuDevice> <!-- Use first GPU -->
```

GPU support requires ONNX Runtime with the CUDA provider. Ensure your deployment environment has compatible CUDA drivers.
Using ONNX Models
In Ranking Expressions
Reference ONNX models in rank profiles (fields used as attribute(...) features need attribute indexing):

```
schema product {
    document product {
        field title type string {
            indexing: index
        }
        field price type float {
            indexing: attribute
        }
        field category type string {}
        field popularity type float {
            indexing: attribute
        }
        field timestamp type long {
            indexing: attribute
        }
    }
    onnx-model ranker {
        file: models/ranker.onnx
        input features: featureVector
    }
    rank-profile ml_ranking {
        function featureVector() {
            expression: tensor<float>(d0[5]):[
                attribute(price),
                query(user_score),
                fieldMatch(title).completeness,
                attribute(popularity),
                freshness(timestamp)
            ]
        }
        first-phase {
            expression: onnx(ranker).output
        }
    }
}
```
With Multiple Outputs
Access specific model outputs:

```
onnx-model multi_output {
    file: models/multi.onnx
    output output_scores: scores
    output output_embeddings: embeddings
}

rank-profile combined {
    first-phase {
        expression: onnx(multi_output).scores
    }
    second-phase {
        expression: sum(onnx(multi_output).embeddings * query(q_vec))
    }
}
```
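The second-phase expression above is a plain dot product over a shared dimension. In Python terms (illustration only; Vespa evaluates this natively over tensors):

```python
def dot(embeddings, q_vec):
    # sum(onnx(multi_output).embeddings * query(q_vec)) for 1-d vectors:
    # elementwise multiply, then sum over the shared dimension.
    assert len(embeddings) == len(q_vec)
    return sum(e * q for e, q in zip(embeddings, q_vec))
```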
Stateless Evaluation API
Use the ModelsEvaluator API for stateless inference:

```java
// From: model-evaluation/src/main/java/ai/vespa/models/evaluation/ModelsEvaluator.java:17-24
/**
 * Evaluates machine-learned models added to Vespa applications and available as config form.
 * Usage:
 * <code>Tensor result = evaluator.bind("foo", value).bind("bar", value).evaluate()</code>
 *
 * @author bratseth
 */
public class ModelsEvaluator extends AbstractComponent {

    public FunctionEvaluator evaluatorOf(String modelName, String... names) {
        return requireModel(modelName).evaluatorOf(names);
    }
}
```
Access via REST API:

```shell
curl 'http://localhost:8080/model-evaluation/v1/my_model/eval' \
  -d '{"input": [1.0, 2.0, 3.0, 4.0, 5.0]}'
```
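A minimal Python client for this endpoint might look like the following. The helper names and response handling are assumptions; only the /model-evaluation/v1/&lt;model&gt;/eval path comes from the curl example above:

```python
import json
import urllib.request

def eval_url(host: str, model_name: str) -> str:
    # Builds the stateless-evaluation URL used in the curl example above
    return f"{host}/model-evaluation/v1/{model_name}/eval"

def evaluate(host: str, model_name: str, inputs: dict) -> dict:
    # Hypothetical helper: POST the inputs as JSON, decode the JSON reply
    req = urllib.request.Request(
        eval_url(host, model_name),
        data=json.dumps(inputs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```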
Model Optimization
Model Quantization
Reduce model size and improve performance with quantization:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QUInt8
)
```
Model Simplification
Simplify ONNX graphs:

```python
import onnx
from onnxsim import simplify

# Load and simplify model
model = onnx.load("model.onnx")
model_simplified, check = simplify(model)
assert check, "Simplified model is invalid"
onnx.save(model_simplified, "model_simplified.onnx")
```
Dynamic Shapes
Support variable batch sizes and sequence lengths:

```python
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'output': {0: 'batch', 1: 'sequence'}
    }
)
```
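Dynamic axes let one model accept varying batch and sequence sizes, but each batch tensor must still be rectangular. A common pre-processing step (a sketch; the pad token id of 0 is an assumption) pads every sequence to the longest in the batch:

```python
def pad_batch(sequences, pad_id=0):
    # Pad variable-length token sequences to a rectangular batch so they
    # fit tensors with dynamic 'batch' and 'sequence' axes; the attention
    # mask marks real tokens (1) vs padding (0).
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask
```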
OnnxEvaluator Interface
The core evaluation interface:

```java
// From: model-integration/src/main/java/ai/vespa/modelintegration/evaluator/OnnxEvaluator.java:10-29
/**
 * Evaluator for ONNX models.
 *
 * @author bjorncs
 */
public interface OnnxEvaluator extends AutoCloseable {

    record IdAndType(String id, TensorType type) {}

    Tensor evaluate(Map<String, Tensor> inputs, String output);
    Map<String, Tensor> evaluate(Map<String, Tensor> inputs);
    Map<String, OnnxEvaluator.IdAndType> getInputs();
    Map<String, OnnxEvaluator.IdAndType> getOutputs();
    Map<String, TensorType> getInputInfo();
    Map<String, TensorType> getOutputInfo();

    @Override void close();
}
```
Common Model Types
Classification Models
```
onnx-model classifier {
    file: models/classifier.onnx
    input features: featureVector
    output output: logits
}

rank-profile classify {
    function featureVector() {
        expression: tensor<float>(d0[100]):[...]
    }
    first-phase {
        expression: onnx(classifier).logits
    }
}
```
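Classifier outputs are raw logits. When ranking needs calibrated probabilities, a softmax can be applied; shown here in Python for illustration, whereas inside Vespa you would express the same transformation in the ranking expression itself:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a classifier's raw logits:
    # subtract the max before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```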
Reranking Models
```
onnx-model cross_encoder {
    file: models/cross_encoder.onnx
    input input_ids: inputSequence
    input attention_mask: inputMask
}

rank-profile rerank {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    second-phase {
        expression: onnx(cross_encoder).logits{d0:0,d1:0}
        rerank-count: 100
    }
}
```
Embedding Models
See the Embeddings page for embedding-specific models.
Troubleshooting
Model Validation
Vespa validates models at deployment:
```shell
vespa deploy
# Check for errors like:
# "Model does not contain required input: 'input_ids'"
# "Model contains: input_tokens, attention_scores"
```
Use the onnx Python package to inspect a model's inputs and outputs:

```python
import onnx

model = onnx.load("model.onnx")
print("Inputs:")
for input in model.graph.input:
    print(f"{input.name}: {input.type}")
print("Outputs:")
for output in model.graph.output:
    print(f"{output.name}: {output.type}")
```
Performance:

- Reduce model size through quantization
- Use dynamic batching for throughput
- Enable GPU acceleration
- Optimize intra-op thread count

Memory:

- Use model quantization (int8, uint8)
- Limit number of concurrent evaluations
- Monitor model size vs available RAM

Debugging:

- Verify input tensor shapes and types
- Check input/output name mappings
- Validate preprocessing matches training
- Test model with onnxruntime directly
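The deployment error shown earlier ("Model does not contain required input") can be reproduced before deploying. This is a hedged sketch of such a pre-deployment check; the helper name is an assumption, and the input names would come from inspecting the model with the onnx package as shown above:

```python
def check_io_names(required_inputs, model_inputs):
    # Compare the input names your schema maps against the names the ONNX
    # graph actually declares, mirroring Vespa's deployment error message.
    missing = sorted(set(required_inputs) - set(model_inputs))
    if missing:
        return (f"Model does not contain required input(s): {missing}; "
                f"model contains: {sorted(model_inputs)}")
    return None
```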
Examples
TensorFlow to ONNX
```python
import tensorflow as tf
import tf2onnx

# Load TensorFlow model
model = tf.keras.models.load_model('model.h5')

# Convert to ONNX
spec = (tf.TensorSpec((None, 10), tf.float32, name="input"),)
output_path = "model.onnx"
model_proto, _ = tf2onnx.convert.from_keras(
    model,
    input_signature=spec,
    opset=14,
    output_path=output_path
)
```
scikit-learn to ONNX
```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

# Train sklearn model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Convert to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```
Next Steps
- Embeddings - Use ONNX models for text embeddings
- Model Evaluation - Stateless vs ranking evaluation
- RAG Applications - Combine models with retrieval
- Performance Tuning - Optimize model inference