Deprecation Notice: ONNX Runtime Server has been deprecated and is no longer actively maintained. For production deployments, consider alternatives like:
- Triton Inference Server with ONNX Runtime backend
- Custom REST APIs using ONNX Runtime SDKs
- Cloud-native solutions (Azure ML, AWS SageMaker, etc.)
Overview
ONNX Runtime Server provided an easy way to start an inferencing server with both HTTP and GRPC endpoints. While deprecated, this documentation is maintained for reference.
Building ONNX Runtime Server
Prerequisites
- golang
- grpc
- re2
- cmake
- gcc and g++
- ONNX Runtime C API binaries from GitHub releases
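On Ubuntu, most of the prerequisites above can be installed from packages; the package names below are assumptions for recent Ubuntu releases, and gRPC was typically built from source to get a recent enough version:

```shell
# Assumed Ubuntu package names -- verify against your distribution.
sudo apt-get update
sudo apt-get install -y golang cmake gcc g++ libre2-dev

# gRPC is commonly built from source rather than installed as a package:
git clone --recurse-submodules https://github.com/grpc/grpc
```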
Build Instructions (Linux)
With rsyslog Support
Using Build Script
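The server was built through the main onnxruntime build script. A typical invocation looked like the following; the `--build_server` flag reflects the era when the server lived in the main repository, so confirm it against your checkout:

```shell
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
# --build_server is the historical flag; run ./build.sh --help to confirm.
./build.sh --config RelWithDebInfo --build_server --use_openmp --parallel
```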
Starting the Server
Basic Usage
Command Line Options
Example
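A minimal invocation looked like this; the option names (`--http_port`, `--grpc_port`, `--log_level`) are the remembered historical CLI and should be checked with `--help`:

```shell
./onnxruntime_server --model_path /path/to/model.onnx \
    --http_port 8001 \
    --grpc_port 50051 \
    --log_level info
```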
HTTP Endpoint
Prediction URL Format
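The prediction URL followed a TensorFlow Serving-style pattern (model name and version were accepted for forward compatibility):

```
POST http://<host>:<http_port>/v1/models/<model_name>/versions/<version>:predict
```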
Request and Response Format
Requests and responses use Protocol Buffers format. The protobuf definition is available in server/protobuf/predict.proto.
Content Types
Request Headers
The Content-Type header is required:
- application/json - JSON format (UTF-8)
- application/vnd.google.protobuf - Binary protobuf
- application/x-protobuf - Binary protobuf
- application/octet-stream - Binary protobuf
Response Format
Set the Accept header to control the response format:
- Same options as Content-Type
- Defaults to the request content type if not specified
Making HTTP Requests
Using cURL (JSON)
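A JSON request can be sketched as follows; the input name "Input3" and the tensor field encoding are illustrative and must be matched to your model and to predict.proto:

```shell
curl -X POST http://127.0.0.1:8001/v1/models/mymodel/versions/1:predict \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"inputs": {"Input3": {"dims": ["1", "1", "5"], "dataType": 1, "floatData": [0, 0, 0, 0, 0]}}}'
```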
Using cURL (Binary)
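For binary protobuf, the request body is a serialized PredictRequest message (here assumed to be saved as request.pb):

```shell
curl -X POST http://127.0.0.1:8001/v1/models/mymodel/versions/1:predict \
  -H "Content-Type: application/x-protobuf" \
  -H "Accept: application/x-protobuf" \
  --data-binary @request.pb \
  -o response.pb
```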
Using Python
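A Python client needs nothing beyond the standard library. The JSON field names below (inputs, outputFilter, dims, dataType, floatData) are the assumed JSON mapping of the predict.proto messages; verify them against the proto file:

```python
import json
import urllib.request

def build_predict_request(input_name, dims, values):
    """Build a JSON PredictRequest body (assumed onnx-ml tensor encoding)."""
    return {
        "inputs": {
            input_name: {
                "dims": [str(d) for d in dims],  # int64 dims serialize as strings
                "dataType": 1,                   # 1 = FLOAT in onnx-ml TensorProto
                "floatData": values,
            }
        },
        "outputFilter": [],
    }

body = json.dumps(build_predict_request("Input3", [1, 1, 5], [0.0] * 5)).encode()
req = urllib.request.Request(
    "http://127.0.0.1:8001/v1/models/mymodel/versions/1:predict",
    data=body,
    headers={"Content-Type": "application/json", "Accept": "application/json"},
    method="POST",
)
# With a running server:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read()))
```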
GRPC Endpoint
Protobuf Definition
The GRPC service definition is available in server/protobuf/prediction_service.proto.
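The service amounted to a single unary RPC; the sketch below is from memory, so treat the proto file itself as authoritative:

```protobuf
// Sketch -- consult server/protobuf/prediction_service.proto for
// the authoritative definition.
service PredictionService {
  rpc Predict(PredictRequest) returns (PredictResponse) {}
}
```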
Python GRPC Client
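A client sketch, assuming stubs have been generated from the proto files with grpcio-tools (module, service, and field names below are assumptions based on the protos and should be checked against the generated code):

```python
# Generate stubs first, e.g.:
#   python -m grpc_tools.protoc -I server/protobuf \
#       --python_out=. --grpc_python_out=. \
#       server/protobuf/predict.proto server/protobuf/prediction_service.proto
import grpc
import predict_pb2
import prediction_service_pb2_grpc

channel = grpc.insecure_channel("127.0.0.1:50051")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
tensor = request.inputs["Input3"]   # hypothetical model input name
tensor.dims.extend([1, 1, 5])
tensor.data_type = 1                # 1 = FLOAT in onnx-ml TensorProto
tensor.float_data.extend([0.0] * 5)

response = stub.Predict(request)
print(list(response.outputs.keys()))
```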
Advanced Configuration
Number of Worker Threads
Control server utilization with worker threads.
Request Tracking Headers
The server provides headers for request tracking:
- x-ms-request-id: Server-generated GUID for each request (e.g., 72b68108-18a4-493c-ac75-d0abd82f0a11)
- x-ms-client-request-id: Client-provided ID that persists in the response
Example
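Supplying a client ID and inspecting the response headers might look like this (endpoint and payload file are placeholders):

```shell
# -i prints response headers, including the server-generated
# x-ms-request-id and the echoed x-ms-client-request-id.
curl -i -X POST http://127.0.0.1:8001/v1/models/mymodel/versions/1:predict \
  -H "Content-Type: application/json" \
  -H "x-ms-client-request-id: my-trace-001" \
  -d @request.json
```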
rsyslog Integration
If built with rsyslog support, server logs are forwarded to rsyslog; configure the destination in /etc/rsyslog.conf or a file under /etc/rsyslog.d/.
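A minimal drop-in rule might look like the following; the program name is an assumption, so check the tag your build actually emits:

```
# /etc/rsyslog.d/onnxruntime.conf -- assumed program name
if $programname == 'onnxruntime_server' then /var/log/onnxruntime_server.log
& stop
```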
Production Deployment
Docker Deployment
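Microsoft published a prebuilt server image while the project was maintained; a typical run mounted the model directory and exposed both ports (the image reference is historical and may no longer be available):

```shell
docker run -it --rm \
  -v /data/models:/models \
  -p 8001:8001 -p 50051:50051 \
  mcr.microsoft.com/onnxruntime/server \
  --model_path /models/model.onnx
```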
Kubernetes Deployment
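A minimal Deployment sketch; the image, replica count, and ports are placeholders matching the Docker example above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: onnxruntime-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: onnxruntime-server
  template:
    metadata:
      labels:
        app: onnxruntime-server
    spec:
      containers:
      - name: server
        image: mcr.microsoft.com/onnxruntime/server  # historical image
        args: ["--model_path", "/models/model.onnx"]
        ports:
        - containerPort: 8001    # HTTP
        - containerPort: 50051   # GRPC
```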
Load Balancing
Use nginx for load balancing.
Performance Tuning
Thread Configuration
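Worker thread count is set on the command line; the flag name below is the remembered historical one and should be verified with --help:

```shell
# Size the HTTP worker pool to the available cores.
./onnxruntime_server --model_path /models/model.onnx \
  --num_http_threads 8
```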
Model Optimization
- Convert to ORT format for faster loading
- Use graph optimization level 'all'
- Consider quantization for INT8 inference
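The first two bullets can be applied offline with the onnxruntime Python package: serializing an optimized model lets the server skip re-optimizing on every start (quantization lives in onnxruntime.quantization and is not shown):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Apply all graph optimizations and persist the optimized graph.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "model.opt.onnx"

# Creating the session triggers optimization and writes the file.
ort.InferenceSession("model.onnx", sess_options=so)
```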
Monitoring and Debugging
Health Check Endpoint
Implement custom health checks.
Logging Levels
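Verbosity was controlled with a --log_level flag; the value names below reflect the historical CLI:

```shell
# Historical values: verbose, info, warning, error, fatal.
./onnxruntime_server --model_path /models/model.onnx --log_level verbose
```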
Migration Guide
Moving to Triton Inference Server
Triton supports ONNX Runtime as a backend:
- Install Triton: Use official Docker images
- Configure model repository:
- config.pbtxt:
- Start Triton:
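The steps above can be sketched as follows; the image tag and model names are placeholders, while platform "onnxruntime_onnx" is Triton's identifier for its ONNX Runtime backend:

```shell
# 1. Model repository layout expected by Triton:
#    models/
#      mymodel/
#        config.pbtxt
#        1/
#          model.onnx
#
# 2. Minimal config.pbtxt (input/output specs can often be auto-derived):
#      name: "mymodel"
#      platform: "onnxruntime_onnx"
#      max_batch_size: 8
#
# 3. Start Triton (substitute a current image tag):
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/models:/models" \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
```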