# Online Endpoints for Real-Time Inference
Online endpoints provide real-time inference for machine learning models with low latency, automatic scaling, and built-in monitoring. Managed online endpoints handle infrastructure, scaling, and security automatically, so you can focus on your model.
## What are Online Endpoints?
Online endpoints deploy models to web servers that return predictions via HTTP. They are optimized for:

- **Low Latency**: sub-second response times
- **Synchronous Requests**: request-response pattern
- **Real-Time Scoring**: immediate predictions
- **Auto Scaling**: handle traffic spikes
## When to Use Online Endpoints

Choose online endpoints when you need immediate, synchronous predictions on individual requests. For high-volume, asynchronous scoring of large datasets, batch endpoints are a better fit.

## Managed Online Endpoints

Recommended deployment method with full infrastructure management.

### Key Features
Managed endpoints handle infrastructure, scaling, monitoring, and security on your behalf.
Fully Managed:
- Automatic compute provisioning
- OS updates and patching
- Node recovery on failure
- Load balancing
- SSL termination
## Create Online Endpoint
### Step 1: Define Endpoint
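An endpoint can be described in Azure ML v2 YAML. A minimal sketch (the endpoint name is illustrative; `auth_mode: key` selects key-based authentication):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
```

Create it with `az ml online-endpoint create -f endpoint.yml`.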
### Step 2: Create Deployment
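A deployment attaches a model and compute to the endpoint. A minimal YAML sketch (the names, model reference, and VM size are illustrative; custom models additionally need a `code_configuration` pointing at the scoring script and an `environment`):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model:1
instance_type: Standard_DS3_v2
instance_count: 1
```

Create it with `az ml online-deployment create -f deployment.yml`.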
### Step 3: Route Traffic
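Traffic is a property of the endpoint, and the percentages across deployments must sum to 100. Assuming the `blue` deployment name used above, routing all traffic to it looks like:

```yaml
traffic:
  blue: 100
```

The split can also be changed in place, for example with `az ml online-endpoint update --name my-endpoint --traffic "blue=100"`.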
## Scoring Script
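A runnable sketch of a scoring script: the `init`/`run` function names are the contract Azure ML expects, while the placeholder model is purely illustrative (a real script would deserialize a model file instead):

```python
import json

model = None

def init():
    """Runs once when the deployment starts: load the model."""
    global model
    # A real script would load a serialized model from the folder in
    # os.getenv("AZUREML_MODEL_DIR"); this stand-in just sums each row.
    model = lambda rows: [sum(row) for row in rows]

def run(raw_data):
    """Runs once per request; raw_data is the request body as a JSON string."""
    rows = json.loads(raw_data)["data"]
    predictions = model(rows)
    return predictions  # must be JSON-serializable
```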
For custom models, provide a scoring script: Azure ML calls its `init` function once when the container starts and its `run` function on every scoring request; `run` receives the raw request body and returns a JSON-serializable result.

## Invoke Endpoint
You can invoke the endpoint with the Python SDK, the REST API, or cURL.
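Whichever client you use, the underlying call is a plain HTTPS POST to the scoring URI. A standard-library sketch that builds such a request (the URI and key come from the endpoint's details; `azureml-model-deployment` is the header Azure ML uses to route to a specific deployment instead of the traffic split, and all values here are illustrative):

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, api_key, payload, deployment=None):
    """Build (but do not send) a POST request for an online endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # key or token auth
    }
    if deployment:
        # Bypass the traffic split and target one deployment directly.
        headers["azureml-model-deployment"] = deployment
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )

# Sending is then: urllib.request.urlopen(request)
```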
## Traffic Management
### Mirror Traffic (Shadow Testing)
Test a new deployment without affecting production: a percentage of live traffic is copied to it, and the mirrored responses are discarded rather than returned to clients.

### Gradual Rollout
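A gradual rollout shifts the traffic split toward the new deployment in stages, checking health between steps. The staging logic can be sketched as follows (the percentages and deployment names are illustrative):

```python
def rollout_schedule(old="blue", new="green", steps=(10, 50, 100)):
    """Yield successive traffic splits moving traffic from old to new.

    Apply each split via the endpoint's traffic settings, watch error
    rates and latency, and only advance while the new deployment stays
    healthy; otherwise route traffic back to the old deployment.
    """
    for pct in steps:
        yield {old: 100 - pct, new: pct}
```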
## Autoscaling Configuration
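Managed online endpoints autoscale through Azure Monitor autoscale rules: scale out when a metric such as CPU utilization exceeds a threshold, scale in when it drops below another, always within configured instance bounds. The shape of such a rule can be sketched as (all thresholds and bounds here are illustrative, not defaults):

```python
def target_instances(current, cpu_pct, min_instances=1, max_instances=5,
                     scale_out_at=70, scale_in_at=30):
    """Toy sketch of an autoscale rule's decision, not user-facing API:
    add an instance on high CPU, remove one on low CPU, clamp to bounds."""
    if cpu_pct > scale_out_at:
        current += 1
    elif cpu_pct < scale_in_at:
        current -= 1
    return max(min_instances, min(max_instances, current))
```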
## Monitoring and Logging

### View Deployment Logs

Fetch container logs per deployment, for example with `az ml online-deployment get-logs --endpoint-name <endpoint> --name <deployment>`.

### Query Metrics

Request and resource metrics are surfaced through Azure Monitor.

### Key Metrics
| Metric | Description | Threshold |
|---|---|---|
| Request Latency (P95) | 95th percentile response time | <500ms |
| Requests Per Minute | Throughput | - |
| HTTP 2xx Rate | Success rate | >99% |
| HTTP 4xx Rate | Client errors | <1% |
| HTTP 5xx Rate | Server errors | <0.1% |
| CPU Utilization | Compute usage | <80% |
| Memory Utilization | RAM usage | <80% |
| Instance Count | Active instances | - |
## Security Best Practices
### Use Managed Identity
Enable system-assigned identity for secure access:
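In the endpoint YAML this is the `identity` block; the rest of the endpoint definition is unchanged:

```yaml
identity:
  type: system_assigned
```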
### Enable Private Endpoints
Disable public access for sensitive workloads:
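A sketch of the relevant endpoint YAML field, assuming the workspace is reachable over a private link:

```yaml
public_network_access: disabled
```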
### Rotate Keys Regularly

Regenerate endpoint authentication keys on a schedule so that a leaked key has a limited lifetime.
### Use Customer-Managed Keys
Encrypt data at rest with your own keys:
## Performance Optimization
- **Model Optimization**: reduce model size and inference time
  - Convert to ONNX format
  - Apply quantization
  - Prune unnecessary layers
  - Use model distillation
- **Batch Predictions**: score multiple rows per request to amortize overhead
- **Caching**: reuse results for repeated identical inputs
- **GPU Acceleration**: use GPU instance types for deep learning workloads
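The caching idea above can be sketched with a memoized scorer; the scorer itself is a hypothetical stand-in, and this only helps when identical payloads recur and the model is deterministic:

```python
from functools import lru_cache

def _predict(features):
    # Hypothetical expensive model call; stands in for real inference.
    return sum(features) * 0.5

@lru_cache(maxsize=1024)
def cached_predict(features):
    """Memoized wrapper: features must be hashable (e.g. a tuple)."""
    return _predict(features)
```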
## Cost Management

You are billed for the compute instances behind each deployment for as long as it exists, so delete unused endpoints and deployments and right-size `instance_type` and `instance_count`.
## Next Steps

- **Batch Endpoints**: deploy for large-scale batch processing
- **Monitor Endpoints**: set up monitoring and alerts
- **MLOps**: automate deployment workflows
- **Troubleshooting**: debug deployment issues