Overview
KServe provides a standardized, serverless inference platform built on Kubernetes. It offers automatic scaling, canary deployments, and seamless integration with Istio for traffic management.

Architecture
KServe uses a two-component architecture:

- InferenceService: Kubernetes CRD defining the model serving configuration
- Custom Model: Python implementation of the prediction logic
Custom Model Implementation
Model Class
The custom model (serving/kserve_api.py) extends KServe’s base Model class:
serving/kserve_api.py
- __init__: Initialize the model name and trigger loading
- load: Download the model from the registry and set the ready state
- predict: Handle inference requests with the standard payload format
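A condensed sketch of this lifecycle follows. `BaseModel` is a self-contained stand-in for `kserve.Model` so the sketch runs without KServe installed; the real class in serving/kserve_api.py extends `kserve.Model`, and the model name and placeholder loader below are illustrative, not the project's actual values:

```python
class BaseModel:
    """Stand-in for kserve.Model so this sketch is self-contained."""
    def __init__(self, name: str):
        self.name = name
        self.ready = False


class CustomModel(BaseModel):
    def __init__(self, name: str):
        # __init__: initialize the model name and trigger loading
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # load: download the model from the registry, then set ready state.
        # A real implementation would pull weights from its storage URI here;
        # an identity function stands in for the loaded model.
        self.model = lambda x: x
        self.ready = True

    def predict(self, payload: dict, headers=None) -> dict:
        # predict: handle inference requests in the standard V1 payload format
        instances = payload["instances"]
        return {"predictions": [self.model(x) for x in instances]}


model = CustomModel("custom-model")
```

KServe only routes traffic to the model once `self.ready` is set, which is why `load()` flips the flag last.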
Lifecycle Management

The model is constructed once at server startup; load() downloads the artifacts and sets the ready flag, and KServe routes traffic to the model only after it reports ready.
Request/Response Protocol
V1 Inference Protocol
KServe uses a standardized format:

- instances: List of input data (any JSON-serializable type)
- predictions: List of outputs matching input order
Input Parsing
- Strings: ["text1", "text2"]
- Numbers: [[1, 2, 3], [4, 5, 6]]
- Objects: [{"text": "...", "id": 1}]
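All three shapes can be exercised against a minimal hand-rolled handler (a sketch to illustrate the protocol; the `predict` function here is illustrative, not KServe's server API):

```python
import json


def predict(payload: dict) -> dict:
    """Echo-style predictor following KServe's V1 inference protocol:
    inputs arrive under "instances", outputs are returned under
    "predictions" in the same order."""
    instances = payload["instances"]
    # A real model would run inference here; we just label each input.
    predictions = [{"input_index": i, "received": x} for i, x in enumerate(instances)]
    return {"predictions": predictions}


# Each of the input shapes above is a valid V1 payload:
for body in (
    {"instances": ["text1", "text2"]},          # strings
    {"instances": [[1, 2, 3], [4, 5, 6]]},      # numeric tensors
    {"instances": [{"text": "...", "id": 1}]},  # objects
):
    response = predict(json.loads(json.dumps(body)))  # round-trip like HTTP would
    assert len(response["predictions"]) == len(body["instances"])
```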
Kubernetes Deployment
InferenceService Manifest
k8s/kserve-inferenceserver.yaml
- InferenceService: Custom Resource Definition (CRD)
- predictor: Container running the model server
- env: Environment variables (secrets, config)
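A minimal manifest of this shape might look as follows (a sketch only; the names, image, and secret are placeholders, not the contents of k8s/kserve-inferenceserver.yaml):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: registry.example.com/custom-model:latest  # placeholder image
        env:
          - name: MODEL_URI
            valueFrom:
              secretKeyRef:
                name: model-secrets   # placeholder secret
                key: model-uri
```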
Installation
Install KServe
- KServe operator
- Istio for traffic management
- Knative Serving for autoscaling
- Cert-manager for TLS
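The components above are commonly installed together with KServe's serverless quick-install script (a sketch; check the KServe docs for the release tag matching your cluster, and note this requires cluster-admin):

```shell
# Installs KServe plus Istio, Knative Serving, and cert-manager
# into the current kubectl context.
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash

# Verify the control plane came up
kubectl get pods -n kserve
```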
Accessing the Service
Port Forwarding
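Without a load balancer, the Istio ingress gateway can be reached locally via port forwarding (the namespace and service name below are the quick-install defaults; adjust for your setup):

```shell
# Forward local port 8080 to the Istio ingress gateway
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
```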
Making Requests
- Host header: Routes to the correct InferenceService
- URL path: /v1/models/{model-name}:predict
- Input file: kserve-input.json with instances
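Put together, a request through the port-forward might look like this (a sketch; the service name `custom-model`, namespace `default`, and domain are placeholders for your actual InferenceService hostname):

```shell
SERVICE_HOSTNAME=custom-model.default.example.com

curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @kserve-input.json \
  http://localhost:8080/v1/models/custom-model:predict
```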
Advanced Features
Autoscaling
KServe automatically scales based on request load:

- Scales to zero when idle (after 60s by default)
- Scales up based on concurrent requests
- Cold start latency: 5-15 seconds
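Scaling behavior is tuned on the predictor spec; a sketch (minReplicas, maxReplicas, scaleTarget, and scaleMetric are standard InferenceService fields, the values are illustrative):

```yaml
spec:
  predictor:
    minReplicas: 0          # allow scale-to-zero when idle
    maxReplicas: 5
    scaleMetric: concurrency
    scaleTarget: 10         # target concurrent requests per replica
```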
Canary Deployments
Deploy new model versions with traffic splitting:

- A/B testing model versions
- Gradual rollout of new models
- Risk mitigation for model updates
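KServe's built-in canary rollout is driven by canaryTrafficPercent on the predictor: updating the spec with this field sends the given share of traffic to the new revision while the rest continues to hit the previous one (a sketch; image is a placeholder):

```yaml
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% of traffic to the new revision
    containers:
      - name: kserve-container
        image: registry.example.com/custom-model:v2  # placeholder image
```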
Transformer (Preprocessing)
Add preprocessing before prediction.

Explainer (Post-processing)
Add model explanations.

Monitoring and Logging
View Logs
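Predictor logs can be tailed via the label KServe puts on its pods (substitute your InferenceService name for `custom-model`):

```shell
kubectl logs -l serving.kserve.io/inferenceservice=custom-model \
  -c kserve-container --follow
```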
Metrics
KServe exposes Prometheus metrics:

- request_total: Total requests
- request_duration_seconds: Latency distribution
- request_failure_total: Failed requests
Health Checks
KServe provides built-in endpoints: GET /v1/models/{name} reports model readiness, and the server root answers liveness probes.

Local Development
Build and Run
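A local loop might look like the following (a sketch; the module path follows serving/kserve_api.py referenced above, the image tag is a placeholder, and 8080 is KServe's default HTTP port):

```shell
# Run the model server directly
python serving/kserve_api.py

# Or build and run the container
docker build -t custom-model:dev .
docker run -p 8080:8080 custom-model:dev

# Smoke test against the local server
curl -H "Content-Type: application/json" \
  -d '{"instances": ["text1"]}' \
  http://localhost:8080/v1/models/custom-model:predict
```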
Troubleshooting
InferenceService not ready

Problem: kubectl get isvc shows Unknown or False

Solutions:
- Inspect events with kubectl describe isvc <name>
- Check the predictor pod status and logs for image-pull or crash errors
- Verify the model storage URI and any referenced secrets exist
404 errors on requests
Problem: Requests fail with 404 Not Found

Solutions:
- Verify Host header matches service name
- Check Istio gateway is running
- Ensure URL path is correct: /v1/models/{name}:predict
- Test with verbose curl: curl -v
Slow cold starts

Problem: First request takes >30 seconds

Solutions:
- Set minReplicas: 1 to prevent scale-to-zero
- Use init containers for model download
- Cache model in persistent volume
- Optimize image size
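The first mitigation as a spec fragment (minReplicas is a standard predictor field):

```yaml
spec:
  predictor:
    minReplicas: 1   # keep one replica warm; disables scale-to-zero
```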
Comparison: KServe vs Alternatives
| Feature | KServe | Seldon Core | BentoML |
|---|---|---|---|
| Kubernetes native | Yes | Yes | Partial |
| Autoscaling | Excellent | Good | Limited |
| Multi-framework | Yes | Yes | Yes |
| Canary deployments | Built-in | Via Istio | Manual |
| Complexity | Medium | High | Low |
| Community | Large | Large | Growing |
Choose KServe when you are:

- Running on Kubernetes
- Need autoscaling and canary deployments
- Want standardized inference protocol
- Using Istio service mesh
Best Practices
Resource Limits
Set appropriate CPU/memory limits to prevent OOM
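For example (values are illustrative; size them to your model):

```yaml
spec:
  predictor:
    containers:
      - name: kserve-container
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi
```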
Model Caching
Use persistent volumes for faster restarts
Health Checks
Implement comprehensive health checks in load()

Monitoring
Export custom metrics for model-specific monitoring
Production Checklist
- Configure resource requests/limits
- Set up persistent volume for model cache
- Enable Prometheus metrics scraping
- Configure HPA for autoscaling
- Set up logging aggregation
- Implement request timeouts
- Add authentication/authorization
- Configure TLS certificates
Next Steps
vLLM Serving
Serve large language models with vLLM and LoRA adapters