Overview

KServe provides a standardized, serverless inference platform built on Kubernetes. It offers automatic scaling, canary deployments, and seamless integration with Istio for traffic management.

Architecture

KServe uses a two-component architecture:
  1. InferenceService: Kubernetes CRD defining the model serving configuration
  2. Custom Model: Python implementation of the prediction logic

Custom Model Implementation

Model Class

The custom model (serving/kserve_api.py) extends KServe’s base Model class:
serving/kserve_api.py
from typing import Dict, Optional

from kserve import Model, ModelServer
from serving.predictor import Predictor

class CustomModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.load()

    def load(self):
        # Download the model from the registry and mark the server ready
        self.predictor = Predictor.default_from_model_registry()
        self.ready = True

    def predict(self, payload: Dict, headers: Optional[Dict[str, str]] = None) -> Dict:
        instances = payload["instances"]
        predictions = self.predictor.predict(instances)
        return {"predictions": predictions.tolist()}

if __name__ == "__main__":
    model = CustomModel("custom-model")
    ModelServer().start([model])
Key methods:
  • __init__: Initialize model name and trigger loading
  • load: Download model from registry and set ready state
  • predict: Handle inference requests with standard payload format

Lifecycle Management

  1. Initialization: __init__ is called when the container starts.
  2. Model Loading: load() downloads the model from W&B and sets self.ready = True.
  3. Health Checks: KServe checks self.ready for liveness/readiness probes.
  4. Request Handling: predict() is called for each inference request.
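The lifecycle above can be simulated without a cluster or the kserve package. This sketch stubs out the Predictor (the real class lives in serving/predictor.py) just to show the init → load → ready → predict flow:

```python
from typing import Dict, List, Optional


class StubPredictor:
    """Stand-in for serving.predictor.Predictor (illustrative assumption)."""

    def predict(self, instances: List) -> List[List[float]]:
        # Return a dummy two-class probability per instance.
        return [[0.5, 0.5] for _ in instances]


class LifecycleModel:
    """Mirrors the CustomModel lifecycle without requiring kserve."""

    def __init__(self, name: str):
        self.name = name
        self.ready = False
        self.load()  # 1. __init__ triggers loading at container start

    def load(self):
        self.predictor = StubPredictor()  # 2. fetch the model (stubbed here)
        self.ready = True                 # 3. health checks read this flag

    def predict(self, payload: Dict, headers: Optional[Dict] = None) -> Dict:
        # 4. called once per inference request
        return {"predictions": self.predictor.predict(payload["instances"])}


model = LifecycleModel("custom-model")
print(model.ready)  # True once load() completes
print(model.predict({"instances": ["good", "bad123"]}))
```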

Request/Response Protocol

V1 Inference Protocol

KServe uses a standardized format:
{
  "instances": ["good", "bad123"]
}
Protocol specification:
  • instances: List of input data (any JSON-serializable type)
  • predictions: List of outputs matching input order

Input Parsing

def predict(self, payload: Dict, headers: Optional[Dict[str, str]] = None) -> Dict:
    instances = payload["instances"]
    predictions = self.predictor.predict(instances)
    return {"predictions": predictions.tolist()}
Supported formats:
  • Strings: ["text1", "text2"]
  • Numbers: [[1, 2, 3], [4, 5, 6]]
  • Objects: [{"text": "...", "id": 1}]
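A predictor that accepts all three shapes might normalize them before inference. This is a sketch with a hypothetical extract_texts helper, not code from the repository:

```python
from typing import List


def extract_texts(instances: List) -> List:
    """Normalize the three documented instance shapes into a flat list
    the predictor can consume (illustrative helper, not in the repo)."""
    normalized = []
    for item in instances:
        if isinstance(item, str):        # "text1"
            normalized.append(item)
        elif isinstance(item, dict):     # {"text": "...", "id": 1}
            normalized.append(item["text"])
        else:                            # [1, 2, 3] numeric feature vector
            normalized.append(item)
    return normalized


print(extract_texts(["text1", {"text": "text2", "id": 1}, [1, 2, 3]]))
# ['text1', 'text2', [1, 2, 3]]
```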

Kubernetes Deployment

InferenceService Manifest

k8s/kserve-inferenceserver.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: wandb
              key: WANDB_API_KEY
Manifest structure:
  • InferenceService: Custom Resource Definition (CRD)
  • predictor: Container running the model server
  • env: Environment variables (secrets, config)

Installation

1. Install KServe

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.13/hack/quick_install.sh" | bash

This installs:
  • KServe operator
  • Istio for traffic management
  • Knative Serving for autoscaling
  • Cert-manager for TLS

2. Verify installation

kubectl get pods -n kserve
kubectl get pods -n istio-system

3. Create secrets

export WANDB_API_KEY='your-key'
kubectl create secret generic wandb \
  --from-literal=WANDB_API_KEY=$WANDB_API_KEY

4. Deploy model

kubectl create -f k8s/kserve-inferenceserver.yaml

5. Check status

kubectl get inferenceservices
kubectl get pods -l serving.kserve.io/inferenceservice=custom-model

Accessing the Service

Port Forwarding

kubectl port-forward --namespace istio-system \
  svc/istio-ingressgateway 8080:80

Making Requests

curl -v \
  -H "Host: custom-model.default.example.com" \
  -H "Content-Type: application/json" \
  "http://localhost:8080/v1/models/custom-model:predict" \
  -d @data-samples/kserve-input.json
Request components:
  • Host header: Routes to correct InferenceService
  • URL path: /v1/models/{model-name}:predict
  • Input file: kserve-input.json with instances
Expected response:
{
  "predictions": [
    [0.23, 0.77],
    [0.89, 0.11]
  ]
}
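The same request can be issued from Python with only the standard library. A minimal sketch; the Host header and model name match the manifest above, and the gateway address assumes the port-forward is running:

```python
import json
import urllib.request


def build_predict_request(model_name, instances,
                          gateway="http://localhost:8080",
                          host_header="custom-model.default.example.com"):
    """Build a V1 predict request; send it with urllib.request.urlopen."""
    url = f"{gateway}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Host": host_header, "Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("custom-model", ["good", "bad123"])
print(req.full_url)  # http://localhost:8080/v1/models/custom-model:predict
# To actually send (requires the port-forward above):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["predictions"])
```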

Advanced Features

Autoscaling

KServe automatically scales based on request load:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 10  # Concurrent requests per pod
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
Scaling behavior:
  • Scales to zero when idle (after 60s by default)
  • Scales up based on concurrent requests
  • Cold start latency: 5-15 seconds
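The scaleTarget setting translates directly into a steady-state replica count. A simplified back-of-envelope sketch (the real Knative autoscaler also uses averaging windows and panic mode):

```python
import math


def required_replicas(concurrent_requests: int, scale_target: int,
                      min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the pod count the autoscaler converges to for a steady load."""
    needed = math.ceil(concurrent_requests / scale_target)
    return max(min_replicas, min(max_replicas, needed))


print(required_replicas(45, scale_target=10))   # 5 pods for 45 concurrent requests
print(required_replicas(500, scale_target=10))  # capped at maxReplicas = 10
```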

Canary Deployments

Deploy new model versions with traffic splitting:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:v2
    canaryTrafficPercent: 20  # 20% to new version
Use cases:
  • A/B testing model versions
  • Gradual rollout of new models
  • Risk mitigation for model updates

Transformer (Preprocessing)

Add preprocessing before prediction:
spec:
  transformer:
    containers:
      - name: transformer
        image: ghcr.io/kyryl-opens-ml/transformer:latest
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
Transformer implementation:
from typing import Dict, Optional
from kserve import Model

def clean_text(text: str) -> str:
    # Placeholder; real cleaning/tokenization is model-specific
    return text.strip().lower()

class CustomTransformer(Model):
    def preprocess(self, payload: Dict, headers: Optional[Dict] = None) -> Dict:
        # Clean text before it reaches the predictor
        cleaned = [clean_text(t) for t in payload["instances"]]
        return {"instances": cleaned}

Explainer (Post-processing)

Add model explanations:
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/kyryl-opens-ml/app-kserve:latest
  explainer:
    containers:
      - name: explainer
        image: ghcr.io/kyryl-opens-ml/explainer:latest

Monitoring and Logging

View Logs

# Get pod name
kubectl get pods -l serving.kserve.io/inferenceservice=custom-model

# View logs
kubectl logs <pod-name> kserve-container -f

Metrics

KServe exposes Prometheus metrics:
# Port forward to metrics endpoint
kubectl port-forward <pod-name> 9090:9090

# Query metrics
curl http://localhost:9090/metrics
Key metrics:
  • request_total: Total requests
  • request_duration_seconds: Latency distribution
  • request_failure_total: Failed requests
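Prometheus exposes these metrics in a plain text format that is easy to inspect ad hoc. A sketch that parses simple `name value` lines (it skips HELP/TYPE comments and ignores label sets for brevity):

```python
def parse_prometheus_text(text: str) -> dict:
    """Parse simple `name value` lines from a Prometheus /metrics dump."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.partition(" ")
        metrics[name] = float(value)
    return metrics


sample = """# HELP request_total Total requests
request_total 1042
request_failure_total 3
"""
print(parse_prometheus_text(sample)["request_total"])  # 1042.0
```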

Health Checks

KServe provides built-in endpoints:
# Liveness probe
curl http://localhost:8080/v1/models/custom-model

# Readiness probe (checks self.ready)
curl http://localhost:8080/v1/models/custom-model/ready
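In scripts it is often useful to block until the readiness endpoint answers, for example before running a smoke test. A minimal stdlib sketch; the URL shape matches the readiness probe above:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, model: str, timeout: float = 60.0,
                     poll_interval: float = 2.0) -> bool:
    """Poll the V1 readiness endpoint until the model reports ready."""
    deadline = time.monotonic() + timeout
    url = f"{base_url}/v1/models/{model}/ready"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass  # server not up yet; retry until the deadline
        time.sleep(poll_interval)
    return False
```

Usage: `wait_until_ready("http://localhost:8080", "custom-model")` returns True once the model has set self.ready.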

Local Development

Build and Run

# Build
make build_kserve

# Run locally
make run_kserve
This starts the server on port 8081:
# Test locally
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"instances": ["test"]}' \
  http://localhost:8081/v1/models/custom-model:predict

Troubleshooting

Problem: kubectl get isvc shows Unknown or False

Solutions:
# Check events
kubectl describe inferenceservice custom-model

# Check pod status
kubectl get pods -l serving.kserve.io/inferenceservice=custom-model
kubectl logs <pod-name> -c kserve-container

# Common issues:
# - Missing secrets (WANDB_API_KEY)
# - Image pull errors
# - Model loading failures
Problem: Requests fail with 404 Not Found

Solutions:
  • Verify Host header matches service name
  • Check Istio gateway is running
  • Ensure URL path is correct: /v1/models/{name}:predict
  • Test with verbose curl: curl -v
Problem: First request takes >30 seconds

Solutions:
  • Set minReplicas: 1 to prevent scale-to-zero
  • Use init containers for model download
  • Cache model in persistent volume
  • Optimize image size

Comparison: KServe vs Alternatives

| Feature            | KServe    | Seldon Core | BentoML |
|--------------------|-----------|-------------|---------|
| Kubernetes native  | Yes       | Yes         | Partial |
| Autoscaling        | Excellent | Good        | Limited |
| Multi-framework    | Yes       | Yes         | Yes     |
| Canary deployments | Built-in  | Via Istio   | Manual  |
| Complexity         | Medium    | High        | Low     |
| Community          | Large     | Large       | Growing |
Choose KServe when:
  • Running on Kubernetes
  • Need autoscaling and canary deployments
  • Want standardized inference protocol
  • Using Istio service mesh

Best Practices

Resource Limits

Set appropriate CPU/memory limits to prevent OOM

Model Caching

Use persistent volumes for faster restarts

Health Checks

Implement comprehensive health checks in load()

Monitoring

Export custom metrics for model-specific monitoring

Production Checklist

  • Configure resource requests/limits
  • Set up persistent volume for model cache
  • Enable Prometheus metrics scraping
  • Configure HPA for autoscaling
  • Set up logging aggregation
  • Implement request timeouts
  • Add authentication/authorization
  • Configure TLS certificates

Next Steps

vLLM Serving

Serve large language models with vLLM and LoRA adapters
