
Overview

Deploying vLLM on Kubernetes enables scalable, production-ready LLM serving with built-in orchestration, health checks, and resource management.

Prerequisites

To follow this guide you need:
  • A running Kubernetes cluster with kubectl configured against it
  • For GPU deployments, the matching device plugin installed (e.g. the NVIDIA device plugin exposing nvidia.com/gpu resources)
  • A Hugging Face token for pulling gated models such as Llama and Mistral

CPU deployment

CPU deployment is for testing purposes only. Performance will not be comparable to GPU deployment.
Step 1: Create PVC and Secret

Create storage for model cache and Hugging Face token:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
EOF
Step 2: Deploy vLLM server

Set the image for your CPU architecture (the example below uses the x86_64 build):
export VLLM_IMAGE=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: $VLLM_IMAGE
        command: ["/bin/sh", "-c"]
        args:
          - "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
EOF
Step 3: Create Service

Expose the vLLM deployment:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
Step 4: Verify deployment

Check the logs to ensure the server started successfully:
kubectl logs -l app.kubernetes.io/name=vllm
You should see:
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
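
Beyond reading logs, you can confirm the server actually answers requests. Below is a minimal sketch (the `wait_until_healthy` helper is hypothetical, not part of vLLM) that polls vLLM's /health endpoint, assuming `kubectl port-forward svc/vllm-server 8000:8000` is running in another terminal:

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # reached via kubectl port-forward

def wait_until_healthy(base_url=BASE_URL, timeout=600, interval=5, probe=None):
    """Poll /health until it returns HTTP 200, or give up after `timeout` seconds."""
    if probe is None:  # allow injecting a fake probe for testing
        def probe():
            try:
                with urllib.request.urlopen(f"{base_url}/health") as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "timed out")
```

A long timeout is deliberate: the first start downloads model weights into the PVC, which can take several minutes.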

GPU deployment

NVIDIA GPU deployment

Step 1: Create resources

Create PVC, Secret, and Deployment:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs shared memory for tensor parallel inference
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args:
          - "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max-num-batched-tokens 1024"
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 6G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
The shared memory volume (/dev/shm) is essential for tensor parallel inference.
Step 2: Create Service

apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP
Setting sessionAffinity: ClientIP routes requests from the same client to the same pod, which improves prefix-cache hit rates across that client's requests.
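
A sketch of the relevant Service fields for client-sticky routing (timeoutSeconds is the affinity window; 10800 seconds is the Kubernetes default):

```yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```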
Step 3: Deploy

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Step 4: Test the deployment

From a pod inside the cluster, query the OpenAI-compatible API:
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'
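
The same request can be issued from application code. Here is a minimal sketch using only the Python standard library (the `complete` helper is hypothetical; run it from a pod that can resolve the in-cluster Service name, which maps port 80 to the container's 8000):

```python
import json
import urllib.request

# In-cluster Service URL from the manifests above.
BASE_URL = "http://mistral-7b.default.svc.cluster.local"

def build_completion_request(prompt, model="mistralai/Mistral-7B-Instruct-v0.3",
                             max_tokens=7, temperature=0.0):
    """Build the JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt, base_url=BASE_URL):
    """POST a completion request and return the generated text."""
    data = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/v1/completions", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("San Francisco is a"))
```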

AMD GPU deployment

For AMD ROCm GPUs (e.g., MI300X):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          runAsGroup: 44
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args:
          - "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max-num-batched-tokens 1024"
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
See the ROCm k8s-device-plugin examples for complete setup instructions.

Multi-GPU deployment

For tensor parallel inference across multiple GPUs:
resources:
  limits:
    nvidia.com/gpu: "4"  # 4 GPUs
  requests:
    nvidia.com/gpu: "4"
command:
  - "vllm"
  - "serve"
  - "meta-llama/Meta-Llama-3-70B-Instruct"
  - "--tensor-parallel-size"
  - "4"
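
As a rough sizing rule of thumb: fp16/bf16 weights take about 2 bytes per parameter, split evenly across the tensor-parallel group. A hypothetical helper illustrating the estimate (weights only; the KV cache and activations add substantially more):

```python
def weight_memory_per_gpu_gb(params_billions, bytes_per_param=2, tp_size=1):
    """Approximate per-GPU memory for model weights alone, in GB."""
    return params_billions * bytes_per_param / tp_size

# Llama-3-70B in bf16 across 4 GPUs: ~35 GB of weights per GPU.
print(weight_memory_per_gpu_gb(70, tp_size=4))  # 35.0
```

This is why a 70B model does not fit on a single 80 GB GPU with headroom for the KV cache, but splits comfortably across four.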

Helm deployment

For production deployments, use the official Helm chart:
helm repo add vllm https://vllm-project.github.io/vllm-helm-charts
helm repo update

helm install vllm-deployment vllm/vllm \
  --set image.repository=vllm/vllm-openai \
  --set image.tag=latest \
  --set model.name=meta-llama/Llama-3.2-1B-Instruct \
  --set resources.nvidia\.com/gpu=1
See the Helm framework guide for detailed configuration options.

Kubernetes integrations

vLLM supports deployment through various Kubernetes frameworks:
  • Helm - Official Helm charts
  • KServe - Model serving platform
  • KubeRay - Ray on Kubernetes
  • KAITO - Kubernetes AI Toolchain Operator
  • KubeAI - Kubernetes AI platform
  • Kthena - Multi-tenancy AI platform

Troubleshooting

Startup probe failure

If the logs end with KeyboardInterrupt: terminated, the container was likely killed after a failed probe. Check the cluster events:
kubectl get events
If you see Container $NAME failed startup probe, increase the failureThreshold:
startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # Increase for larger models
  periodSeconds: 10
  failureThreshold: 30  # Increase threshold
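
These settings bound how long a pod may take to come up: roughly initialDelaySeconds + failureThreshold × periodSeconds before the probe gives up. A quick sanity check with the values above:

```python
def max_startup_window(initial_delay_s, period_s, failure_threshold):
    """Worst-case seconds before the kubelet marks the probe as failed."""
    return initial_delay_s + failure_threshold * period_s

print(max_startup_window(120, 10, 30))  # 420 seconds, i.e. 7 minutes to load the model
```

Size this window to exceed your model's worst-case download-plus-load time.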

GPU not detected

Verify GPU resources:
kubectl describe node <node-name> | grep nvidia.com/gpu

Out of memory

  1. Increase shared memory volume size
  2. Reduce batch size with --max-num-batched-tokens
  3. Use quantization (INT8, FP8)
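
For example, the serve arguments in the Deployment can be tightened (a sketch; tune the values for your model and GPU):

```yaml
args:
  - "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --max-num-batched-tokens 512 --gpu-memory-utilization 0.80"
```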

Pod stuck in pending

Check resource availability:
kubectl describe pod <pod-name>
Common issues:
  • Insufficient GPU resources
  • PVC not bound
  • Image pull errors
