Overview
Deploying vLLM on Kubernetes enables scalable, production-ready LLM serving with built-in orchestration, health checks, and resource management.
Prerequisites
- Running Kubernetes cluster (v1.24+)
- kubectl configured to access your cluster
- GPU support (for GPU deployments): a GPU device plugin installed on the cluster (for example, the NVIDIA device plugin or GPU Operator for NVIDIA GPUs, or the AMD GPU device plugin for ROCm)
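Before proceeding, it can be worth confirming these prerequisites from the command line. The following is a quick sketch; the last command assumes an NVIDIA device plugin that advertises the `nvidia.com/gpu` resource:

```shell
# Check client and server versions (server should be v1.24+)
kubectl version

# Confirm kubectl can reach the cluster
kubectl cluster-info

# For GPU deployments: list allocatable GPUs per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
```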
CPU deployment
CPU deployment is for testing purposes only. Performance will not be comparable to GPU deployment.
Create PVC and Secret
Create storage for the model cache and a Secret for the Hugging Face token:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
EOF
```
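After applying, it is worth confirming that the PVC and Secret exist before deploying:

```shell
# STATUS should be Bound; with a WaitForFirstConsumer storage class it
# stays Pending until the first pod using it is scheduled
kubectl get pvc vllm-models

# Confirm the token Secret was created
kubectl get secret hf-token-secret
```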
Deploy vLLM server
Select the appropriate image based on your CPU architecture, then apply the Deployment:

```bash
export VLLM_IMAGE=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
        - name: vllm
          image: $VLLM_IMAGE
          command: ["/bin/sh", "-c"]
          args:
            - "vllm serve meta-llama/Llama-3.2-1B-Instruct"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: llama-storage
              mountPath: /root/.cache/huggingface
      volumes:
        - name: llama-storage
          persistentVolumeClaim:
            claimName: vllm-models
EOF
```
Create Service
Expose the vLLM deployment:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
EOF
```
Verify deployment
Check the logs to ensure the server started successfully:

```bash
kubectl logs -l app.kubernetes.io/name=vllm
```

You should see output similar to:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```
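The server can then be smoke-tested by port-forwarding the Service and calling the OpenAI-compatible completions endpoint that vLLM serves on port 8000 (prompt and token count here are arbitrary examples):

```shell
# Forward local port 8000 to the Service
kubectl port-forward svc/vllm-server 8000:8000 &

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 8
  }'
```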
GPU deployment
NVIDIA GPU deployment
Create resources
Create the PVC, Secret, and Deployment:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: mistral-7b
        # vLLM needs shared memory for tensor parallel inference
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: mistral-7b
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args:
            - "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "10"
              memory: 20G
              nvidia.com/gpu: "1"
            requests:
              cpu: "2"
              memory: 6G
              nvidia.com/gpu: "1"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
```

The shared memory volume (/dev/shm) is essential for tensor parallel inference.
Create Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
    - name: http-mistral-7b
      port: 80
      protocol: TCP
      targetPort: 8000
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP
```
Set sessionAffinity: ClientIP to route requests from the same client to the same pod, which improves prefix-cache hit rates.
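A sketch of how the Service spec above could be adjusted for client affinity; the timeout value shown is the Kubernetes default and should be tuned to your traffic pattern:

```yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # how long affinity to a pod persists after the last request
```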
Deploy
Save the manifests above as deployment.yaml and service.yaml, then apply:

```bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
Test the deployment
This uses the in-cluster DNS name, so run it from a pod inside the cluster (or port-forward the Service):

```bash
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
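The same server also exposes the OpenAI-compatible chat endpoint, which is usually what client libraries integrate against (the message content here is an arbitrary example):

```shell
curl http://mistral-7b.default.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 32
  }'
```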
AMD GPU deployment
For AMD ROCm GPUs (e.g., MI300X):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: mistral-7b
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
        - name: mistral-7b
          image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
          securityContext:
            seccompProfile:
              type: Unconfined
            runAsGroup: 44
            capabilities:
              add:
                - SYS_PTRACE
          command: ["/bin/sh", "-c"]
          args:
            - "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "10"
              memory: 20G
              amd.com/gpu: "1"
            requests:
              cpu: "6"
              memory: 6G
              amd.com/gpu: "1"
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
```
Multi-GPU deployment
For tensor parallel inference across multiple GPUs:
```yaml
resources:
  limits:
    nvidia.com/gpu: "4"  # 4 GPUs
  requests:
    nvidia.com/gpu: "4"
```

```yaml
command:
  - "vllm"
  - "serve"
  - "meta-llama/Meta-Llama-3-70B-Instruct"
  - "--tensor-parallel-size"
  - "4"
```
Helm deployment
For production deployments, use the official Helm chart:
```bash
helm repo add vllm https://vllm-project.github.io/vllm-helm-charts
helm repo update
helm install vllm-deployment vllm/vllm \
  --set image.repository=vllm/vllm-openai \
  --set image.tag=latest \
  --set model.name=meta-llama/Llama-3.2-1B-Instruct \
  --set resources.nvidia\.com/gpu=1
```
See the Helm framework guide for detailed configuration options.
Kubernetes integrations
vLLM supports deployment through various Kubernetes frameworks:
- Helm - Official Helm charts
- KServe - Model serving platform
- KubeRay - Ray on Kubernetes
- KAITO - Kubernetes AI Toolchain Operator
- KubeAI - Kubernetes AI platform
- Kthena - Multi-tenancy AI platform
Troubleshooting
Startup probe failure
If you see KeyboardInterrupt: terminated in the logs, or Container $NAME failed startup probe in the pod events, the container is being killed and restarted before the model finishes loading. Increase the probe's initial delay and failureThreshold:
```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # Increase for larger models
  periodSeconds: 10
  failureThreshold: 30  # Increase threshold
```
GPU not detected
Verify GPU resources:
```bash
kubectl describe node <node-name> | grep nvidia.com/gpu
```
Out of memory
- Increase the shared memory volume size
- Reduce the batch size with --max-num-batched-tokens
- Use quantization (INT8, FP8)
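As a sketch of the first two remedies, the Deployment above could be adjusted along these lines (the values shown are illustrative starting points, not tuned recommendations):

```yaml
# Larger shared memory volume for tensor parallel inference
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: "8Gi"   # was 2Gi
```

```yaml
# Smaller batches and headroom on GPU memory
args:
  - "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --max-num-batched-tokens 512 --gpu-memory-utilization 0.85"
```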
Pod stuck in pending
Check resource availability:
```bash
kubectl describe pod <pod-name>
```
Common issues:
- Insufficient GPU resources
- PVC not bound
- Image pull errors
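Each of these can be checked directly; the following commands surface the relevant scheduling events, PVC state, and container status:

```shell
# Recent cluster events, newest last (shows scheduling and image pull failures)
kubectl get events --sort-by=.lastTimestamp

# PVC binding status (STATUS should be Bound)
kubectl get pvc

# Container state, e.g. waiting with reason ImagePullBackOff
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].state}'
```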
Next steps