Overview

Kubernetes provides robust orchestration for SGLang deployments, enabling auto-scaling, self-healing, and declarative configuration. This guide covers single-node and distributed multi-node deployments.

Prerequisites

  • Kubernetes cluster version ≥1.26 (≥1.28 for the apps.kubernetes.io/pod-index label used by the StatefulSet example)
  • NVIDIA GPU Operator or device plugin installed
  • kubectl configured for cluster access
  • Storage class for persistent volumes (for model caching)

GPU Support Setup

Install the NVIDIA device plugin if it is not already available:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Verify GPU nodes:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
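To confirm the driver, runtime, and device plugin are wired up end to end, you can run a throwaway pod that executes nvidia-smi (a minimal sketch; the CUDA image tag is illustrative, and runtimeClassName can be omitted if your cluster already defaults to the NVIDIA runtime):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia # assumes the RuntimeClass defined in the deployment below
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04 # any CUDA base image matching your driver
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
kubectl logs gpu-smoke-test should then print the usual nvidia-smi table.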

Single-Node Deployment

Basic Deployment

Deploy a single-replica SGLang server:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-31-8b-sglang
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: default # change this to your preferred storage class
  volumeMode: Filesystem
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: meta-llama-31-8b-instruct-sglang
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: meta-llama-31-8b-instruct-sglang
  template:
    metadata:
      labels:
        app: meta-llama-31-8b-instruct-sglang
        model: meta-llama-31-8b-instruct
        engine: sglang
    spec:
      restartPolicy: Always
      runtimeClassName: nvidia
      containers:
        - name: meta-llama-31-8b-instruct-sglang
          image: docker.io/lmsysorg/sglang:latest
          imagePullPolicy: Always # alternatives: IfNotPresent, Never
          ports:
            - containerPort: 30000
          command: ["python3", "-m", "sglang.launch_server"]
          args:
            [
              "--model-path",
              "meta-llama/Llama-3.1-8B-Instruct",
              "--host",
              "0.0.0.0",
              "--port",
              "30000",
            ]
          env:
            - name: HF_TOKEN
              value: <secret> # inline for brevity; prefer a Secret reference, shown below
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: 8
              memory: 40Gi
            requests:
              cpu: 2
              memory: 16Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: hf-cache
              mountPath: /root/.cache/huggingface
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 120
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health_generate
              port: 30000
            initialDelaySeconds: 120
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
            successThreshold: 1
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
        - name: hf-cache
          persistentVolumeClaim:
            claimName: llama-31-8b-sglang
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: File
---
apiVersion: v1
kind: Service
metadata:
  name: meta-llama-31-8b-instruct-sglang
spec:
  selector:
    app: meta-llama-31-8b-instruct-sglang
  ports:
    - protocol: TCP
      port: 80 # service port
      targetPort: 30000 # container port
  type: LoadBalancer # change to ClusterIP if needed
Save as sglang-deployment.yaml and apply:
kubectl apply -f sglang-deployment.yaml
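The manifest above inlines the Hugging Face token for brevity. A safer pattern is to store it in a Kubernetes Secret (the hf-token name and token key are illustrative):
kubectl create secret generic hf-token --from-literal=token=<your-hf-token>
Then reference it from the container instead of a literal value:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token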

Verify Deployment

# Check pod status
kubectl get pods -l app=meta-llama-31-8b-instruct-sglang

# View logs
kubectl logs -f deployment/meta-llama-31-8b-instruct-sglang

# Check service endpoint
kubectl get svc meta-llama-31-8b-instruct-sglang
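Once the pod reports Ready, send a test request through the Service. SGLang exposes an OpenAI-compatible API, so a standard chat completion works (this assumes the LoadBalancer has received an external IP):
EXTERNAL_IP=$(kubectl get svc meta-llama-31-8b-instruct-sglang \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl http://$EXTERNAL_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'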

Multi-Node Distributed Deployment

Using StatefulSet

For tensor parallelism that spans multiple nodes, a StatefulSet gives each pod a stable index that maps directly to its node rank:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: distributed-sglang
spec:
  replicas: 2   # number of nodes/pods to run distributed sglang
  selector:
    matchLabels:
      app: distributed-sglang
  serviceName: ""
  template:
    metadata:
      labels:
        app: distributed-sglang
    spec:
      containers:
      - name: sglang-container
        image: docker.io/lmsysorg/sglang:latest
        imagePullPolicy: Always
        command:
        - /bin/bash
        - -c
        args:
        - |
          python3 -m sglang.launch_server \
          --model /llm-folder \
          --dist-init-addr sglang-master-pod:5000 \
          --tensor-parallel-size 16 \
          --nnodes 2 \
          --node-rank $POD_INDEX \
          --trust-remote-code \
          --host 0.0.0.0 \
          --port 8000 \
          --enable-metrics \
          --expert-parallel-size 16
        env:
        - name: POD_INDEX     # reflects the node-rank
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
        - name: NCCL_DEBUG
          value: INFO
        resources:
          limits:
            nvidia.com/gpu: "8"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /llm-folder
          name: llm
        securityContext:
          privileged: true   # to leverage RDMA/InfiniBand device
      hostNetwork: true
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 10Gi
        name: dshm
      - hostPath:
          path: /llm-folder # replace with PVC or hostPath with your model weights
          type: DirectoryOrCreate
        name: llm
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-master-pod
spec:
  type: ClusterIP
  selector:
    app: distributed-sglang
    apps.kubernetes.io/pod-index: "0"
  ports:
  - name: dist-port
    port: 5000
    targetPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-serving-on-master
spec:
  type: NodePort
  selector:
    app: distributed-sglang
    apps.kubernetes.io/pod-index: "0"
  ports:
  - name: serving
    port: 8000
    targetPort: 8000
  - name: metrics
    port: 8080
    targetPort: 8080
Apply the configuration:
kubectl apply -f distributed-sglang.yaml
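The pods come up as distributed-sglang-0 and distributed-sglang-1; rank 0 hosts the dist-init address and serves traffic once all ranks have joined:
# Watch both ranks start
kubectl get pods -l app=distributed-sglang -w

# Follow rank 0, which logs model loading and serving status
kubectl logs -f distributed-sglang-0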

LeaderWorkerSet (LWS) Deployment

LeaderWorkerSet (LWS) is the recommended approach for multi-node distributed inference: it schedules each leader together with its workers as a single group and, with RecreateGroupOnPodRestart, restarts the whole group whenever any member fails.

Prerequisites

Install the LeaderWorkerSet controller:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.6.0/manifests.yaml
Verify installation:
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

Basic LWS Configuration

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-leader
            image: lmsysorg/sglang:latest
            securityContext:
              privileged: true
            env:
              - name: NCCL_IB_GID_INDEX
                value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 40000
            readinessProbe:
              tcpSocket:
                port: 40000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: model
            hostPath:
              path: /path/to/models
          - name: ib
            hostPath:
              path: /dev/infiniband
    workerTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        hostIPC: true
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:latest
            securityContext:
              privileged: true
            env:
            - name: NCCL_IB_GID_INDEX
              value: "3"
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - /work/models
              - --tp
              - "16"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "8"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: model
                mountPath: /work/models
              - name: ib
                mountPath: /dev/infiniband
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ib
            hostPath:
              path: /dev/infiniband
          - name: model
            hostPath:
              path: /path/to/models
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader
  ports:
    - protocol: TCP
      port: 40000
      targetPort: 40000

Deploy with LWS

kubectl apply -f sglang-lws.yaml

Monitor LWS Deployment

# Check LeaderWorkerSet status
kubectl get lws sglang

# View all pods
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=sglang

# Check leader logs
kubectl logs -f sglang-0

# Check worker logs
kubectl logs -f sglang-0-1
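To exercise the endpoint without external exposure, port-forward the leader Service and call SGLang's native /generate endpoint:
kubectl port-forward svc/sglang-leader 40000:40000

# In another shell:
curl http://localhost:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'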

RDMA/InfiniBand Configuration

For high-performance multi-node setups with RDMA:

Prerequisites

  1. Verify InfiniBand devices on nodes:
ibstatus
ibdev2netdev
  2. Check RDMA accessibility:
rdma link show

RDMA-Enabled Deployment

spec:
  template:
    spec:
      hostNetwork: true
      hostIPC: true
      containers:
      - name: sglang
        securityContext:
          privileged: true
          capabilities:
            add:
            - IPC_LOCK
        env:
        - name: NCCL_IB_GID_INDEX
          value: "3"
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "8"
        - name: NCCL_IB_SPLIT_DATA_ON_QPS
          value: "1"
        - name: NCCL_NET_PLUGIN
          value: "none"
        - name: NCCL_IB_HCA
          value: "^=mlx5_0,mlx5_5,mlx5_6"
        - name: NCCL_DEBUG
          value: "INFO"
        volumeMounts:
        - name: ib
          mountPath: /dev/infiniband
      volumes:
      - name: ib
        hostPath:
          path: /dev/infiniband
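Once the ranks connect, NCCL's INFO output shows which transport was selected; NET/IB lines confirm the RDMA path, while NET/Socket means NCCL fell back to plain TCP:
# Confirm the InfiniBand transport is in use
kubectl logs <pod-name> | grep "NET/IB"

# A match here indicates a fallback to TCP sockets
kubectl logs <pod-name> | grep "NET/Socket"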

Storage Configuration

Persistent Volume for Model Cache

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany  # Required for multi-pod access
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-storage  # Use appropriate storage class
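With a ReadWriteMany claim in place, a one-shot Job can pre-populate the cache so the first server start does not pay the download cost. A minimal sketch, assuming the hf-token Secret from earlier and installing huggingface_hub at runtime:
apiVersion: batch/v1
kind: Job
metadata:
  name: prefetch-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prefetch
          image: python:3.11-slim # any image works; the CLI is installed below
          command: ["/bin/sh", "-c"]
          args:
            - pip install -q huggingface_hub &&
              huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token # illustrative; see the Secret example above
                  key: token
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: model-cache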

Using hostPath (Development)

volumes:
- name: model-cache
  hostPath:
    path: /data/models
    type: DirectoryOrCreate

Using NFS (Production)

volumes:
- name: model-cache
  nfs:
    server: nfs-server.example.com
    path: /exports/models

Resource Management

Resource Requests and Limits

resources:
  requests:
    cpu: "4"
    memory: "32Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: "64Gi"
    nvidia.com/gpu: "1"

Node Selection

nodeSelector:
  gpu-type: nvidia-a100
  node-role: inference

Tolerations

tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"

Monitoring and Observability

Enable Metrics

Add a metrics endpoint to your deployment:
args:
  - --enable-metrics
  - --metrics-port
  - "8080"

Prometheus Integration
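The ServiceMonitor below requires the Prometheus Operator, which provides the monitoring.coreos.com CRDs.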

apiVersion: v1
kind: Service
metadata:
  name: sglang-metrics
  labels:
    app: sglang
spec:
  selector:
    app: sglang
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-metrics
spec:
  selector:
    matchLabels:
      app: sglang
  endpoints:
  - port: metrics
    interval: 30s

Scaling

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
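CPU utilization is a weak proxy for GPU-bound inference load. If a Prometheus Adapter exposes SGLang's metrics as per-pod metrics, scaling on request depth tracks saturation more closely (a sketch; the metric name is illustrative and depends on your adapter configuration):
  metrics:
  - type: Pods
    pods:
      metric:
        name: sglang_num_running_reqs # illustrative; must match the name your adapter exposes
      target:
        type: AverageValue
        averageValue: "32"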

Troubleshooting

Pod Stuck in Pending

# Check events
kubectl describe pod <pod-name>

# Check GPU availability
kubectl describe nodes | grep -A 5 "Allocated resources"

NCCL Communication Failures

# Enable NCCL debug logs
env:
- name: NCCL_DEBUG
  value: "TRACE"

# Check network connectivity between pods
kubectl exec -it <pod-name> -- ping <other-pod-ip>

RDMA Issues

# Verify RDMA devices in container
kubectl exec -it <pod-name> -- ibv_devices
kubectl exec -it <pod-name> -- ibv_devinfo

# Check RDMA link status
kubectl exec -it <pod-name> -- rdma link show

Out of Memory

# Increase shared memory
volumes:
- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 20Gi  # Increase as needed
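Note that a memory-backed emptyDir counts against the pod's memory limit, so raise resources.limits.memory alongside sizeLimit.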

Best Practices

  1. Use StatefulSet for multi-node: StatefulSets provide stable network identities
  2. Enable hostNetwork for RDMA: Required for high-performance inter-node communication
  3. Set privileged mode for InfiniBand: Necessary for RDMA device access
  4. Use ReadWriteMany PVCs: Enable model sharing across pods
  5. Configure health probes: Implement both liveness and readiness probes
  6. Set resource limits: Prevent resource contention
  7. Use specific image tags: Avoid latest in production
  8. Tune the NCCL environment: Adjust NCCL_* variables to match your network topology

Next Steps