
Overview

Multi-node deployment enables running large language models that require more GPUs than available on a single node. The NVIDIA NIM Operator uses MPI (Message Passing Interface) and LeaderWorkerSet to orchestrate distributed model serving across multiple GPU nodes.

What is Multi-Node NIM?

Multi-node NIM allows you to:
  • Deploy models requiring 8+ GPUs across multiple nodes
  • Leverage tensor and pipeline parallelism for distributed inference
  • Scale beyond single-node GPU limits
  • Optimize large model performance with MPI-based communication

Architecture

The leader pod handles API requests and coordinates inference across all worker pods using MPI for inter-process communication.

Prerequisites

1. LeaderWorkerSet CRD

Install the LeaderWorkerSet operator in your cluster:
kubectl apply -f https://github.com/kubernetes-sigs/lws/releases/download/v0.4.0/manifests.yaml
2. GPU Nodes

Ensure you have multiple nodes with GPUs available. For optimal performance:
  • Use nodes with identical GPU types
  • Enable high-speed networking (RDMA preferred)
  • Configure GPU drivers and NVIDIA device plugin
3. Storage with RWX Support

Multi-node deployments require ReadWriteMany (RWX) storage for shared model access:
  • NFS
  • CephFS
  • Cloud storage (EFS, Filestore, Azure Files)
4. Network Configuration

Configure network for MPI communication:
  • Allow pod-to-pod communication across nodes
  • Open required MPI ports
  • For RDMA: Configure InfiniBand or RoCE
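The pod-to-pod requirement can be expressed as a NetworkPolicy. This is a minimal sketch, assuming the `nim-service` namespace used in the examples below; the policy name is illustrative, and clusters without a NetworkPolicy controller allow this traffic by default:

```yaml
# Sketch: allow all pod-to-pod traffic within the nim-service namespace
# so the MPI leader and worker pods can reach each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nim-intra-namespace  # illustrative name
  namespace: nim-service
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # accept traffic from any pod in the same namespace
```

If your cluster enforces stricter policies, narrow the selectors to the deployment's pod labels and restrict ports to the SSH/MPI and API ports your NIM image uses.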

Basic Multi-Node Configuration

Minimal Example

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
  storage:
    pvc:
      create: true
      storageClass: ''  # RWX-capable storage class
      size: "100Gi"
      volumeAccessMode: ReadWriteMany

---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1  # Number of leader-worker groups
  resources:
    limits:
      nvidia.com/gpu: 8  # GPUs per pod
    requests:
      nvidia.com/gpu: 8
  expose:
    service:
      type: ClusterIP
      port: 8000
  multiNode:
    parallelism:
      pipeline: 2  # 2 pipeline stages
      tensor: 8    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000

Multi-Node Configuration

spec.multiNode
object
Multi-node deployment configuration. When set, autoscaling must be disabled.
  • Autoscaling (spec.scale.enabled) cannot be used with multi-node configuration
  • When spec.multiNode is set, spec.replicas counts leader-worker groups, not individual pods

Understanding Parallelism

Tensor Parallelism

Splits individual layers across multiple GPUs for parallel computation within a single forward pass.
multiNode:
  parallelism:
    tensor: 8  # Uses 8 GPUs in parallel per pipeline stage
Total GPUs per pod = tensor parallelism.
Example: tensor: 8 → each pod uses 8 GPUs.

Pipeline Parallelism

Splits the model into stages, where each stage runs on a different node/pod.
multiNode:
  parallelism:
    pipeline: 2  # Model split into 2 stages
    tensor: 8    # Each stage uses 8 GPUs
Total pods in cluster = pipeline parallelism
Total GPUs = pipeline × tensor
Example: pipeline: 2, tensor: 8 → 2 pods, 16 GPUs total
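The arithmetic above can be sketched in plain shell (illustrative only; these variable names are not settings read by the operator):

```shell
# Illustrative arithmetic: derive cluster totals from the parallelism values.
PIPELINE=2   # pipeline stages = pods per leader-worker group
TENSOR=8     # tensor-parallel GPUs per pod
PODS=$PIPELINE
TOTAL_GPUS=$((PIPELINE * TENSOR))
echo "pods=$PODS total_gpus=$TOTAL_GPUS"  # pods=2 total_gpus=16
```

Each node must be able to supply the per-pod GPU count, so `tensor` should not exceed the GPUs available on a single node.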

Example Configurations

multiNode:
  parallelism:
    pipeline: 2  # 2 nodes
    tensor: 8    # 8 GPUs per node
  # Total: 2 nodes × 8 GPUs = 16 GPUs

Complete Multi-Node Example

Here’s a production-ready configuration with all recommended settings:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  # Environment configuration for multi-node
  env:
  - name: NIM_USE_SGLANG
    value: "1"
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NUMBA_CACHE_DIR
    value: /tmp/numba
  # Network transport configuration
  - name: UCX_TLS
    value: ib,tcp,shm  # InfiniBand, TCP, shared memory
  - name: UCC_TLS
    value: ucp
  - name: UCC_CONFIG_FILE
    value: " "
  - name: GLOO_SOCKET_IFNAME
    value: eth0  # Primary network interface
  - name: NCCL_SOCKET_IFNAME
    value: eth0  # NCCL network interface
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"
  
  # Extended health probes for large model initialization
  readinessProbe:
    probe:
      failureThreshold: 3
      httpGet:
        path: "/v1/health/ready"
        port: "api"
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  
  startupProbe:
    probe:
      failureThreshold: 100
      httpGet:
        path: "/v1/health/ready"
        port: "api"
      initialDelaySeconds: 900  # 15 minutes
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  
  # Container image
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  
  authSecret: ngc-api-secret
  
  # Storage with RWX access
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  
  replicas: 1
  
  # Resource allocation
  resources:
    limits:
      nvidia.com/gpu: 8
    requests:
      nvidia.com/gpu: 8
      cpu: "32"
      memory: 256Gi
  
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  # Multi-node configuration
  multiNode:
    backendType: lws
    parallelism:
      pipeline: 2  # 2-stage pipeline
      tensor: 8    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000  # 100 minutes for large model

RDMA-Enabled Multi-Node

For optimal performance with high-speed networking:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1-rdma
  namespace: nim-service
spec:
  env:
  - name: UCX_TLS
    value: rc,shm  # Reliable Connection (RDMA) + shared memory
  - name: NCCL_IB_DISABLE
    value: "0"  # Enable InfiniBand for NCCL
  - name: NCCL_NET
    value: IB  # Use InfiniBand network
  - name: NCCL_SOCKET_IFNAME
    value: ib0  # InfiniBand interface
  
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1  # RDMA device
    requests:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1
  
  expose:
    service:
      type: ClusterIP
      port: 8000
  
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
    mpi:
      mpiStartTimeout: 6000
RDMA significantly reduces latency for MPI communication, improving inference performance for multi-node deployments.

Resource Requirements

Storage

volumeAccessMode (default: ReadWriteMany)
Multi-node deployments require the ReadWriteMany access mode for shared model access across nodes.
storage:
  pvc:
    create: true
    size: "100Gi"
    volumeAccessMode: ReadWriteMany  # Required for multi-node
    storageClass: nfs-storage  # Must support RWX

Compute Resources

Recommended resource allocation per pod:
resources:
  limits:
    nvidia.com/gpu: 8
    cpu: "16"
    memory: 128Gi
  requests:
    nvidia.com/gpu: 8
    cpu: "8"
    memory: 64Gi

MPI Configuration Details

Environment Variables

The operator automatically sets these MPI-related environment variables:
NIM_MULTI_NODE=1
NIM_NUM_COMPUTE_NODES=2
NIM_TENSOR_PARALLEL_SIZE=8
NIM_PIPELINE_PARALLEL_SIZE=2
NIM_NODE_RANK=0
NIM_LEADER_ROLE=1
OMPI_MCA_orte_keep_fqdn_hostnames=true
OMPI_MCA_plm_rsh_args="-o ConnectionAttempts=20"
GPUS_PER_NODE=8
CLUSTER_START_TIMEOUT=6000

SSH Configuration

The operator automatically configures SSH for MPI:
  • Generates SSH key pairs
  • Configures passwordless SSH between leader and workers
  • Mounts SSH keys and configuration

Status and Monitoring

Multi-node deployments include additional status fields:
status:
  state: Ready
  availableReplicas: 1
  conditions:
  - type: NIM_SERVICE_READY
    status: "True"
  computeDomainStatus:
    ready: true
    message: "Compute domain configured successfully"

Monitoring MPI Cluster

Check leader pod logs for MPI cluster formation:
kubectl logs -n nim-service -l app=deepseek-r1-lws,nim-llm-role=leader
Expected output:
[MPI] Starting MPI cluster with 2 nodes
[MPI] Leader: deepseek-r1-lws-0-0
[MPI] Worker 1: deepseek-r1-lws-0-1
[MPI] Cluster formation complete
[NIM] Model loaded successfully

Troubleshooting

MPI Cluster Startup Timeout

Symptom: Pods fail to reach the ready state.
Solution: Increase mpiStartTimeout:
multiNode:
  mpi:
    mpiStartTimeout: 9000  # 150 minutes

Storage Access Errors

Symptom: Permission denied or read-only filesystem errors.
Solution: Ensure the PVC has ReadWriteMany access:
kubectl get pvc -n nim-service
# ACCESS MODES should show RWX

Network Communication Issues

Symptom: MPI errors about connection failures.
Solution: Check the network configuration:
# From leader pod
kubectl exec -it <leader-pod> -n nim-service -- ping <worker-pod-ip>

# Verify SSH connectivity
kubectl exec -it <leader-pod> -n nim-service -- ssh <worker-hostname> echo success

Worker Pods Not Starting

Symptom: Only the leader pod is running.
Solution: Verify that LeaderWorkerSet is installed:
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
kubectl get lws -n nim-service

Best Practices

Use Identical Nodes

Deploy across nodes with identical GPU types and configurations for consistent performance.

Enable RDMA

Use RDMA-capable networking (InfiniBand or RoCE) for optimal inter-node communication performance.

Right-Size Timeout

Set mpiStartTimeout based on model size:
  • 70B: 3000s (50 min)
  • 175B+: 6000s (100 min)

Monitor Resources

Track GPU utilization, network bandwidth, and MPI communication overhead to optimize parallelism strategy.

Test Connectivity First

Verify pod-to-pod networking and SSH connectivity before deploying large models.

Use Fast Storage

Choose high-performance RWX storage (NVMe-backed NFS, Lustre) for faster model loading.
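As one example of an RWX-capable class, here is a sketch using the Kubernetes NFS CSI driver (csi-driver-nfs); the server address and export path are placeholders for your environment, and the class name matches the `nfs-storage` value used in the storage example above:

```yaml
# Hypothetical StorageClass backed by an NFS server via the NFS CSI driver.
# Replace server/share with your NFS endpoint.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.com       # placeholder NFS server
  share: /exports/nim-models    # placeholder export path
reclaimPolicy: Retain           # keep cached models if the PVC is deleted
mountOptions:
  - nfsvers=4.1
```

Backing the export with NVMe storage helps the leader and worker pods load large model shards quickly in parallel.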

Performance Tuning

Network Optimization

env:
# NCCL optimizations
- name: NCCL_IB_HCA
  value: mlx5  # Specify InfiniBand adapter
- name: NCCL_IB_GID_INDEX
  value: "3"
- name: NCCL_NET_GDR_LEVEL
  value: "5"  # Enable GPU Direct RDMA

# UCX optimizations  
- name: UCX_NET_DEVICES
  value: mlx5_0:1  # Specify RDMA device
- name: UCX_TLS
  value: rc,cuda_copy,cuda_ipc  # RDMA + GPU Direct

Model Loading Optimization

env:
- name: TENSOR_PARALLEL_SIZE
  value: "8"
- name: PIPELINE_PARALLEL_SIZE
  value: "2"
- name: MAX_NUM_BATCHED_TOKENS
  value: "8192"
- name: MAX_NUM_SEQS
  value: "256"
