Overview
Multi-node deployment enables running large language models that require more GPUs than are available on a single node. The NVIDIA NIM Operator uses MPI (Message Passing Interface) and LeaderWorkerSet to orchestrate distributed model serving across multiple GPU nodes.
What is Multi-Node NIM?
Multi-node NIM allows you to:
Deploy models that require more GPUs than a single node provides (for example, more than 8)
Leverage tensor and pipeline parallelism for distributed inference
Scale beyond single-node GPU limits
Optimize large-model performance with MPI-based communication
Architecture
Each replica group runs as a LeaderWorkerSet: one leader pod plus one or more worker pods. The leader pod handles API requests and coordinates inference across all worker pods using MPI for inter-process communication.
Prerequisites
LeaderWorkerSet CRD
Install the LeaderWorkerSet operator in your cluster:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/lws/releases/download/v0.4.0/manifests.yaml
```
GPU Nodes
Ensure you have multiple nodes with GPUs available. For optimal performance:
Use nodes with identical GPU types
Enable high-speed networking (RDMA preferred)
Configure GPU drivers and NVIDIA device plugin
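One way to keep all pipeline stages on identical hardware is to pin scheduling to a specific GPU product label. This is a minimal sketch: the `nvidia.com/gpu.product` label is published by NVIDIA GPU Feature Discovery, but the label value shown is a placeholder, and placing `nodeSelector` directly under the NIMService `spec` is an assumption about the CRD schema.

```yaml
# Hypothetical fragment: restrict scheduling to one GPU model so the
# leader and all workers land on identical nodes.
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # example label value
```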
Storage with RWX Support
Multi-node deployments require ReadWriteMany (RWX) storage for shared model access:
NFS
CephFS
Cloud storage (EFS, Filestore, Azure Files)
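As an illustration of an RWX-capable storage class, the sketch below uses the Kubernetes NFS CSI driver (`nfs.csi.k8s.io`); the server address and export path are placeholders, and your cluster may use a different provisioner entirely.

```yaml
# Example StorageClass that supports ReadWriteMany via NFS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs.csi.k8s.io     # Kubernetes NFS CSI driver
parameters:
  server: nfs.example.com       # placeholder NFS server
  share: /exports/models        # placeholder export path
reclaimPolicy: Retain
volumeBindingMode: Immediate
```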
Network Configuration
Configure network for MPI communication:
Allow pod-to-pod communication across nodes
Open required MPI ports
For RDMA: Configure InfiniBand or RoCE
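In clusters that enforce NetworkPolicy, pod-to-pod MPI, SSH, and NCCL traffic can be silently dropped. A minimal sketch of a policy that allows all intra-namespace traffic (the policy name is illustrative; adapt the namespace to your deployment):

```yaml
# Allow all ingress between pods in the nim-service namespace so
# leader-worker MPI/SSH/NCCL connections are not blocked.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nim-intra-namespace
  namespace: nim-service
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # any pod in the same namespace
```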
Basic Multi-Node Configuration
Minimal Example
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: {}
  storage:
    pvc:
      create: true
      storageClass: ""             # RWX-capable storage class
      size: "100Gi"
      volumeAccessMode: ReadWriteMany
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1                      # Number of leader-worker groups
  resources:
    limits:
      nvidia.com/gpu: 8            # GPUs per pod
    requests:
      nvidia.com/gpu: 8
  expose:
    service:
      type: ClusterIP
      port: 8000
  multiNode:
    parallelism:
      pipeline: 2                  # 2 pipeline stages
      tensor: 8                    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000
```
Multi-Node Configuration
multiNode: Multi-node deployment configuration. When set, autoscaling must be disabled.
multiNode.backendType: Backend type for multi-node deployment. Currently only lws (LeaderWorkerSet) is supported.
multiNode.parallelism: Parallelism strategy configuration.
multiNode.parallelism.pipeline: Pipeline parallelism size (≥1). Number of pipeline stages, where each stage runs on a separate node. Example: pipeline: 2 means the model is split into 2 pipeline stages across 2 nodes.
multiNode.parallelism.tensor: Tensor parallelism size (≥1). Number of GPUs within each pipeline stage for parallel computation. Example: tensor: 8 means each pipeline stage uses 8 GPUs in parallel.
multiNode.mpi.mpiStartTimeout: Timeout in seconds for starting the MPI cluster. Increase for large models or slow networks. Recommended values:
Small models (under 70B): 300-600 seconds
Large models (70B+): 3000-6000 seconds
multiNode.computeDomain: Compute domain specification for resource allocation (advanced).
Restrictions:
Autoscaling (spec.scale.enabled) cannot be used with multi-node configuration
When spec.multiNode is set, spec.replicas is not allowed
Understanding Parallelism
Tensor Parallelism
Splits individual layers across multiple GPUs for parallel computation within a single forward pass.
```yaml
multiNode:
  parallelism:
    tensor: 8   # Uses 8 GPUs in parallel per pipeline stage
```
Total GPUs per pod = tensor parallelism
Example: tensor: 8 → Each pod uses 8 GPUs
Pipeline Parallelism
Splits the model into stages, where each stage runs on a different node/pod.
```yaml
multiNode:
  parallelism:
    pipeline: 2   # Model split into 2 stages
    tensor: 8     # Each stage uses 8 GPUs
```
Total pods in cluster = pipeline parallelism
Total GPUs = pipeline × tensor
Example: pipeline: 2, tensor: 8 → 2 pods, 16 GPUs total
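The arithmetic above can be double-checked with plain shell arithmetic. The variable names here are illustrative only, not operator settings:

```shell
# Total pods = pipeline; total GPUs = pipeline * tensor
pipeline=2
tensor=8
echo "pods=$pipeline gpus=$((pipeline * tensor))"
# prints: pods=2 gpus=16
```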
Example Configurations
16 GPUs (2 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 2   # 2 nodes
    tensor: 8     # 8 GPUs per node
# Total: 2 nodes × 8 GPUs = 16 GPUs
```

32 GPUs (4 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 4   # 4 nodes
    tensor: 8     # 8 GPUs per node
# Total: 4 nodes × 8 GPUs = 32 GPUs
```

64 GPUs (8 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 8   # 8 nodes
    tensor: 8     # 8 GPUs per node
# Total: 8 nodes × 8 GPUs = 64 GPUs
```
Complete Multi-Node Example
Here’s a production-ready configuration with all recommended settings:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  # Environment configuration for multi-node
  env:
    - name: NIM_USE_SGLANG
      value: "1"
    - name: HF_HOME
      value: /model-store/huggingface/hub
    - name: NUMBA_CACHE_DIR
      value: /tmp/numba
    # Network transport configuration
    - name: UCX_TLS
      value: ib,tcp,shm            # InfiniBand, TCP, shared memory
    - name: UCC_TLS
      value: ucp
    - name: UCC_CONFIG_FILE
      value: " "
    - name: GLOO_SOCKET_IFNAME
      value: eth0                  # Primary network interface
    - name: NCCL_SOCKET_IFNAME
      value: eth0                  # NCCL network interface
    - name: NIM_TRUST_CUSTOM_CODE
      value: "1"
  # Extended health probes for large model initialization
  readinessProbe:
    probe:
      failureThreshold: 3
      httpGet:
        path: /v1/health/ready
        port: "api"
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  startupProbe:
    probe:
      failureThreshold: 100
      httpGet:
        path: /v1/health/ready
        port: "api"
      initialDelaySeconds: 900     # 15 minutes
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  # Container image
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  # Storage with RWX access
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1
  # Resource allocation
  resources:
    limits:
      nvidia.com/gpu: 8
    requests:
      nvidia.com/gpu: 8
      cpu: "32"
      memory: 256Gi
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Multi-node configuration
  multiNode:
    backendType: lws
    parallelism:
      pipeline: 2                  # 2-stage pipeline
      tensor: 8                    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000        # 100 minutes for large models
```
RDMA-Enabled Multi-Node
For optimal performance with high-speed networking:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1-rdma
  namespace: nim-service
spec:
  env:
    - name: UCX_TLS
      value: rc,shm                # Reliable Connection (RDMA) + shared memory
    - name: NCCL_IB_DISABLE
      value: "0"                   # Enable InfiniBand for NCCL
    - name: NCCL_NET
      value: IB                    # Use InfiniBand network
    - name: NCCL_SOCKET_IFNAME
      value: ib0                   # InfiniBand interface
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1   # RDMA device
    requests:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
    mpi:
      mpiStartTimeout: 6000
```
RDMA significantly reduces latency for MPI communication, improving inference performance for multi-node deployments.
Resource Requirements
Storage
Multi-node deployments require ReadWriteMany access mode for shared model access across nodes.
```yaml
storage:
  pvc:
    create: true
    size: "100Gi"
    volumeAccessMode: ReadWriteMany   # Required for multi-node
    storageClass: nfs-storage         # Must support RWX
```
Compute Resources
Recommended resource allocation per pod:
Small Models (<70B)

```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    cpu: "16"
    memory: 128Gi
  requests:
    nvidia.com/gpu: 8
    cpu: "8"
    memory: 64Gi
```

Large Models (70B+)

```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    cpu: "32"
    memory: 256Gi
  requests:
    nvidia.com/gpu: 8
    cpu: "16"
    memory: 128Gi
```
MPI Configuration Details
Environment Variables
The operator automatically sets these MPI-related environment variables. The values below correspond to the pipeline: 2, tensor: 8 example; NIM_NODE_RANK and NIM_LEADER_ROLE differ between the leader and worker pods:

```
NIM_MULTI_NODE=1
NIM_NUM_COMPUTE_NODES=2
NIM_TENSOR_PARALLEL_SIZE=8
NIM_PIPELINE_PARALLEL_SIZE=2
NIM_NODE_RANK=0
NIM_LEADER_ROLE=1
OMPI_MCA_orte_keep_fqdn_hostnames=true
OMPI_MCA_plm_rsh_args="-o ConnectionAttempts=20"
GPUS_PER_NODE=8
CLUSTER_START_TIMEOUT=6000
```
SSH Configuration
The operator automatically configures SSH for MPI:
Generates SSH key pairs
Configures passwordless SSH between leader and workers
Mounts SSH keys and configuration
Status and Monitoring
Multi-node deployments include additional status fields:
```yaml
status:
  state: Ready
  availableReplicas: 1
  conditions:
    - type: NIM_SERVICE_READY
      status: "True"
  computeDomainStatus:
    ready: true
    message: "Compute domain configured successfully"
```
Monitoring MPI Cluster
Check the leader pod logs for MPI cluster formation:

```shell
kubectl logs -n nim-service -l app=deepseek-r1-lws,nim-llm-role=leader
```

Expected output:

```
[MPI] Starting MPI cluster with 2 nodes
[MPI] Leader: deepseek-r1-lws-0-0
[MPI] Worker 1: deepseek-r1-lws-0-1
[MPI] Cluster formation complete
[NIM] Model loaded successfully
```
Troubleshooting
MPI Cluster Startup Timeout
Symptom : Pods fail to reach ready state
Solution : Increase mpiStartTimeout:
```yaml
multiNode:
  mpi:
    mpiStartTimeout: 9000   # 150 minutes
```
Storage Access Errors
Symptom : Permission denied or read-only filesystem errors
Solution : Ensure PVC has ReadWriteMany access:
```shell
kubectl get pvc -n nim-service
# ACCESS MODES should show RWX
```
Network Communication Issues
Symptom : MPI errors about connection failures
Solution : Check network configuration:
```shell
# From the leader pod, verify pod-to-pod reachability
kubectl exec -it <leader-pod> -n nim-service -- ping <worker-pod-ip>
# Verify SSH connectivity
kubectl exec -it <leader-pod> -n nim-service -- ssh <worker-hostname> echo success
```
Worker Pods Not Starting
Symptom : Only leader pod is running
Solution : Verify LeaderWorkerSet is installed:
```shell
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
kubectl get lws -n nim-service
```
Best Practices
Use Identical Nodes Deploy across nodes with identical GPU types and configurations for consistent performance.
Enable RDMA Use RDMA-capable networking (InfiniBand or RoCE) for optimal inter-node communication performance.
Right-Size Timeout Set mpiStartTimeout based on model size:
Up to 70B: 3000s (50 min)
175B+: 6000s (100 min)
Monitor Resources Track GPU utilization, network bandwidth, and MPI communication overhead to optimize parallelism strategy.
Test Connectivity First Verify pod-to-pod networking and SSH connectivity before deploying large models.
Use Fast Storage Choose high-performance RWX storage (NVMe-backed NFS, Lustre) for faster model loading.
Network Optimization
```yaml
env:
  # NCCL optimizations
  - name: NCCL_IB_HCA
    value: mlx5                    # Specify InfiniBand adapter
  - name: NCCL_IB_GID_INDEX
    value: "3"
  - name: NCCL_NET_GDR_LEVEL
    value: "5"                     # Enable GPUDirect RDMA
  # UCX optimizations
  - name: UCX_NET_DEVICES
    value: mlx5_0:1                # Specify RDMA device
  - name: UCX_TLS
    value: rc,cuda_copy,cuda_ipc   # RDMA + GPUDirect
```
Model Loading Optimization
```yaml
env:
  - name: TENSOR_PARALLEL_SIZE
    value: "8"
  - name: PIPELINE_PARALLEL_SIZE
    value: "2"
  - name: MAX_NUM_BATCHED_TOKENS
    value: "8192"
  - name: MAX_NUM_SEQS
    value: "256"
```