Overview
Multi-node deployment enables running large language models that require more GPUs than are available on a single node. The NVIDIA NIM Operator uses MPI (Message Passing Interface) and LeaderWorkerSet to orchestrate distributed model serving across multiple GPU nodes.
What is Multi-Node NIM?
Multi-node NIM allows you to:
Deploy models that require more GPUs than a single node provides (for example, more than 8)
Leverage tensor and pipeline parallelism for distributed inference
Scale beyond single-node GPU limits
Optimize large-model performance with MPI-based communication
Architecture
Each replica group runs as a LeaderWorkerSet: one leader pod plus one or more worker pods. The leader pod handles API requests and coordinates inference across all worker pods using MPI for inter-process communication.
Prerequisites
LeaderWorkerSet CRD
Install the LeaderWorkerSet operator in your cluster:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/lws/releases/download/v0.4.0/manifests.yaml
```
GPU Nodes
Ensure you have multiple nodes with GPUs available. For optimal performance:
Use nodes with identical GPU types
Enable high-speed networking (RDMA preferred)
Configure GPU drivers and NVIDIA device plugin
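One way to keep all pipeline stages on identical hardware is to pin scheduling to a specific GPU product label. This is a minimal sketch: the `nvidia.com/gpu.product` label is published by NVIDIA GPU Feature Discovery, but the label value shown is a placeholder, and placing `nodeSelector` directly under the NIMService `spec` is an assumption about the CRD schema.

```yaml
# Hypothetical fragment: restrict scheduling to one GPU model so the
# leader and all workers land on identical nodes.
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # example label value
```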
Storage with RWX Support
Multi-node deployments require ReadWriteMany (RWX) storage for shared model access:
NFS
CephFS
Cloud storage (EFS, Filestore, Azure Files)
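As an illustration of an RWX-capable storage class, the sketch below uses the Kubernetes NFS CSI driver (`nfs.csi.k8s.io`); the server address and export path are placeholders, and your cluster may use a different provisioner entirely.

```yaml
# Example StorageClass that supports ReadWriteMany via NFS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs.csi.k8s.io     # Kubernetes NFS CSI driver
parameters:
  server: nfs.example.com       # placeholder NFS server
  share: /exports/models        # placeholder export path
reclaimPolicy: Retain
volumeBindingMode: Immediate
```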
Network Configuration
Configure network for MPI communication:
Allow pod-to-pod communication across nodes
Open required MPI ports
For RDMA: Configure InfiniBand or RoCE
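In clusters that enforce NetworkPolicy, pod-to-pod MPI, SSH, and NCCL traffic can be silently dropped. A minimal sketch of a policy that allows all intra-namespace traffic (the policy name is illustrative; adapt the namespace to your deployment):

```yaml
# Allow all ingress between pods in the nim-service namespace so
# leader-worker MPI/SSH/NCCL connections are not blocked.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nim-intra-namespace
  namespace: nim-service
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # any pod in the same namespace
```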
Basic Multi-Node Configuration
Minimal Example
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: deepseek-r1-nimcache
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: {}
  storage:
    pvc:
      create: true
      storageClass: ""             # RWX-capable storage class
      size: "100Gi"
      volumeAccessMode: ReadWriteMany
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1                      # Number of leader-worker groups
  resources:
    limits:
      nvidia.com/gpu: 8            # GPUs per pod
    requests:
      nvidia.com/gpu: 8
  expose:
    service:
      type: ClusterIP
      port: 8000
  multiNode:
    parallelism:
      pipeline: 2                  # 2 pipeline stages
      tensor: 8                    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000
```
Multi-Node Configuration
multiNode: Multi-node deployment configuration. When set, autoscaling must be disabled.
multiNode.backendType: Backend type for multi-node deployment. Currently only lws (LeaderWorkerSet) is supported.
multiNode.parallelism: Parallelism strategy configuration.
multiNode.parallelism.pipeline: Pipeline parallelism size (≥1). Number of pipeline stages, where each stage runs on a separate node. Example: pipeline: 2 means the model is split into 2 pipeline stages across 2 nodes.
multiNode.parallelism.tensor: Tensor parallelism size (≥1). Number of GPUs within each pipeline stage for parallel computation. Example: tensor: 8 means each pipeline stage uses 8 GPUs in parallel.
multiNode.mpi.mpiStartTimeout: Timeout in seconds for starting the MPI cluster. Increase for large models or slow networks. Recommended values:
Small models (under 70B): 300-600 seconds
Large models (70B+): 3000-6000 seconds
multiNode.computeDomain: Compute domain specification for resource allocation (advanced).
Restrictions:
Autoscaling (spec.scale.enabled) cannot be used with multi-node configuration
When spec.multiNode is set, spec.replicas is not allowed
Understanding Parallelism
Tensor Parallelism
Splits individual layers across multiple GPUs for parallel computation within a single forward pass.
```yaml
multiNode:
  parallelism:
    tensor: 8   # Uses 8 GPUs in parallel per pipeline stage
```
Total GPUs per pod = tensor parallelism
Example: tensor: 8 → Each pod uses 8 GPUs
Pipeline Parallelism
Splits the model into stages, where each stage runs on a different node/pod.
```yaml
multiNode:
  parallelism:
    pipeline: 2   # Model split into 2 stages
    tensor: 8     # Each stage uses 8 GPUs
```
Total pods in cluster = pipeline parallelism
Total GPUs = pipeline × tensor
Example: pipeline: 2, tensor: 8 → 2 pods, 16 GPUs total
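The arithmetic above can be double-checked with plain shell arithmetic. The variable names here are illustrative only, not operator settings:

```shell
# Total pods = pipeline; total GPUs = pipeline * tensor
pipeline=2
tensor=8
echo "pods=$pipeline gpus=$((pipeline * tensor))"
# prints: pods=2 gpus=16
```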
Example Configurations
16 GPUs (2 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 2   # 2 nodes
    tensor: 8     # 8 GPUs per node
# Total: 2 nodes × 8 GPUs = 16 GPUs
```

32 GPUs (4 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 4   # 4 nodes
    tensor: 8     # 8 GPUs per node
# Total: 4 nodes × 8 GPUs = 32 GPUs
```

64 GPUs (8 Nodes)

```yaml
multiNode:
  parallelism:
    pipeline: 8   # 8 nodes
    tensor: 8     # 8 GPUs per node
# Total: 8 nodes × 8 GPUs = 64 GPUs
```
Complete Multi-Node Example
Here’s a production-ready configuration with all recommended settings:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1
  namespace: nim-service
spec:
  # Environment configuration for multi-node
  env:
    - name: NIM_USE_SGLANG
      value: "1"
    - name: HF_HOME
      value: /model-store/huggingface/hub
    - name: NUMBA_CACHE_DIR
      value: /tmp/numba
    # Network transport configuration
    - name: UCX_TLS
      value: ib,tcp,shm            # InfiniBand, TCP, shared memory
    - name: UCC_TLS
      value: ucp
    - name: UCC_CONFIG_FILE
      value: " "
    - name: GLOO_SOCKET_IFNAME
      value: eth0                  # Primary network interface
    - name: NCCL_SOCKET_IFNAME
      value: eth0                  # NCCL network interface
    - name: NIM_TRUST_CUSTOM_CODE
      value: "1"
  # Extended health probes for large model initialization
  readinessProbe:
    probe:
      failureThreshold: 3
      httpGet:
        path: /v1/health/ready
        port: "api"
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  startupProbe:
    probe:
      failureThreshold: 100
      httpGet:
        path: /v1/health/ready
        port: "api"
      initialDelaySeconds: 900     # 15 minutes
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  # Container image
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  # Storage with RWX access
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1
  # Resource allocation
  resources:
    limits:
      nvidia.com/gpu: 8
    requests:
      nvidia.com/gpu: 8
      cpu: "32"
      memory: 256Gi
  # Service exposure
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Multi-node configuration
  multiNode:
    backendType: lws
    parallelism:
      pipeline: 2                  # 2-stage pipeline
      tensor: 8                    # 8-way tensor parallelism
    mpi:
      mpiStartTimeout: 6000        # 100 minutes for large models
```
RDMA-Enabled Multi-Node
For optimal performance with high-speed networking:
```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: deepseek-r1-rdma
  namespace: nim-service
spec:
  env:
    - name: UCX_TLS
      value: rc,shm                # Reliable Connection (RDMA) + shared memory
    - name: NCCL_IB_DISABLE
      value: "0"                   # Enable InfiniBand for NCCL
    - name: NCCL_NET
      value: IB                    # Use InfiniBand network
    - name: NCCL_SOCKET_IFNAME
      value: ib0                   # InfiniBand interface
  image:
    repository: nvcr.io/nim/deepseek-ai/deepseek-r1
    tag: "1.7.3"
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: deepseek-r1-nimcache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1   # RDMA device
    requests:
      nvidia.com/gpu: 8
      rdma/rdma_shared_device_a: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  multiNode:
    parallelism:
      pipeline: 2
      tensor: 8
    mpi:
      mpiStartTimeout: 6000
```
RDMA significantly reduces latency for MPI communication, improving inference performance for multi-node deployments.
Resource Requirements
Storage
Multi-node deployments require ReadWriteMany access mode for shared model access across nodes.
```yaml
storage:
  pvc:
    create: true
    size: "100Gi"
    volumeAccessMode: ReadWriteMany   # Required for multi-node
    storageClass: nfs-storage         # Must support RWX
```
Compute Resources
Recommended resource allocation per pod:
Small Models (<70B)

```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    cpu: "16"
    memory: 128Gi
  requests:
    nvidia.com/gpu: 8
    cpu: "8"
    memory: 64Gi
```

Large Models (70B+)

```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    cpu: "32"
    memory: 256Gi
  requests:
    nvidia.com/gpu: 8
    cpu: "16"
    memory: 128Gi
```
MPI Configuration Details
Environment Variables
The operator automatically sets these MPI-related environment variables. The values below correspond to the pipeline: 2, tensor: 8 example; NIM_NODE_RANK and NIM_LEADER_ROLE differ between the leader and worker pods:

```
NIM_MULTI_NODE=1
NIM_NUM_COMPUTE_NODES=2
NIM_TENSOR_PARALLEL_SIZE=8
NIM_PIPELINE_PARALLEL_SIZE=2
NIM_NODE_RANK=0
NIM_LEADER_ROLE=1
OMPI_MCA_orte_keep_fqdn_hostnames=true
OMPI_MCA_plm_rsh_args="-o ConnectionAttempts=20"
GPUS_PER_NODE=8
CLUSTER_START_TIMEOUT=6000
```
SSH Configuration
The operator automatically configures SSH for MPI:
Generates SSH key pairs
Configures passwordless SSH between leader and workers
Mounts SSH keys and configuration
Status and Monitoring
Multi-node deployments include additional status fields:
```yaml
status:
  state: Ready
  availableReplicas: 1
  conditions:
    - type: NIM_SERVICE_READY
      status: "True"
  computeDomainStatus:
    ready: true
    message: "Compute domain configured successfully"
```
Monitoring MPI Cluster
Check the leader pod logs for MPI cluster formation:

```shell
kubectl logs -n nim-service -l app=deepseek-r1-lws,nim-llm-role=leader
```

Expected output:

```
[MPI] Starting MPI cluster with 2 nodes
[MPI] Leader: deepseek-r1-lws-0-0
[MPI] Worker 1: deepseek-r1-lws-0-1
[MPI] Cluster formation complete
[NIM] Model loaded successfully
```
Troubleshooting
MPI Cluster Startup Timeout
Symptom : Pods fail to reach ready state
Solution : Increase mpiStartTimeout:
```yaml
multiNode:
  mpi:
    mpiStartTimeout: 9000   # 150 minutes
```
Storage Access Errors
Symptom : Permission denied or read-only filesystem errors
Solution : Ensure PVC has ReadWriteMany access:
```shell
kubectl get pvc -n nim-service
# ACCESS MODES should show RWX
```
Network Communication Issues
Symptom : MPI errors about connection failures
Solution : Check network configuration:
```shell
# From the leader pod, verify pod-to-pod reachability
kubectl exec -it <leader-pod> -n nim-service -- ping <worker-pod-ip>
# Verify SSH connectivity
kubectl exec -it <leader-pod> -n nim-service -- ssh <worker-hostname> echo success
```
Worker Pods Not Starting
Symptom : Only leader pod is running
Solution : Verify LeaderWorkerSet is installed:
```shell
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
kubectl get lws -n nim-service
```
Best Practices
Use Identical Nodes Deploy across nodes with identical GPU types and configurations for consistent performance.
Enable RDMA Use RDMA-capable networking (InfiniBand or RoCE) for optimal inter-node communication performance.
Right-Size Timeout Set mpiStartTimeout based on model size:
Up to 70B: 3000s (50 min)
175B+: 6000s (100 min)
Monitor Resources Track GPU utilization, network bandwidth, and MPI communication overhead to optimize parallelism strategy.
Test Connectivity First Verify pod-to-pod networking and SSH connectivity before deploying large models.
Use Fast Storage Choose high-performance RWX storage (NVMe-backed NFS, Lustre) for faster model loading.
Network Optimization
```yaml
env:
  # NCCL optimizations
  - name: NCCL_IB_HCA
    value: mlx5                    # Specify InfiniBand adapter
  - name: NCCL_IB_GID_INDEX
    value: "3"
  - name: NCCL_NET_GDR_LEVEL
    value: "5"                     # Enable GPUDirect RDMA
  # UCX optimizations
  - name: UCX_NET_DEVICES
    value: mlx5_0:1                # Specify RDMA device
  - name: UCX_TLS
    value: rc,cuda_copy,cuda_ipc   # RDMA + GPUDirect
```
Model Loading Optimization
```yaml
env:
  - name: TENSOR_PARALLEL_SIZE
    value: "8"
  - name: PIPELINE_PARALLEL_SIZE
    value: "2"
  - name: MAX_NUM_BATCHED_TOKENS
    value: "8192"
  - name: MAX_NUM_SEQS
    value: "256"
```