Overview
Multi-node deployment enables SGLang to serve large models that exceed single-node GPU memory or require high throughput. This guide covers tensor parallelism, expert parallelism, and prefill-decode disaggregation across nodes.

Prerequisites
- Multiple compute nodes with GPUs
- High-speed interconnect (InfiniBand, RoCE, or high-bandwidth Ethernet)
- Consistent network topology between nodes
- Shared storage or synchronized model weights
- NCCL 2.28.3 or later
Basic Multi-Node Setup
Two-Node Tensor Parallelism
Deploy a large model across two nodes with 8 GPUs each.

Key Parameters
| Parameter | Description | Example |
|---|---|---|
| --tp | Total tensor parallel size (GPUs across all nodes) | 16 |
| --dist-init-addr | Master node IP and port for coordination | 192.168.1.10:20000 |
| --nnodes | Total number of nodes | 2 |
| --node-rank | Rank of current node (0 for master) | 0 or 1 |
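Putting these parameters together, a two-node launch might look like the following sketch. The model path and IP address are placeholders; run one command per node, changing only --node-rank:

```shell
# Node 0 (master, rank 0) — hosts the coordination endpoint
python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 16 \
  --dist-init-addr 192.168.1.10:20000 \
  --nnodes 2 \
  --node-rank 0

# Node 1 (rank 1) — identical command, only the rank changes
python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 16 \
  --dist-init-addr 192.168.1.10:20000 \
  --nnodes 2 \
  --node-rank 1
```

Both nodes must use the same --dist-init-addr (the master's address); the server is ready once all 16 ranks have joined.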
SLURM Deployment
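A batch script for such a launch might look like the sketch below. The partition, model path, and resource requests are assumptions; adapt them to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=sglang-multinode
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# The first hostname in the allocation serves as the coordination master
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One task per node; each task reads its own SLURM_NODEID as the node rank
srun bash -c "python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 16 \
  --dist-init-addr ${MASTER_ADDR}:20000 \
  --nnodes 2 \
  --node-rank \$SLURM_NODEID"
```

Note the escaped \$SLURM_NODEID: it must expand inside each srun task, not once in the batch script.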
For HPC clusters with SLURM, wrap the launch command in a batch script and derive --node-rank from SLURM_NODEID rather than hard-coding it.

MoE Models with Expert Parallelism
For DeepSeek-V3/R1 and other MoE models, add expert-parallelism flags on top of the base multi-node launch.

MoE-Specific Parameters
| Parameter | Description | Recommended |
|---|---|---|
| --ep | Expert parallel size | Same as --tp |
| --moe-a2a-backend | All-to-all communication backend | deepep |
| --enable-dp-attention | Enable data-parallel attention | For large MoE |
| --enable-dp-lm-head | Enable data-parallel LM head | For large MoE |
| --dp-size | Data parallel size | Same as --tp |
| --ep-num-redundant-experts | Redundant expert copies | 32 for DeepSeek |
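For a two-node, 16-GPU deployment, these flags would be appended to the base multi-node launch on every node. A fragment sketch (values follow the table above):

```shell
# MoE additions to the base launch command (same on every node)
  --ep 16 \
  --moe-a2a-backend deepep \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --dp-size 16 \
  --ep-num-redundant-experts 32
```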
RDMA/InfiniBand Configuration
For optimal performance with RDMA, work through the following steps.

Verify RDMA Setup
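The standard InfiniBand tooling can confirm the fabric is healthy before involving NCCL at all. The IP address below is a placeholder:

```shell
# List HCAs; ports should show State: Active and Physical state: LinkUp
ibstat

# Per-device details: GIDs, MTU, and link layer (InfiniBand vs. Ethernet/RoCE)
ibv_devinfo

# Point-to-point bandwidth test from the perftest suite:
ib_write_bw                 # run first on node A (server)
ib_write_bw 192.168.1.10    # then on node B, pointing at node A
```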
NCCL Environment Variables
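A typical RDMA-oriented NCCL environment might look like this; the device and interface names are examples from ibstat output and will differ on your nodes:

```shell
# Restrict NCCL to the intended IB HCAs (names from ibstat)
export NCCL_IB_HCA=mlx5_0,mlx5_1

# GID index for RoCE v2 fabrics (commonly 3, but site-dependent;
# not needed on native InfiniBand)
export NCCL_IB_GID_INDEX=3

# Keep bootstrap/socket traffic on the management interface
export NCCL_SOCKET_IFNAME=eth0

# Verbose logging while validating the setup
export NCCL_DEBUG=INFO
```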
Launch with RDMA
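With the environment in place, the launch is unchanged; a sketch for node 0 (model path and addresses are placeholders):

```shell
NCCL_IB_HCA=mlx5_0,mlx5_1 NCCL_SOCKET_IFNAME=eth0 \
python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 16 \
  --dist-init-addr 192.168.1.10:20000 \
  --nnodes 2 \
  --node-rank 0
```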
Prefill-Decode Disaggregation
Separate prefill and decode stages for optimal resource utilization.

Prefill Nodes
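A sketch of a prefill-side launch. The --disaggregation-mode flag follows SGLang's prefill-decode disaggregation interface in recent releases; verify the exact flag names against your version:

```shell
# On each prefill node
python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 8 \
  --disaggregation-mode prefill \
  --host 0.0.0.0 --port 30000
```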
Decode Nodes
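The decode side mirrors the prefill launch with the mode flipped (same caveat on flag names; the port is an arbitrary choice here):

```shell
# On each decode node
python -m sglang.launch_server \
  --model-path /models/large-model \
  --tp 8 \
  --disaggregation-mode decode \
  --host 0.0.0.0 --port 30001
```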
Router/Load Balancer
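A minimal load balancer ties the two pools together. The module path and flags below are assumptions based on one SGLang release (check `--help` output, or use sglang-router instead); hostnames are placeholders:

```shell
python -m sglang.srt.disaggregation.mini_lb \
  --prefill http://prefill-node-0:30000 \
  --decode http://decode-node-0:30001 \
  --host 0.0.0.0 --port 8000
```

Clients then send requests to the balancer's port (8000 here) rather than to individual nodes.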
Kubernetes Multi-Node Deployment
See the Kubernetes deployment guide for StatefulSet and LeaderWorkerSet configurations.

Quick Example
Network Configuration
Firewall Rules
Open required ports between nodes: the --dist-init-addr coordination port (20000 in the examples above), the HTTP server port (30000 by default), and, for TCP transports, the ephemeral ports NCCL opens between ranks.

Network Interface Selection
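On multi-homed nodes, pin the collective and bootstrap traffic to the intended interfaces rather than letting the libraries guess. Interface and device names below are examples:

```shell
# NCCL bootstrap and TCP fallback traffic
export NCCL_SOCKET_IFNAME=eth0

# torch.distributed Gloo groups (used for coordination)
export GLOO_SOCKET_IFNAME=eth0

# RDMA data path, if an HCA is present
export NCCL_IB_HCA=mlx5_0
```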
Network Topology
For optimal performance, ensure:
- Low latency: < 10 μs for InfiniBand, < 100 μs for Ethernet
- High bandwidth: ≥ 200 Gbps per GPU
- Consistent topology: Same switch for all nodes (ideal)
Performance Optimization
NCCL Tuning
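A few NCCL knobs worth experimenting with; the values here are starting points, not recommendations from the SGLang project, and the right settings depend on your topology:

```shell
# Allow GPUDirect RDMA when the NIC and GPU share a PCIe switch
# (levels: LOC < PIX < PXB < PHB < SYS)
export NCCL_NET_GDR_LEVEL=PXB

# More channels can help saturate links on large all-reduce/all-to-all volumes
export NCCL_MIN_NCHANNELS=16

# Larger transport buffers for high-bandwidth links (bytes)
export NCCL_BUFFSIZE=8388608
```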
Memory Configuration
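Memory behavior is governed mainly by two server flags. A fragment sketch (the values are illustrative starting points):

```shell
# Fraction of GPU memory reserved for model weights + KV cache;
# raise for more concurrent requests, lower if activations need headroom
  --mem-fraction-static 0.90 \

# Bound activation memory on long prompts by processing prefill in chunks
  --chunked-prefill-size 8192
```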
CPU Affinity
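Binding the server to the NUMA node closest to its GPUs and NIC avoids cross-socket memory traffic. A sketch (the NUMA node IDs are examples; check `nvidia-smi topo -m` for your layout):

```shell
numactl --cpunodebind=0 --membind=0 \
  python -m sglang.launch_server \
    --model-path /models/large-model --tp 8
```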
Monitoring
NCCL Logs
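NCCL's own logging is the quickest way to confirm which transport is in use:

```shell
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# In the resulting logs, "via NET/IB" confirms the RDMA path;
# "via NET/Socket" means NCCL fell back to TCP
```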
Network Bandwidth
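Measure the link directly between a pair of nodes; the IP address is a placeholder:

```shell
# RDMA path (perftest suite)
ib_write_bw --report_gbits                 # node A (server)
ib_write_bw --report_gbits 192.168.1.10    # node B (client)

# TCP path
iperf3 -s                                  # node A
iperf3 -c 192.168.1.10 -P 8                # node B, 8 parallel streams
```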
GPU Utilization
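Per-GPU utilization and memory can be watched from any node:

```shell
# Streaming utilization and memory stats, one row per GPU per second
nvidia-smi dmon -s um

# One-shot snapshot across all GPUs
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```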
Troubleshooting
NCCL Initialization Failures
Symptoms:
- “NCCL initialization failed”
- Timeout waiting for other nodes
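These usually come down to reachability or interface selection. Diagnostic steps to try from each non-master node (the address is a placeholder):

```shell
# Confirm the coordination endpoint is reachable
nc -zv 192.168.1.10 20000

# Confirm which interface carries the route to the master
ip route get 192.168.1.10

# If NCCL picked the wrong interface, force the correct one
export NCCL_SOCKET_IFNAME=eth0
```

Also check that firewall rules allow the coordination port in both directions.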
RDMA Errors
Symptoms:
- “ibv_create_qp failed”
- “RDMA connection refused”
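Queue-pair creation commonly fails when the locked-memory limit is too low, especially in containers. Checks to run on each node:

```shell
# Should print "unlimited" (or a very large value)
ulimit -l

# Raise it persistently in /etc/security/limits.conf
# (or via the container runtime's --ulimit memlock=-1):
#   * soft memlock unlimited
#   * hard memlock unlimited

# Verify the HCA is visible and its port is Active
ibstat
```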
Model Loading Issues
Symptoms:
- Different model versions on nodes
- Checksum mismatch
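Run the same digest on every node and compare the results; the weights path is a placeholder:

```shell
# Digest-of-digests over the weight files; the final hash must match across nodes
sha256sum /models/large-model/*.safetensors | sort | sha256sum
```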
Out of Memory
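Remediation usually means reducing the static reservation or shrinking the weights. A fragment sketch of flags to try (values are illustrative; --quantization fp8 requires FP8-capable hardware):

```shell
# Leave more headroom for activations and reduce prefill peaks
  --mem-fraction-static 0.80 --chunked-prefill-size 4096 \

# Or roughly halve weight memory with FP8 quantization
  --quantization fp8
```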
Slow Performance
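Slow multi-node throughput is most often a transport problem: NCCL silently falling back to TCP, or GPU-to-NIC paths crossing the CPU. Two quick checks:

```shell
# GPU-to-NIC paths should show PIX or PXB, not SYS (which crosses the CPU)
nvidia-smi topo -m

# With NCCL_DEBUG=INFO set, the server logs should report "via NET/IB";
# "via NET/Socket" indicates TCP fallback
```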
Best Practices
- Use InfiniBand/RoCE: Essential for multi-node at scale
- Enable hostNetwork: Reduces latency in containerized environments
- Set privileged mode: Required for RDMA device access
- Synchronize clocks: Use NTP to avoid timeout issues
- Test incrementally: Validate 2 nodes before scaling to more
- Monitor NCCL: Keep NCCL_DEBUG=INFO in production
- Use static IPs: Avoid DNS resolution delays
- Verify topology: Run nvidia-smi topo -m on all nodes
Example Configurations
4-Node Llama 405B (FP16)
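A sketch of a 4-node launch (32 GPUs, tp 32); the model path and master address are placeholders, and FP16 405B weights alone need roughly 810 GB spread across the ranks:

```shell
# Node 0 of 4 — repeat on the other nodes with --node-rank 1, 2, 3
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp 32 \
  --dist-init-addr 192.168.1.10:20000 \
  --nnodes 4 \
  --node-rank 0
```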
2-Node DeepSeek-V3
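A full two-node sketch combining the base multi-node flags with the MoE parameters from the table earlier; the master address is a placeholder:

```shell
# Node 0 of 2 — node 1 uses --node-rank 1
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  --tp 16 --ep 16 --dp-size 16 \
  --moe-a2a-backend deepep \
  --enable-dp-attention --enable-dp-lm-head \
  --ep-num-redundant-experts 32 \
  --dist-init-addr 192.168.1.10:20000 \
  --nnodes 2 --node-rank 0
```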
Next Steps
- Kubernetes Deployment - Orchestrate multi-node on K8s
- Cloud Platforms - Deploy across cloud regions
- Docker Deployment - Containerize multi-node setups
