## Overview
Tensor Parallelism (TP) is the most common parallelism strategy for LLM inference, where model weights are distributed across multiple GPUs within a single node. Each GPU holds a portion of each layer's parameters, enabling models to scale beyond a single GPU's memory capacity.

## How It Works
In tensor parallelism:

- Model weights are sharded across multiple GPUs
- Each GPU computes a portion of each layer’s output
- All-reduce operations synchronize results across GPUs
- All GPUs process the same batch of requests
## Key Characteristics
- Best suited for intra-node scaling (GPUs connected via NVLink/PCIe)
- Requires high-bandwidth communication for all-reduce operations
- Works well for models with standard attention mechanisms (GQA, MHA)
- Memory efficient: Each GPU stores only a portion of model weights
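As a rough sizing sketch of the memory savings (a hypothetical 70B-parameter model in FP16 with TP = 8; KV cache and activation memory are ignored here):

```latex
\text{weights per GPU} \approx \frac{P \cdot b}{\text{TP}}
  = \frac{70 \times 10^{9} \times 2\,\text{bytes}}{8}
  \approx 17.5\,\text{GB}
```

The full FP16 weights (~140 GB) would not fit on any single GPU, but the per-GPU shard fits comfortably on an 80 GB device with room left for KV cache.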
## When to Use Tensor Parallelism
Use TP when:

- Model doesn't fit on a single GPU
- You have multiple GPUs in a single node with fast interconnects
- Working with standard attention models (Llama, Qwen, Mistral, etc.)
- You need low latency for small batch sizes
Consider alternatives when:

- Using MLA-based models (DeepSeek, MiniMax) → use Data Parallelism Attention
- Scaling across multiple nodes → Use Pipeline Parallelism
- Working with MoE models → Combine with Expert Parallelism
## Configuration

### Basic Setup
Enable tensor parallelism with the `--tp` flag.
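For example, to shard a model across 4 GPUs on one node (the model path is illustrative):

```shell
# Serve one model across 4 GPUs on a single node
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4
```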
### Multi-Node Tensor Parallelism
To run TP across multiple nodes, launch the server on every node with the same `--tp` size, setting `--nnodes`, `--node-rank`, and `--dist-init-addr`. If initialization hangs, try `--disable-cuda-graph`.
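A two-node sketch (the IP address and model path are illustrative; run one command per node):

```shell
# Node 0 (master)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 16 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000

# Node 1
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 16 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000
```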
### Peer-to-Peer Access
If you encounter the error “peer access is not supported between these two devices”, enable P2P checking with `--enable-p2p-check`.
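For example (model path illustrative):

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2 --enable-p2p-check
```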
## Combining with Other Parallelism

### TP + Data Parallelism
Combine TP with DP for models that fit across multiple GPUs but need higher throughput.
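A sketch using the `--dp` flag as in recent SGLang releases (model path illustrative): 2-way TP replicated into 2 data-parallel groups across 4 GPUs.

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2 --dp 2
```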
### TP + Expert Parallelism (MoE Models)

For Mixture-of-Experts models, combine TP with EP.
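A sketch assuming the `--ep-size` flag of recent releases (check `--help` for your version; model path illustrative):

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B \
  --tp 8 --ep-size 8
```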
### TP + Pipeline Parallelism

For very large models with long contexts, combine TP with PP.
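A sketch assuming the `--pp-size` flag of recent releases (model path and address illustrative; shown for the master node only, with a matching `--node-rank 1` launch on the second node):

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp 8 --pp-size 2 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
```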
## Communication Backends

SGLang supports multiple communication backends for all-reduce operations.

### Custom All-Reduce (Default)
An optimized all-reduce implementation for NVIDIA GPUs:

- Automatically enabled for supported architectures
- Falls back to NCCL for unsupported tensor sizes
- Disable with `--disable-custom-all-reduce`
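To force the NCCL path instead (model path illustrative):

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 4 --disable-custom-all-reduce
```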
### PyNccl
A low-level NCCL wrapper for optimized GPU communication:

- Used for CUDA graph mode
- Supports symmetric memory allocation
### Hardware-Specific Backends
AMD (ROCm) GPUs use a ROCm-specific all-reduce implementation.

## Performance Tuning
### Memory Management
Control KV cache memory allocation with `--mem-fraction-static`; lower it if you encounter OOM errors.
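For example, reserving a smaller fraction of GPU memory for weights and KV cache (model path illustrative):

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 --mem-fraction-static 0.8
```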
### Attention Backend
Select the optimal attention implementation with `--attention-backend` (e.g. `flashinfer`, `triton`, `fa3`, `torch_native`).
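For example (backend availability depends on your hardware and installed kernels; model path illustrative):

```shell
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2 --attention-backend flashinfer
```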
### Deterministic All-Reduce

For reproducible results on AMD GPUs, enable the deterministic all-reduce mode.
## Troubleshooting

### Deadlock During Initialization
**Symptom:** Server hangs during model loading with multi-node TP.

**Solution:** Launch with `--disable-cuda-graph`.
### P2P Access Errors

**Symptom:** “peer access is not supported between these two devices”.

**Solution:** Launch with `--enable-p2p-check`.
### OOM Errors

**Symptom:** Out of memory during serving.

**Solution:** Lower `--mem-fraction-static` (e.g. from 0.9 to 0.7).
### Communication Overhead

**Symptom:** Poor throughput with multi-node TP.

**Solution:** Consider Pipeline Parallelism for cross-node deployments.
## Configuration Summary

| Parameter | Description | Default | Recommended Values |
|---|---|---|---|
| `--tp` | Tensor parallel size | 1 | Power of 2 (2, 4, 8) |
| `--nnodes` | Number of nodes | 1 | 1-4 for TP |
| `--dist-init-addr` | Master node address | None | `<IP>:29500` |
| `--mem-fraction-static` | KV cache memory fraction | 0.9 | 0.7-0.9 |
| `--enable-p2p-check` | Check GPU P2P support | False | Enable if needed |
| `--disable-cuda-graph` | Disable CUDA graphs | False | Enable for debugging |
## Best Practices
- Start with single-node TP before scaling to multiple nodes
- Use power-of-2 TP sizes (2, 4, 8) for optimal performance
- Monitor GPU utilization to ensure balanced workloads
- Test P2P connectivity before production deployments
- Consider alternatives for MLA models and MoE architectures
## Related Documentation
- Data Parallelism - For higher throughput with replicas
- Expert Parallelism - For MoE models
- Pipeline Parallelism - For multi-node scaling
- Server Arguments - Complete argument reference
