Overview
TensorRT-LLM supports multiple parallelism strategies for scaling inference:
- Tensor Parallelism (TP): Split model weights across GPUs
- Pipeline Parallelism (PP): Split layers across GPUs
- Expert Parallelism (EP): Split experts in MoE models
- Context Parallelism (CP): Split long sequences across GPUs
- Disaggregated Serving: Separate prefill and decode phases
Tensor Parallelism
Split model layers horizontally across multiple GPUs. Best for models that don't fit on a single GPU.
Single-Node Multi-GPU
Tensor parallelism requires GPUs on the same node with fast interconnects (NVLink/NVSwitch).
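As a sketch of what a tensor-parallel setup implies (the keyword name `tensor_parallel_size`, the model ID, and the head count below are illustrative assumptions, not verbatim from this page), note that the attention head count must divide evenly by the TP degree:

```python
# Sketch: validating a tensor-parallel degree before launch.
# Head count and model ID are illustrative assumptions.
num_attention_heads = 64          # e.g. a Llama-70B-class model
tp_size = 4                       # one weight shard per GPU on a 4-GPU node

# TP shards each layer's weight matrices across GPUs,
# so attention heads must divide evenly by the TP degree.
assert num_attention_heads % tp_size == 0

llm_kwargs = {
    "model": "meta-llama/Llama-3.1-70B",  # hypothetical model ID
    "tensor_parallel_size": tp_size,
}
```

In a real deployment these kwargs would be passed to the LLM API; check your installed release for the exact argument names.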
Communication Backends
TensorRT-LLM supports multiple orchestrators for multi-GPU communication:
- MPI (Default)
- Ray
- RPC
The default. Uses MPI for inter-GPU communication and offers the best performance.
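Where MPI is unavailable, the troubleshooting section on this page mentions an `orchestrator_type="ray"` option; a hedged sketch of selecting it (exact API surface may differ by release):

```python
# Sketch: choosing a non-default orchestrator.
# The `orchestrator_type` name is taken from the troubleshooting notes;
# treat the rest of this dict as illustrative.
llm_kwargs = {
    "model": "meta-llama/Llama-3.1-70B",   # illustrative model ID
    "tensor_parallel_size": 2,
    "orchestrator_type": "ray",            # use Ray instead of the MPI default
}
```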
Pipeline Parallelism
Split model layers vertically across GPUs. Each GPU processes a subset of layers.
Hybrid Parallelism
Combine tensor and pipeline parallelism:
Expert Parallelism (MoE Models)
For Mixture-of-Experts models like Mixtral or DeepSeek-V3:
Total GPUs: tp_size × ep_size. For Mixtral-8x7B with tp_size=2 and ep_size=4: 2 × 4 = 8 GPUs.
Context Parallelism
Split long sequences across GPUs using ring attention or Ulysses:
Context Parallelism Types
| Type | Description | Use Case |
|---|---|---|
| ULYSSES | Split sequence dimension | Long sequences (>32K tokens) |
| RING | Ring attention | Very long sequences (>128K tokens) |
| STAR | Star attention | Extreme lengths (>1M tokens) |
| HELIX | Helix parallelism | MoE + context parallelism |
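As an illustration of the sequence-dimension split (Ulysses-style), here is a minimal sketch of how ranks could divide a long prompt; the helper function and the numbers are hypothetical, not TensorRT-LLM internals:

```python
# Sketch: contiguous per-rank shards of the sequence dimension.
def shard_sequence(seq_len: int, cp_size: int) -> list[range]:
    """Divide token positions [0, seq_len) into cp_size contiguous slices."""
    chunk = (seq_len + cp_size - 1) // cp_size  # ceiling division
    return [
        range(r * chunk, min((r + 1) * chunk, seq_len))
        for r in range(cp_size)
    ]

# A 128K-token prompt split across 4 context-parallel ranks.
shards = shard_sequence(131_072, 4)
```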
Multi-Node Deployment
Prerequisites
Slurm Deployment
trtllm-llmapi-launch handles MPI process spawning and GPU assignment automatically.
Manual MPI Launch
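If you need to launch without trtllm-llmapi-launch, a manual invocation might look like the following; the host names, process counts, and model ID are illustrative assumptions for a 2-node, 16-GPU cluster, not verified commands:

```bash
# Hypothetical manual MPI launch; adapt hosts and flags to your cluster.
mpirun -np 16 -H node1:8,node2:8 \
  trtllm-serve meta-llama/Llama-3.1-405B \
    --tp_size 8 --pp_size 2 \
    --host 0.0.0.0 --port 8000
```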
Disaggregated Serving
Separate prefill (context) and decode (generation) phases onto different GPU pools for independent optimization.
Why Disaggregated Serving?
Optimize TTFT
Dedicate GPUs to prefill with high parallelism for fast Time-to-First-Token
Optimize TPOT
Dedicate GPUs to decode with batching for low Time-Per-Output-Token
Prevent Interference
Prefill doesn’t delay token generation
Different GPU Types
Use H100 for prefill, L40 for decode
Architecture
Setup with trtllm-serve
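A minimal disaggregated layout, sketched as the kind of YAML config a disaggregated trtllm-serve deployment consumes; the field names, ports, and instance counts here are assumptions for illustration, not verbatim from this page:

```yaml
# Hypothetical disaggregated config: one prefill pool, one decode pool.
hostname: localhost
port: 8000
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
```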
KV Cache Exchange Backends
- NIXL (Recommended)
- UCX
- MPI
Default backend with dynamic scaling support. Configure the underlying protocol:
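The env var named in the troubleshooting section on this page can select NIXL's underlying transport; for example (the `UCX` value is an assumption for illustration):

```bash
# Select the protocol NIXL uses underneath (value assumed for illustration).
export TRTLLM_NIXL_KVCACHE_BACKEND=UCX
```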
Client Usage
- Routes request to context server (prefill)
- Transfers KV cache to generation server
- Generation server produces tokens
- Returns unified response
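Since trtllm-serve exposes an OpenAI-compatible endpoint, a client request to the disaggregated front end might look like this (the port and model name are illustrative assumptions):

```bash
# Hypothetical request; the front end handles prefill/decode routing internally.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B", "prompt": "Hello", "max_tokens": 32}'
```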
Performance Tuning
Overlap Scheduler (PyTorch)
Enable compute/communication overlap for multi-GPU runs. Can improve throughput by 10-15% for TP ≥ 2.
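A sketch of turning the overlap scheduler on through LLM-API-style keyword arguments; the flag name `disable_overlap_scheduler` is an assumption and may differ by release:

```python
# Sketch: overlap scheduler kwargs (flag name assumed, not verified here).
llm_kwargs = {
    "model": "meta-llama/Llama-3.1-70B",   # illustrative model ID
    "tensor_parallel_size": 4,
    "disable_overlap_scheduler": False,    # keep compute/comm overlap enabled
}

# Overlap mainly pays off once there is TP communication to hide.
assert llm_kwargs["tensor_parallel_size"] >= 2
```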
Attention Data Parallelism
Enable for models with TP:
NCCL Optimization
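The NCCL knobs this page's troubleshooting section relies on can also be set proactively while tuning; a sketch (the interface name is an assumption for your cluster):

```bash
# Verbose NCCL logging while tuning; disable once the setup is stable.
export NCCL_DEBUG=INFO
# Pin NCCL to the fast fabric's interface (name varies per cluster).
export NCCL_SOCKET_IFNAME=eth0
# Keep InfiniBand enabled (set to 1 only if IB is genuinely absent).
export NCCL_IB_DISABLE=0
```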
Examples
Llama-70B on 4 GPUs
Llama-405B on 8 GPUs (Hybrid)
DeepSeek-V3 on 16 GPUs (2 Nodes)
Mixtral-8x7B with Expert Parallelism
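The four example layouts above imply the following GPU counts, under the assumption that world size is the product of the parallel degrees; the per-example degrees are assumptions chosen to be consistent with the headings (e.g. Mixtral's tp=2, ep=4 matches the 2 × 4 = 8 arithmetic earlier on this page):

```python
# World size = tp * pp * ep (degrees not used default to 1).
# The specific degree choices below are illustrative assumptions.
examples = {
    "Llama-70B":    {"tp": 4, "pp": 1, "ep": 1},  # 4 GPUs, pure TP
    "Llama-405B":   {"tp": 4, "pp": 2, "ep": 1},  # 8 GPUs, hybrid TP+PP
    "DeepSeek-V3":  {"tp": 8, "pp": 2, "ep": 1},  # 16 GPUs across 2 nodes
    "Mixtral-8x7B": {"tp": 2, "pp": 1, "ep": 4},  # 8 GPUs, TP x EP
}
world = {m: d["tp"] * d["pp"] * d["ep"] for m, d in examples.items()}
```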
Troubleshooting
MPI initialization fails
Error: MPI_Init failed
Solutions:
- Ensure MPI is installed: mpirun --version
- Use the Ray orchestrator: orchestrator_type="ray"
- Set: export TLLM_DISABLE_MPI=1
NCCL errors
Error: NCCL error: unhandled system error
Solutions:
- Check NCCL version: python -c "import torch; print(torch.cuda.nccl.version())"
- Enable debug logging: export NCCL_DEBUG=INFO
- Disable InfiniBand if not available: export NCCL_IB_DISABLE=1
Disaggregated KV cache transfer fails
Error: Failed to transfer KV cache
Solutions:
- Increase max_tokens_in_buffer in the config
- Try a different backend: NIXL → UCX → MPI
- Check network connectivity between context and generation servers
- Verify the TRTLLM_NIXL_KVCACHE_BACKEND env var
Pipeline parallelism low throughput
Symptoms: Low GPU utilization with PP
Solutions:
- Prefer tensor parallelism over pipeline parallelism
- Increase max_batch_size to fill pipeline bubbles
- Use hybrid TP+PP only for very large models
Best Practices
Choose parallelism strategy
- Single GPU: No parallelism
- 2-8 GPUs: Tensor parallelism
- >8 GPUs: Hybrid TP + PP
- MoE models: Expert parallelism
- Long sequences: Context parallelism
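The decision rules above can be captured in a tiny helper; this is purely illustrative, with thresholds taken directly from the list:

```python
def choose_strategy(num_gpus: int, moe: bool = False, long_context: bool = False) -> str:
    """Map the rules of thumb above onto a strategy name."""
    if moe:
        return "expert parallelism"
    if long_context:
        return "context parallelism"
    if num_gpus <= 1:
        return "no parallelism"
    if num_gpus <= 8:
        return "tensor parallelism"
    return "hybrid TP + PP"
```

In practice these strategies compose (e.g. EP is combined with TP, as in the Mixtral example above), so treat the helper as a starting point, not a complete planner.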
Monitor communication overhead
Check iteration latency in the /metrics endpoint. High latency indicates a communication bottleneck.
Next Steps
Production Guide
Production deployment best practices
Benchmarking
Measure distributed performance
Reference Configs
170+ optimized configurations
Disaggregated Examples
Complete disaggregated serving examples