Overview
Tensor Parallelism splits model layers across multiple GPUs, allowing:
- Large Model Serving: Deploy models like Llama-3.1-70B that don’t fit on a single GPU
- Increased Throughput: Distribute computation across GPUs for better performance
- Efficient Memory Usage: Utilize memory from multiple GPUs simultaneously
Requirements
Hardware Requirements
- Multiple GPUs: 2, 4, or 8 GPUs (must be a power of 2)
- NVLink: GPUs should be connected via NVLink for optimal performance
- VRAM: Combined memory must accommodate the model and KV cache
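The VRAM bullet can be sanity-checked with back-of-the-envelope arithmetic: FP16/BF16 weights take roughly 2 bytes per parameter, before counting KV cache and activations.

```shell
# Rough weight-memory estimate at FP16: params (in billions) * 2 ≈ GB.
# A 70B model needs ~140 GB for weights alone, so it cannot fit on one
# 80 GB GPU, while 2x80 GB fits with room left for the KV cache.
echo $(( 70 * 2 ))   # prints 140
```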
Software Requirements
- CUDA Toolkit: Version matching your driver (check with nvidia-smi)
- NCCL: Automatically included with PyTorch
- Linux: Required for CUDA kernel support
Launching with Tensor Parallelism
Launch with the --tp flag
Specify the number of GPUs with --tp; for example, --tp 4 distributes the model across 4 GPUs.
Configuration Examples
2-GPU Setup (Llama-3.1-8B)
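A launch sketch for this setup; the entrypoint module and --model-path flag are assumptions modeled on the upstream SGLang CLI (only --tp is documented here), so adjust them to your install:

```shell
# Sketch (entrypoint and --model-path assumed): Llama-3.1-8B across 2 GPUs.
python -m minisglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2
```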
4-GPU Setup (Qwen-3-32B)
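The same sketch scaled to 4 GPUs (entrypoint and --model-path are again assumptions; --tp is from this guide):

```shell
# Sketch: Qwen3-32B split across 4 GPUs with tensor parallelism.
python -m minisglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --tp 4
```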
8-GPU Setup (Llama-3.1-405B)
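And for the largest variant, 8 GPUs (entrypoint and --model-path assumed as above):

```shell
# Sketch: Llama-3.1-405B requires the combined memory of a full 8-GPU node.
python -m minisglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp 8
```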
Advanced Configuration
Custom Port
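Assuming the server exposes a --port flag like upstream SGLang (which defaults to 30000), a custom-port launch might look like:

```shell
# --port (and --model-path) are assumptions based on the upstream SGLang CLI.
python -m minisglang.launch_server --model-path Qwen/Qwen3-32B --tp 4 --port 8000
```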
With Custom Cache Strategy
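The exact cache-strategy flag is not documented here; as an illustrative sketch only, upstream SGLang disables prefix (radix) caching with --disable-radix-cache, which this project may or may not mirror:

```shell
# --disable-radix-cache is an assumption borrowed from upstream SGLang.
python -m minisglang.launch_server --model-path Qwen/Qwen3-32B --tp 4 --disable-radix-cache
```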
Attention Backend Selection
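Upstream SGLang selects the attention kernel with --attention-backend (values such as flashinfer, triton, torch_native); assuming this project mirrors that flag:

```shell
# --attention-backend and its values are assumptions from upstream SGLang.
python -m minisglang.launch_server --model-path Qwen/Qwen3-32B --tp 4 --attention-backend flashinfer
```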
Performance Tuning
Chunked Prefill
Adjust the max prefill length for better memory efficiency with --max-prefill-length.
CUDA Graph Optimization
Set the maximum batch size for CUDA graph capture with --cuda-graph-max-bs.
Page Size Configuration
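Putting the tuning knobs together in one sketch: the values below are illustrative starting points, and --page-size is an assumption modeled on upstream SGLang (--max-prefill-length and --cuda-graph-max-bs come from this guide):

```shell
# Illustrative values, not recommendations; entrypoint/--model-path assumed.
python -m minisglang.launch_server \
  --model-path Qwen/Qwen3-32B --tp 4 \
  --max-prefill-length 4096 \
  --cuda-graph-max-bs 64 \
  --page-size 16   # assumption: page-size flag as in upstream SGLang
```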
Benchmark Example
From the Mini-SGLang benchmarks, deploying Qwen3-32B on 4×H200 GPUs:
- Hardware: 4×H200 GPUs connected by NVLink
- Model: Qwen3-32B
- Dataset: Qwen trace (1000 requests)
Troubleshooting
NCCL Timeout Errors
Increase NCCL timeout:
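The exact knob depends on your PyTorch version; a common sketch is to make NCCL collectives fail fast and raise the watchdog timeout via environment variables (the names below follow PyTorch 2.x conventions and should be verified against your version):

```shell
# Assumptions: PyTorch >= 2.1 variable names (older releases drop the TORCH_ prefix).
export TORCH_NCCL_BLOCKING_WAIT=1                # surface collective errors instead of hanging
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800     # extend the NCCL watchdog timeout
```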
Out of Memory
- Reduce --max-prefill-length
- Reduce --cuda-graph-max-bs
- Increase the number of GPUs with --tp
- Use a smaller model variant
Slow Performance
- Verify GPUs are connected via NVLink: nvidia-smi topo -m
- Check GPU utilization: nvidia-smi dmon
- Ensure no other processes are using the GPUs
Model Not Splitting
Verify the TP degree matches your available GPUs:
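A quick check with nvidia-smi: the --tp value must be no larger than the number of GPUs this prints.

```shell
# List visible GPUs and count them; --tp must not exceed this number.
nvidia-smi -L | wc -l
```

If CUDA_VISIBLE_DEVICES is set, only the devices it names count toward the total.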