Overview

Tensor Parallelism (TP) is the most common parallelism strategy for LLM inference: model weights are distributed across multiple GPUs, typically within a single node. Each GPU holds a portion of each layer’s parameters, enabling models to scale beyond a single GPU’s memory capacity.

How It Works

In tensor parallelism:
  • Model weights are sharded across multiple GPUs
  • Each GPU computes a portion of each layer’s output
  • All-reduce operations synchronize results across GPUs
  • All GPUs process the same batch of requests
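The four steps above can be sketched numerically. Below is a minimal 2-way toy example, assuming Megatron-style column/row sharding of an MLP block (illustrative only, not SGLang's actual kernels): each "GPU" computes a partial output locally, and a single sum plays the role of the all-reduce.

```python
import numpy as np

# Toy 2-way tensor parallelism for an MLP block: Y = relu(X @ A) @ B.
# A is split by columns across ranks, B by rows; the per-rank partial
# outputs are summed, which is exactly what the all-reduce does.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # activations (replicated on every rank)
A = rng.standard_normal((8, 16))     # first weight matrix
B = rng.standard_normal((16, 8))     # second weight matrix

relu = lambda t: np.maximum(t, 0)
reference = relu(X @ A) @ B          # single-GPU result

tp = 2
A_shards = np.split(A, tp, axis=1)   # column-parallel shards of A
B_shards = np.split(B, tp, axis=0)   # row-parallel shards of B

# Each rank computes a partial output with no communication...
partials = [relu(X @ A_shards[i]) @ B_shards[i] for i in range(tp)]
# ...and one all-reduce (here: a plain sum) reconstructs the full output.
output = sum(partials)

assert np.allclose(output, reference)
```

The element-wise ReLU commutes with the column split, which is why only one all-reduce per MLP block is needed rather than one per matmul.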

Key Characteristics

  • Best suited for intra-node scaling (GPUs connected via NVLink/PCIe)
  • Requires high-bandwidth communication for all-reduce operations
  • Works well for models with standard attention mechanisms (GQA, MHA)
  • Memory efficient: Each GPU stores only a portion of model weights
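The last bullet is easy to quantify. A back-of-envelope sketch, counting weights only (illustrative numbers; real usage adds KV cache, activations, and CUDA graph buffers):

```python
# Per-GPU weight memory under tensor parallelism (weights only).
def weight_mem_per_gpu_gb(n_params: float, bytes_per_param: int, tp: int) -> float:
    return n_params * bytes_per_param / tp / 1024**3

# Llama-3.1-70B in bf16 (2 bytes/param):
full = weight_mem_per_gpu_gb(70e9, 2, tp=1)   # ~130 GB: exceeds any single 80 GB GPU
tp4  = weight_mem_per_gpu_gb(70e9, 2, tp=4)   # ~33 GB per GPU: fits with room for KV cache
```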

When to Use Tensor Parallelism

Use TP when:
  • Model doesn’t fit on a single GPU
  • You have multiple GPUs in a single node with fast interconnects
  • Working with standard attention models (Llama, Qwen, Mistral, etc.)
  • You need low latency for small batch sizes
Consider alternatives when:
  • The model uses an MoE architecture or MLA attention (see Expert Parallelism)
  • You are scaling across nodes with limited interconnect bandwidth (see Pipeline Parallelism)

Configuration

Basic Setup

Enable tensor parallelism with the --tp flag:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4
This distributes the model across 4 GPUs on a single node.

Multi-Node Tensor Parallelism

To run TP across multiple nodes:
# Node 0 (Master)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr <MASTER_NODE_IP>:29500

# Node 1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 1 \
  --dist-init-addr <MASTER_NODE_IP>:29500
Important: Multi-node TP requires fast interconnects (InfiniBand, RoCE). If you experience deadlocks, add --disable-cuda-graph.

Peer-to-Peer Access

If you encounter the error “peer access is not supported between these two devices”, enable P2P checking:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --enable-p2p-check

Combining with Other Parallelism

TP + Data Parallelism

Combine TP with DP for models that fit across multiple GPUs but need higher throughput:
python -m sglang_router.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --dp-size 2
This creates 2 replicas, each using 4-way TP (8 GPUs total).
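As a quick sanity check on GPU budgets when combining strategies, the total world size is simply the product of the parallel degrees (an illustrative helper, not an SGLang API):

```python
# World size when combining parallelism strategies: the product of the
# tensor-, data-, and pipeline-parallel degrees.
def world_size(tp: int = 1, dp: int = 1, pp: int = 1) -> int:
    return tp * dp * pp

assert world_size(tp=4, dp=2) == 8    # the TP + DP example above
assert world_size(tp=8, pp=4) == 32   # 8-way TP with 4 pipeline stages
```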

TP + Expert Parallelism (MoE Models)

For Mixture-of-Experts models, combine TP with EP:
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --ep 8 \
  --moe-a2a-backend deepep
See Expert Parallelism for details.

TP + Pipeline Parallelism

For very large models with long contexts:
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1 \
  --tp 8 \
  --pp-size 4 \
  --chunked-prefill-size 4096
See Pipeline Parallelism for details.

Communication Backends

SGLang supports multiple communication backends for all-reduce operations:

Custom All-Reduce (Default)

Optimized all-reduce implementation for NVIDIA GPUs:
  • Automatically enabled for supported architectures
  • Falls back to NCCL for unsupported tensor sizes
  • Disable with --disable-custom-all-reduce

PyNccl

Low-level NCCL wrapper for optimized GPU communication:
  • Used for CUDA graph mode
  • Supports symmetric memory allocation

Hardware-Specific Backends

AMD (ROCm):
# QuickAllReduce for MI300+ GPUs
export SGLANG_USE_1STAGE_ALLREDUCE=0  # 2-stage for large tensors
python -m sglang.launch_server --model-path ... --tp 8
Intel (XPU):
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --device xpu
Huawei Ascend (NPU):
export HCCL_BUFFSIZE=256  # Set HCCL buffer size (MB)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8

Performance Tuning

Memory Management

Control KV cache memory allocation:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.85  # Use 85% of GPU memory for KV cache
Reduce --mem-fraction-static if you encounter OOM errors.
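To pick a value, a rough capacity estimate helps. A hedged sketch, assuming the static fraction covers model weights plus the KV-cache pool (SGLang's exact internal accounting may differ):

```python
# Rough KV-cache token capacity for a given --mem-fraction-static.
# Assumption (not SGLang's exact accounting): the static fraction
# covers model weights plus the KV-cache pool.
def max_kv_tokens(gpu_mem_gb, mem_fraction, weights_per_gpu_gb,
                  layers, kv_heads, head_dim, tp, dtype_bytes=2):
    kv_pool_bytes = (gpu_mem_gb * mem_fraction - weights_per_gpu_gb) * 1024**3
    # K and V per layer, sharded across the tp GPUs along the head dimension
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes / tp
    return int(kv_pool_bytes / bytes_per_token)

# Llama-3.1-70B (80 layers, 8 KV heads, head_dim 128) on 80 GB GPUs, tp=4:
tokens = max_kv_tokens(80, 0.85, 70e9 * 2 / 4 / 1024**3,
                       layers=80, kv_heads=8, head_dim=128, tp=4)
```

With these illustrative numbers the pool holds on the order of a few hundred thousand tokens across all concurrent requests; lowering the fraction shrinks that pool before it causes OOM.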

Attention Backend

Select the optimal attention implementation:
# FlashAttention-3 (recommended for H100/H200)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --attention-backend fa3

# FlashInfer (recommended for A100/A10)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --attention-backend flashinfer

Deterministic All-Reduce

For reproducible results (AMD GPUs):
export SGLANG_USE_1STAGE_ALLREDUCE=1
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --enable-deterministic-inference

Troubleshooting

Deadlock During Initialization

Symptom: Server hangs during model loading with multi-node TP.
Solution: Launch with --disable-cuda-graph:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr <MASTER_NODE_IP>:29500 \
  --disable-cuda-graph

P2P Access Errors

Symptom: “peer access is not supported between these two devices”.
Solution: Enable the P2P check:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --enable-p2p-check

OOM Errors

Symptom: Out of memory during serving.
Solution: Lower --mem-fraction-static:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.7  # Reduce KV cache size
For long prompts, enable chunked prefill:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --chunked-prefill-size 4096

Communication Overhead

Symptom: Poor throughput with multi-node TP.
Solution: Consider Pipeline Parallelism for cross-node deployments:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --pp-size 2 \
  --chunked-prefill-size 4096

Configuration Summary

| Parameter | Description | Default | Recommended Values |
| --- | --- | --- | --- |
| --tp | Tensor parallel size | 1 | Power of 2 (2, 4, 8) |
| --nnodes | Number of nodes | 1 | 1-4 for TP |
| --dist-init-addr | Master node address | None | <IP>:29500 |
| --mem-fraction-static | KV cache memory fraction | 0.9 | 0.7-0.9 |
| --enable-p2p-check | Check GPU P2P support | False | Enable if needed |
| --disable-cuda-graph | Disable CUDA graphs | False | Enable for debugging |

Best Practices

  1. Start with single-node TP before scaling to multiple nodes
  2. Use power-of-2 TP sizes (2, 4, 8) for optimal performance
  3. Monitor GPU utilization to ensure balanced workloads
  4. Test P2P connectivity before production deployments
  5. Consider alternatives for MLA models and MoE architectures