Parallelism across multiple GPUs becomes necessary when the model cannot fit in a single GPU's memory, or when a single GPU cannot deliver the desired performance. TensorRT-LLM supports multiple parallelism strategies for deployment on both single and multiple nodes.

Parallelism Types

Tensor Parallel (TP)

Shards model weights across GPUs. Best for small batch sizes and memory-constrained scenarios.

Pipeline Parallel (PP)

Distributes model layers across GPUs. Best for large models that don’t fit in single GPU memory.

Data Parallel (DP)

Replicates model across GPUs for different requests. Best for large batch sizes and high throughput.

Expert Parallel (EP)

Distributes experts across GPUs for MoE models. Best for models with high expert count.

Context Parallel (CP)

Distributes context processing across GPUs. Best for long context scenarios.

Wide Expert Parallel

Advanced EP with load balancing for large-scale MoE models. Best for DeepSeek-V3/R1, LLaMA4, Qwen3.
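
These strategies compose, and the product of the configured parallelism degrees must match the total GPU count. As an illustrative sketch (not a tuned recommendation), a 16-GPU deployment across two nodes could combine TP and PP:

```yaml
# config.yaml — 2 nodes × 8 GPUs = 16 GPUs total
tensor_parallel_size: 8    # shard weights across the 8 GPUs of a node
pipeline_parallel_size: 2  # split the layer stack across the 2 nodes
```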

Attention Module Parallelism

Tensor Parallelism for Attention

```yaml
# config.yaml
tensor_parallel_size: 8
enable_attention_dp: false  # Use TP for attention (default)
```

```shell
trtllm-serve meta-llama/Llama-3.1-70B-Instruct --config config.yaml
```

Data Parallelism for Attention

```yaml
# config.yaml
tensor_parallel_size: 8
enable_attention_dp: true  # Use DP for attention
```

```shell
trtllm-serve meta-llama/Llama-3.1-70B-Instruct --config config.yaml
```

FFN Module Parallelism

Dense Models

For dense (non-MoE) models, tensor parallelism is supported:

```yaml
# config.yaml
tensor_parallel_size: 8
```

FFN weights are sharded across all GPUs, and the partial results are combined through an all-reduce.

Mixture of Experts (MoE)

MoE models replace the single FFN with multiple experts. TensorRT-LLM supports three execution patterns for the expert weights: pure tensor parallelism, pure expert parallelism, and a hybrid of the two. Pure tensor parallelism, for example:

```yaml
# config.yaml
tensor_parallel_size: 8
moe_tensor_parallel_size: 8
```

How it works:
  • Every expert's weight matrix is sliced across all GPUs
  • Each GPU sees all tokens
  • Higher communication overhead
  • Better load balancing
The product of moe_tensor_parallel_size and moe_expert_parallel_size must equal tensor_parallel_size.
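
Given that constraint, the other two patterns follow by changing how the MoE dimensions are split. A sketch (key names taken from the config above; verify the values against your GPU count):

```yaml
# config.yaml — pure expert parallelism: each GPU holds whole experts,
# and tokens are routed to the GPU that owns the selected expert
tensor_parallel_size: 8
moe_tensor_parallel_size: 1
moe_expert_parallel_size: 8   # 1 × 8 = 8 = tensor_parallel_size

# Hybrid ETP alternative: shard each expert's weights 2 ways within
# groups, with the experts themselves split 4 ways:
# moe_tensor_parallel_size: 2
# moe_expert_parallel_size: 4  # 2 × 4 = 8 = tensor_parallel_size
```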

Wide Expert Parallelism (Wide-EP)

Wide-EP is TensorRT-LLM’s advanced solution for large-scale MoE model inference, addressing workload imbalance through intelligent load balancing.

Motivation

Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, and Qwen3 introduce challenges:
  • High memory demands for expert weights
  • Inherent expert-level workload imbalance due to sparse execution
  • Communication overhead in distributed expert parallelism
  • Hot expert problem where certain experts receive significantly more tokens

Key Features

Wide-EP introduces expert slots decoupled from specific experts:
  • Multiple replicas of hot experts across different GPUs
  • Dynamic expert placement based on workload patterns
  • Both offline and online load balancing strategies
  • Optimized for NVIDIA GB200 Multi-Node NVLink (MNNVL)
  • Efficient all-to-all communication for expert dispatch and combine
  • Reduced communication overhead vs traditional EP
  • Offline EPLB: Pre-computed expert placement based on historical workload statistics
  • Online EPLB: Dynamic expert placement that adapts to real-time traffic patterns
  • Layer-wise weight redistribution to minimize inference disruption

Architecture Overview

Wide-EP separates experts (model perspective) from slots (engine perspective):
```text
Model has:  Expert 0, Expert 1, Expert 2, ..., Expert N
Engine has: Slot 0,   Slot 1,   Slot 2,   ..., Slot M

Routing table: Expert ID → Slot ID (updated by the load balancer)
```
This allows:
  • Same expert to be replicated in multiple slots
  • Dynamic remapping based on workload
  • Load balancing without model retraining
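
In practice this mapping is driven by configuration. The sketch below mirrors the shape of TensorRT-LLM's wide-EP examples, but the exact keys (moe_config, backend, load_balancer, num_slots, layer_updates_per_iter) are assumptions here and should be verified against your release before use:

```yaml
# LLM API extra options (sketch; verify key names for your version)
moe_config:
  backend: WIDEEP                 # assumption: Wide-EP MoE backend name
  load_balancer: moe_load_balancer.yaml

# moe_load_balancer.yaml (sketch)
# num_slots: 288                  # more slots than experts, so hot
#                                 # experts can occupy several slots
# layer_updates_per_iter: 2       # online EPLB: layers re-placed per
#                                 # iteration; omit for offline EPLB
```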

Best Practices

1. Start with offline EPLB: For production deployments with known workload patterns, use offline EPLB to pre-compute optimal expert placement.
2. Use online EPLB for dynamic workloads: When traffic patterns change frequently or are unpredictable, enable online EPLB for real-time adaptation.
3. Monitor expert statistics: Track which experts receive the most tokens to understand workload distribution and validate load-balancing effectiveness.
4. Tune max_num_tokens: Balance memory constraints and EP size by adjusting the maximum number of tokens per expert.
5. Test with representative datasets: Validate load balancing with datasets that match your production workload.
For detailed implementation examples and advanced usage, see the Additional Resources section below.

Practical Configuration Examples

Single Node Deployment

```yaml
# config.yaml — tensor parallelism across 8 GPUs
tensor_parallel_size: 8
enable_attention_dp: false
```

```shell
trtllm-serve meta-llama/Llama-3.1-70B-Instruct --config config.yaml
```

Multi-Node Deployment

```yaml
# config.yaml — cross-node tensor parallelism
tensor_parallel_size: 16
pipeline_parallel_size: 1
```
Requires multi-node orchestration (MPI, Ray, or Slurm).
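
As one illustration, a Slurm launch for the cross-node config above might look like the following sketch. The srun flags and rank layout are assumptions about your cluster, and the supported launch path varies by TensorRT-LLM version, so adapt it accordingly:

```shell
# Sketch: 2 nodes × 8 GPUs, one rank per GPU (flags are illustrative)
srun -N 2 --ntasks-per-node=8 --gpus-per-node=8 --mpi=pmix \
  trtllm-serve meta-llama/Llama-3.1-70B-Instruct --config config.yaml
```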

Benchmarking Parallelism Strategies

Test different parallelism configurations and compare throughput and latency:

```shell
# Test TP-8
trtllm-bench --model meta-llama/Llama-3.1-70B-Instruct \
  throughput \
  --dataset /path/to/dataset.json \
  --config tp8_config.yaml

# Test DP-8 (attention)
trtllm-bench --model meta-llama/Llama-3.1-70B-Instruct \
  throughput \
  --dataset /path/to/dataset.json \
  --config dp8_config.yaml
```

For optimal performance, consult the reference configs database, which contains 170+ Pareto-optimized configurations across multiple models and GPUs.

Performance Tuning Guide

Use TP when:
  • Batch size is small (1-8)
  • Model doesn't fit in single GPU memory
  • Low latency is critical
Use DP when:
  • Batch size is large (16+)
  • Model fits in single GPU memory
  • High throughput is the goal
Use pure TP (MoE) when:
  • Expert count is low (8-16 experts)
  • Load is balanced across experts
  • Maximum kernel efficiency is needed
Use pure EP (MoE) when:
  • Expert count is high (64+ experts)
  • Memory per GPU is limited
  • Communication bandwidth is high
Use Hybrid ETP (MoE) when:
  • Expert count is moderate (16-64 experts)
  • You need to balance the benefits of TP and EP: workload distribution and kernel efficiency
Wide-EP tips:
  • Use it only for large-scale MoE models (DeepSeek-V3, LLaMA4, Qwen3)
  • Monitor expert hit rates to identify hot experts
  • Start with offline EPLB, and migrate to online EPLB if the workload varies
  • Ensure a high-bandwidth interconnect (NVLink, InfiniBand)

Additional Resources

Wide-EP Technical Blog

Deep dive into Wide Expert Parallelism

DeepSeek-V3 Paper

Research paper on large-scale MoE architecture

EPLB Implementation

Expert Parallelism Load Balancer reference

Reference Configs

170+ optimized serving configurations
