## Overview
Expert Parallelism (EP) distributes expert weights across multiple devices in Mixture-of-Experts (MoE) models. This addresses memory bottlenecks in large-scale MoE models, where tokens are dynamically routed to specialized experts across GPUs.

## Key Benefits
- Reduced memory footprint per GPU by sharding expert weights
- Higher throughput with optimized all-to-all communication
- Better scalability for models with 100+ experts
- Load balancing to minimize GPU utilization variance
## When to Use Expert Parallelism
Use EP for:

- Mixture-of-Experts models (DeepSeek, Mixtral, Qwen-MoE)
- Models with 64+ experts that don’t fit on a single GPU
- Large-scale deployments requiring maximum throughput
Example supported models:

- DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
- Mixtral-8x7B, Mixtral-8x22B
- Qwen2-57B-A14B, Qwen3-235B-A22B
## Architecture

### How EP Works
In a typical MoE layer with EP:

1. Token Routing: Each token is routed to its top-K experts based on gating scores
2. All-to-All Dispatch: Tokens are shuffled across GPUs to their assigned experts
3. Expert Computation: Each GPU processes its local expert subset
4. All-to-All Combine: Results are gathered back to their original token positions
## Configuration

### Basic Setup
- `--tp`: Tensor parallel size (intra-node parallelism)
- `--ep`: Expert parallel size (typically equals `tp`)
- `--moe-a2a-backend`: All-to-all communication backend
- `--moe-runner-backend`: Expert computation backend
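A minimal single-node launch combining these flags might look like the following sketch (model path and parallel sizes are illustrative; `--model-path` and `--trust-remote-code` are standard SGLang launch flags):

```shell
# Single-node EP: expert weights sharded across 8 GPUs
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --ep 8 \
  --moe-a2a-backend deepep \
  --trust-remote-code
```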
### Multi-Node Setup
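A two-node sketch, assuming SGLang's standard distributed-launch flags (`--dist-init-addr`, `--nnodes`, `--node-rank`); the address and sizes are illustrative:

```shell
# Node 0 (run the same command on node 1 with --node-rank 1)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --ep 16 \
  --moe-a2a-backend deepep \
  --dist-init-addr 10.0.0.1:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code
```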
## Communication Backends

### All-to-All Backends (`--moe-a2a-backend`)
| Backend | Description | Use Case | Constraints |
|---|---|---|---|
| `none` (default) | Uses All-Reduce/All-Gather | Hybrid EP+TP (`ep < tp`) | |
| `deepep` | DeepEP communication library | Large-scale EP deployments | `ep == tp` |
| `mooncake` | Elastic inference with RDMA | Elastic EP serving | `ep == tp` |
| `mori` | AMD ROCm-optimized all-to-all | AMD GPU deployments | `ep == tp` |
| `flashinfer` | FlashInfer all-to-all | Large-scale EP | |
| `ascend_fuseep` | Ascend NPU fused operator | Ascend NPU (decode only) | `ep == tp` |
### DeepEP Dispatch Modes
DeepEP supports two dispatch modes:

- `normal`: Optimized for prefill workloads (high throughput)
- `low_latency`: Optimized for decode workloads (low latency, CUDA Graph compatible)
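The mode is selected via `--deepep-mode`; a hedged sketch using `auto`, which (per the configuration summary) switches between the two modes by workload:

```shell
# auto: normal for prefill, low_latency for decode
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --trust-remote-code
```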
### MoE Runner Backends (`--moe-runner-backend`)
| Backend | Description | Best For |
|---|---|---|
| `auto` (default) | Auto-selects based on hardware/model | General use |
| `deep_gemm` | DeepGEMM optimized GEMMs | FP8 block-wise quantization |
| `triton` | Triton-based grouped GEMMs | Custom kernel development |
| `cutlass` | CUTLASS-based GEMMs | NVIDIA architectures |
| `flashinfer_trtllm` | FlashInfer + TensorRT-LLM | Blackwell with TRT-LLM |
| `flashinfer_cutlass` | FlashInfer + CUTLASS | Blackwell with FP4/FP8 |
| `flashinfer_mxfp4` | FlashInfer MXFP4 variant | MXFP4 models |
| `flashinfer_cutedsl` | FlashInfer with custom DSL | NVFP4 models |
## Advanced Features

### Two-Batch Overlap (TBO)
TBO splits requests into micro-batches, interleaving attention with dispatch/combine operations:

- Up to 2× throughput improvement
- Hides communication latency behind computation
- No peak memory increase
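TBO is switched on with the `--enable-two-batch-overlap` flag listed in the configuration summary; a sketch with illustrative values:

```shell
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend deepep \
  --enable-two-batch-overlap \
  --trust-remote-code
```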
### Single-Batch Overlap (SBO)
SBO enables overlapping operations within a single batch, e.g., shared-expert computation with dispatch/combine communication.

### Expert Parallelism Load Balancer (EPLB)
EPLB addresses routing imbalances by analyzing expert activation statistics:

- Collects expert activation statistics during inference
- Computes optimal expert arrangement to minimize variance
- Strategically places or replicates experts across GPUs
- Reduces idle cycles and improves load balance
Usage recommendations:

- Increase batch sizes for stable statistics
- Configure periodic rebalancing (e.g., every 1000 requests)
- Monitor load balancedness ratio (mean/max computation time)
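EPLB is enabled with `--enable-eplb`; pairing it with a large request budget helps keep activation statistics stable. A hedged sketch (`--max-running-requests` is a standard SGLang flag, assumed here; values illustrative):

```shell
# Large batches give EPLB stable expert-activation statistics
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend deepep \
  --enable-eplb \
  --max-running-requests 512 \
  --trust-remote-code
```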
## Hardware-Specific Configuration

### NVIDIA GPUs
Standard setup: follow the basic configuration above with `ep == tp`.

### AMD GPUs (ROCm)
The ROCm-optimized `mori` backend supports `normal` dispatch mode currently.
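On AMD hardware, the backend table above points to `mori`; a hedged sketch:

```shell
# AMD ROCm: mori all-to-all backend (requires ep == tp)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend mori \
  --trust-remote-code
```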
### Huawei Ascend NPUs
Prefill instance:

## Combining with Other Parallelism
### EP + TP
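A hybrid EP+TP sketch using the default `none` a2a backend, which per the table above supports `ep < tp` (sizes illustrative):

```shell
# ep < tp: experts sharded 4-way, other weights 8-way
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 4 \
  --trust-remote-code
```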
This is the most common combination.

### EP + DPA (Data Parallelism Attention)
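A DPA sketch using SGLang's `--enable-dp-attention` and `--dp` flags (sizes illustrative):

```shell
# Attention runs data-parallel, MoE layers stay expert-parallel
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --dp 8 --ep 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --trust-remote-code
```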
Recommended for MLA-based MoE models like DeepSeek.

### EP + PP (Pipeline Parallelism)
Pipeline parallelism can be added for very large models.

### EP + Speculative Decoding
Speculative decoding can be combined with EP, including setups where the draft and target models use different precisions.

## Performance Tuning
### Recommended Configuration
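Combining the "Recommended" column of the configuration summary below into a single launch gives roughly this sketch (model and sizes illustrative):

```shell
# Recommended high-throughput EP configuration
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --enable-two-batch-overlap \
  --enable-eplb \
  --trust-remote-code
```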
### Tuning Triton Backend
The `triton` runner backend exposes its grouped-GEMM kernels for custom optimization.

## Extending the EP Framework
SGLang’s EP framework is highly modular and extensible.

### Architecture
### Adding New Backends
For a new all-to-all dispatcher:

1. Implement a `BaseDispatcher` subclass with `dispatch` and `combine` methods
2. Register it via `--moe-a2a-backend`

For a new MoE runner backend:

1. Define a `MoeRunnerCore` subclass for grouped GEMMs
2. Register permute methods:
   - Fused mode (static, torch.compile-compatible): `register_fused_func`
   - Permute mode (dynamic): `register_pre_permute` and `register_post_permute`
3. Register it via `--moe-runner-backend`
## Troubleshooting

### Communication Backend Not Working
Symptom: Error initializing DeepEP/Mooncake

Solution: Check the backend constraints above (these backends require `ep == tp`).

### Poor Load Balance
Symptom: High variance in GPU utilization

Solution: Enable EPLB (`--enable-eplb`) and increase the batch size.

### Low Throughput
Symptom: Lower than expected throughput

Solution: Enable overlap optimizations such as `--enable-two-batch-overlap`.

## Best Practices
- Set ep == tp for DeepEP/Mooncake backends
- Use `--deepep-mode auto` for automatic dispatch mode switching
- Enable TBO for maximum throughput (up to 2× improvement)
- Enable EPLB with large batch sizes for better load balance
- Monitor expert activation patterns to understand routing behavior
- Combine with DPA for MLA-based MoE models
## Configuration Summary
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--ep` | Expert parallel size | 1 | Same as `--tp` |
| `--moe-a2a-backend` | All-to-all backend | `none` | `deepep` |
| `--moe-runner-backend` | MoE computation backend | `auto` | `auto` or `deep_gemm` |
| `--deepep-mode` | DeepEP dispatch mode | `normal` | `auto` |
| `--enable-two-batch-overlap` | Enable TBO | False | Enable for throughput |
| `--enable-eplb` | Enable load balancer | False | Enable with large batches |
## Related Documentation
- Data Parallelism - DPA for MLA models
- Tensor Parallelism - TP fundamentals
- Pipeline Parallelism - Multi-node scaling
- Large-Scale EP Blog - 96 GPU deployment guide
- EPLB Repository - Load balancer details
