Overview
Pipeline Parallelism (PP) distributes model layers across multiple nodes, enabling efficient processing of ultra-long context sequences. Unlike Tensor Parallelism (TP), which requires frequent all-reduce operations, PP communicates only at layer boundaries, achieving better computation-communication overlap in multi-node deployments.

Why Pipeline Parallelism?
As LLMs scale toward trillion-parameter architectures and “infinite” context windows, serving infrastructure must evolve:
- Long context bottleneck: Ultra-long sequences create prohibitive Time To First Token (TTFT)
- Multi-node communication: TP faces bottlenecks when scaling across nodes
- Better overlap: PP communicates only at pipeline stage boundaries
- Chunked prefill: Different chunks can be processed simultaneously across nodes
How It Works
Basic Pipeline Architecture
Dynamic Chunked Prefill
With chunked prefill, long sequences are split into chunks so that multiple chunks can be in flight across pipeline stages simultaneously.

Asynchronous Communication
SGLang implements micro-batching with non-blocking P2P communication:
- Decoupled sync/async logic: Send operations return immediately; synchronization is deferred
- Multi-stream execution: Separate streams for forward pass, data transfers, and result processing
- Overlap computation and communication: While one micro-batch computes, the next prepares
When to Use Pipeline Parallelism
Use PP when:
- Processing ultra-long contexts (64K+ tokens)
- Scaling across multiple nodes (2-8+ nodes)
- Communication bandwidth is limited between nodes
- Working with large models (100B+ parameters)
- Each node has multiple GPUs
- Model layers are too large for single GPU
Configuration
Basic Setup
Single Model - Multi-Node
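A representative two-node launch with PP=2 might look like the following sketch (model path and addresses are placeholders; flags follow SGLang's launch_server CLI):

```shell
# Node 0 (rank 0); repeat on the second node with --node-rank 1.
# /path/to/model and the dist-init address are placeholders.
python -m sglang.launch_server \
  --model-path /path/to/model \
  --pp-size 2 \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000
```

Each node runs the same command with its own --node-rank; the rank-0 node hosts the rendezvous address.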
With Dynamic Chunking
Dynamic chunking automatically adjusts chunk sizes to minimize pipeline bubbles:
- --chunked-prefill-size: Initial chunk size (set larger when using dynamic chunking)
- SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR: Controls chunk size adaptation (0.6-0.85 recommended)
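Assuming the flags above, a launch with dynamic chunking enabled might look like this sketch (model path and address are placeholders):

```shell
# Larger initial chunk, since dynamic chunking shrinks it as the prefix
# grows; smoothing factor within the recommended 0.6-0.85 range.
SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR=0.75 \
python -m sglang.launch_server \
  --model-path /path/to/model \
  --pp-size 2 --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --chunked-prefill-size 16384 \
  --enable-dynamic-chunking
```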
Dynamic Chunking
Why Dynamic Chunking?
Fixed chunk sizes cause pipeline bubbles because:
- Transformer layers have non-uniform running times
- The same chunk size takes longer to process as the prefix grows, since attention cost scales with prefix length
- Bubbles propagate and accumulate across stages
How It Works
Dynamic chunking predicts the optimal size for the next chunk:
- Model cumulative runtime as a quadratic function of sequence length
- Solve for next chunk size given current prefix length L
- Align downward to nearest multiple of max(page-size, 64)
- Apply smoothing factor for stability
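The last three steps can be sketched as a tiny helper (hypothetical, not SGLang's actual implementation; integer arithmetic with the smoothing factor given in percent, and assuming max(page-size, 64) = 64):

```shell
ALIGN=64
smooth_next_chunk() {
  local raw=$1 prev=$2 factor_pct=$3
  # Align the predicted chunk size down to the nearest multiple of ALIGN.
  local aligned=$(( raw / ALIGN * ALIGN ))
  # Smooth toward the previous chunk size:
  #   next = factor * aligned + (1 - factor) * prev
  echo $(( (factor_pct * aligned + (100 - factor_pct) * prev) / 100 ))
}
smooth_next_chunk 9000 8192 75   # predicted 9000 -> aligned 8960 -> smoothed 8768
```

With factor 1.0 (100%) the result is exactly the aligned prediction; with factor 0 the chunk size never changes, matching the fixed-chunking behavior described below.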
Tuning Dynamic Chunking
Step 1: Find the Optimal Fixed Chunk Size
Test different fixed chunk sizes to establish a baseline.

Step 2: Tune the Smoothing Factor
SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR controls how closely chunk sizes follow the prediction:
- 1.0: Follows the prediction model strictly (may create very small tail chunks)
- 0.6-0.85: Recommended range for the best balance
- 0: Disables dynamic adjustment (fixed chunking)
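The Step 1 sweep can be scripted as a simple loop (sketch; model path is a placeholder, and the TTFT measurement is left to your own benchmark):

```shell
# Sweep fixed chunk sizes to find the best baseline before enabling
# dynamic chunking.
for size in 2048 4096 8192 16384; do
  echo "Benchmarking --chunked-prefill-size $size"
  python -m sglang.launch_server \
    --model-path /path/to/model \
    --pp-size 2 \
    --chunked-prefill-size "$size" &
  srv=$!
  # ...run your TTFT benchmark against the server here...
  kill "$srv"; wait "$srv"
done
```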
Layer Partition Optimization
For uneven layer divisions, place larger partitions on higher PP ranks via SGLANG_PP_LAYER_PARTITION.

Case Studies
DeepSeek-V3.1 (128K Context, 4×H20 Nodes)
Fixed Chunking (Baseline):

Qwen3-235B-A22B-FP8 (128K Context, 4×H20 Nodes)
Fixed Chunking:

Note: --disable-radix-cache is for reproducible benchmarking only. Remove it in production.
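A reproducible configuration for a case study like the above might look like this sketch (model path, parallel sizes, and address are illustrative, not the exact benchmark setup):

```shell
# 4 nodes with PP=4 and TP within each node; radix cache disabled for
# reproducible benchmarking only.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1 \
  --pp-size 4 --tp-size 8 \
  --nnodes 4 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --chunked-prefill-size 16384 \
  --enable-dynamic-chunking \
  --disable-radix-cache
```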
Combining with Other Parallelism
PP + TP
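Assuming two nodes with 8 GPUs each, PP across nodes plus TP within each node might be launched as (model path and address are placeholders):

```shell
# PP=2 across nodes, TP=8 within each node (16 GPUs total).
python -m sglang.launch_server \
  --model-path /path/to/model \
  --pp-size 2 --tp-size 8 \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000
```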
This is the most common combination for large models.

PP + TP + EP (MoE Models)
For Mixture-of-Experts models, combine PP with Expert Parallelism.

PP + PD Disaggregation
Combine pipeline parallelism with prefill-decode disaggregation.

Configuration Summary
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| --pp-size | Pipeline parallel size | 1 | 2-8 for multi-node |
| --chunked-prefill-size | Initial chunk size | 8192 | 4K-8K (fixed), 12K-18K (dynamic) |
| --enable-dynamic-chunking | Enable dynamic chunk sizing | False | Enable for 64K+ contexts |
| SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR | Chunk adaptation rate | 0.75 | 0.6-0.85 |
| SGLANG_PP_LAYER_PARTITION | Manual layer distribution | Auto | 15,15,15,16 for uneven |
| --mem-fraction-static | Fraction of GPU memory for weights and KV cache | 0.9 | 0.8 for long contexts |
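Putting the table together, an illustrative long-context launch might look like this sketch (model path and address are placeholders; the layer partition must sum to the model's layer count):

```shell
SGLANG_DYNAMIC_CHUNKING_SMOOTH_FACTOR=0.75 \
SGLANG_PP_LAYER_PARTITION=15,15,15,16 \
python -m sglang.launch_server \
  --model-path /path/to/model \
  --pp-size 4 --tp-size 8 \
  --nnodes 4 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --chunked-prefill-size 16384 \
  --enable-dynamic-chunking \
  --mem-fraction-static 0.8
```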
Performance Tips
- Start with fixed chunking to establish baseline, then enable dynamic
- Use larger initial chunks (2-3× fixed optimal) with dynamic chunking
- Place larger partitions on higher ranks for uneven layer divisions
- Monitor pipeline bubbles using profiling tools
- Adjust smoothing factor based on your workload characteristics
Troubleshooting
High TTFT
Symptom: Long time to first token with long contexts.
Solution: Enable dynamic chunking with an appropriate smoothing factor.

Pipeline Bubbles
Symptom: Low GPU utilization on some pipeline stages.
Solution: Adjust the layer partition via SGLANG_PP_LAYER_PARTITION.

OOM During Long Context
Symptom: Out of memory with very long sequences.
Solution: Reduce the chunk size and lower --mem-fraction-static.

Best Practices
- Use PP for multi-node deployments over pure TP
- Combine with TP within each node for optimal performance
- Enable dynamic chunking for ultra-long contexts (64K+)
- Tune chunk sizes for your specific model and hardware
- Monitor communication overhead between pipeline stages
Related Documentation
- Tensor Parallelism - For intra-node scaling
- Prefill-Decode Disaggregation - For prefill/decode separation
- Expert Parallelism - For MoE models
- Chunked Pipeline Blog - Technical deep dive
