## Overview
Prefill-Decode (PD) Disaggregation separates LLM inference into two specialized instances:

- **Prefill instance**: handles compute-intensive prompt processing
- **Decode instance**: handles memory-intensive token generation
## Why PD Disaggregation?
Traditional unified engines that process prefill and decode together suffer from two key inefficiencies:

### Problem 1: Prefill Interruption

Incoming prefill batches frequently interrupt ongoing decode batches, causing substantial delays in token generation.

### Problem 2: DP Attention Imbalance

In data-parallel attention, one DP worker may process prefill while another simultaneously handles decode, leading to increased decode latency.

### Solution: Disaggregation
With PD disaggregation:

- No prefill interruption of decode batches
- Balanced DP attention workloads
- Independent optimization per phase
- Better resource utilization
## Architecture
### Request Flow
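The lifecycles below imply the following per-request flow; this is a simplified sketch reconstructed from the queue descriptions in this page, not an official diagram:

```text
Client ──► Prefill instance                 Decode instance
              │ 1. bootstrap handshake   ──►  pre-allocate KV cache slots
              │ 2. prefill forward pass
              │ 3. KV cache transfer     ──►  poll until transfer completes
              │                               4. decode forward passes
Client ◄─────────────────────────────────────  generated tokens
```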
### Prefill Instance Lifecycle
1. **Bootstrap Queue**:
   - Initialize a sender for each request
   - Handshake with the decode instance
   - Pre-allocate KV cache on the decode side
   - Move to the Waiting Queue once complete

2. **Waiting Queue**:
   - Pop requests for the prefill forward pass
   - Process through the model
   - Move to the Inflight Queue

3. **Inflight Queue**:
   - Poll the transfer status (non-blocking)
   - Return the request once the KV cache transfer completes
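The queue progression above can be sketched as a small state machine. This is a minimal illustration of the control flow, assuming simplified stand-ins for the handshake, forward pass, and transfer poll; the class and field names are illustrative, not SGLang's actual implementation:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    rid: int
    handshake_done: bool = False   # decode side has pre-allocated KV slots
    prefill_done: bool = False     # prefill forward pass finished
    transfer_done: bool = False    # KV cache has arrived at the decode instance


class PrefillScheduler:
    def __init__(self):
        self.bootstrap = deque()   # waiting on handshake / decode-side prealloc
        self.waiting = deque()     # ready for the prefill forward pass
        self.inflight = deque()    # KV cache transfer in progress
        self.finished = []

    def step(self):
        # Bootstrap Queue: move requests whose handshake has completed.
        for _ in range(len(self.bootstrap)):
            req = self.bootstrap.popleft()
            (self.waiting if req.handshake_done else self.bootstrap).append(req)
        # Waiting Queue: run the prefill forward pass, then start the transfer.
        while self.waiting:
            req = self.waiting.popleft()
            req.prefill_done = True  # stand-in for the model forward pass
            self.inflight.append(req)
        # Inflight Queue: non-blocking poll of transfer status.
        for _ in range(len(self.inflight)):
            req = self.inflight.popleft()
            (self.finished if req.transfer_done else self.inflight).append(req)
```

Note that the inflight poll never blocks: a request whose transfer is still in flight is simply re-queued, so prefill forward passes for other requests can proceed.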
### Decode Instance Lifecycle
1. **Prealloc Queue**:
   - Initialize a receiver for each request
   - Handshake with the prefill instance
   - Pre-allocate KV cache slots
   - Move to the Transfer Queue

2. **Transfer Queue**:
   - Poll the receiver for transfer status
   - Move to the Waiting Queue once the transfer completes

3. **Waiting Queue**:
   - Construct a `PrebuiltExtendBatch`
   - Populate metadata (the prefill forward pass is skipped)

4. **Running Batch**:
   - Merge the resolved batch into the running batch
   - Execute decode forward passes
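The decode side mirrors the prefill side with its own queues. A minimal sketch under the same assumptions as before (illustrative names, simplified stand-ins for the handshake and receiver poll):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    rid: int
    kv_slots_ready: bool = False   # pre-allocation finished after the handshake
    transfer_done: bool = False    # KV cache fully received from prefill


class DecodeScheduler:
    def __init__(self):
        self.prealloc = deque()    # waiting on handshake / KV slot allocation
        self.transfer = deque()    # KV cache transfer in progress
        self.waiting = deque()     # ready to join the running batch
        self.running = []          # the running decode batch

    def step(self):
        # Prealloc Queue: after the handshake, reserve KV cache slots.
        for _ in range(len(self.prealloc)):
            req = self.prealloc.popleft()
            (self.transfer if req.kv_slots_ready else self.prealloc).append(req)
        # Transfer Queue: poll the receiver until the KV cache has landed.
        for _ in range(len(self.transfer)):
            req = self.transfer.popleft()
            (self.waiting if req.transfer_done else self.transfer).append(req)
        # Waiting Queue: build the prebuilt extend batch from the transferred
        # KV cache (no prefill forward pass) and merge it into the running batch.
        while self.waiting:
            self.running.append(self.waiting.popleft())
```

The key point the sketch shows: the decode instance never runs a prefill forward pass; once the transferred KV cache is in place, the request goes straight into decode forward passes.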
## Transfer Backends
SGLang supports multiple KV cache transfer backends:

| Backend | Description | Best For |
|---|---|---|
| Mooncake | RDMA-based high-performance transfers | Multi-node, InfiniBand/RoCE |
| NIXL | UCX/libfabric plugin system | Flexible multi-node |
| Ascend | Huawei Ascend NPU transfers | Ascend NPU deployments |
| Fake | No actual transfer (testing) | Single-node debugging |
## Configuration
### Basic Setup with Mooncake (Single Node)
**Installation**: install the Mooncake transfer engine, then launch each instance with `--disaggregation-mode` and `--disaggregation-transfer-backend mooncake`.

### Multi-Node Setup (DeepSeek-V3)
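A hedged sketch of a two-node launch. The `--disaggregation-*` flags are the ones documented in the summary table below; the model path, IB device name, and the pip package name are placeholders/assumptions to adapt to your deployment:

```shell
# Install the Mooncake transfer engine (package name may differ per release).
pip install mooncake-transfer-engine

# Node 1: prefill instance
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0   # your IB device name

# Node 2: decode instance
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0 \
  --max-running-requests 128
```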
## Transfer Backend Details
### Mooncake
**Features**:

- RDMA-based high-performance transfers
- NVLink support (recommended for NVL72)
- Custom memory pools for optimized transfers
`SGLANG_MOONCAKE_CUSTOM_MEM_POOL` values:

- `NVLINK` (or `True`): NVLink transport
- `BAREX`: BAR expansion
- `INTRA_NODE_NVLINK`: intra-node NVLink
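For example, to enable the NVLink transport pool (a config fragment; set this on the instances before launch):

```shell
# Use Mooncake's NVLink transport pool, e.g. on an NVL72 deployment.
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK
```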
| Variable | Description | Default |
|---|---|---|
| `SGLANG_DISAGGREGATION_THREAD_POOL_SIZE` | Worker threads per TP rank | `int(0.75 * cpu_count) // 8`, clamped to 4–12 |
| `SGLANG_DISAGGREGATION_QUEUE_SIZE` | Parallel transfer queues | 4 |
| `SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT` | Bootstrap timeout (seconds) | 300 |
| `SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL` | Cleanup interval (seconds) | 120 |
| Variable | Description | Default |
|---|---|---|
| `SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL` | Heartbeat interval (seconds) | 5.0 |
| `SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE` | Max consecutive failures | 2 |
| `SGLANG_DISAGGREGATION_WAITING_TIMEOUT` | KV cache wait timeout (seconds) | 300 |
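Putting the variables above together, a tuning sketch (the values are illustrative, not recommendations; the defaults are listed in the tables):

```shell
# Transfer-side tuning
export SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=8     # worker threads per TP rank
export SGLANG_DISAGGREGATION_QUEUE_SIZE=4           # parallel transfer queues
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600  # allow longer prefill queues

# Health / waiting tuning
export SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=5.0
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
```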
### NIXL
**Installation**: install the NIXL plugin. (`--disaggregation-ib-device` can still be used together with `--disaggregation-transfer-backend nixl`.)

**Backend Selection**:
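A hedged sketch of selecting the NIXL backend; the pip package name and model path are assumptions to verify against the NIXL documentation:

```shell
# Install NIXL (package name is an assumption; check the NIXL docs).
pip install nixl

# Select the NIXL backend; --disaggregation-ib-device may still be passed.
python -m sglang.launch_server \
  --model-path <model> \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl
```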
### Ascend NPU
**Requirements**: a Huawei Ascend NPU deployment.

**Option 1**: Memfabric Hybrid

## Combining with Other Parallelism
### PD + TP + DP + EP (Full Stack)
Recommended production setup for DeepSeek-V3.

### PD + Pipeline Parallelism
## Router Integration
SGLang Model Gateway provides load balancing and fault tolerance for PD disaggregation across multiple prefill and decode instances.

## Profiling
Prefill and decode workers can be profiled separately by applying the standard profiling flags to each instance.

## Configuration Summary
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--disaggregation-mode` | Instance mode | None | `prefill` or `decode` |
| `--disaggregation-transfer-backend` | Transfer backend | `mooncake` | `mooncake` or `nixl` |
| `--disaggregation-ib-device` | InfiniBand device | None | your IB device name |
| `--max-running-requests` | Max concurrent requests (decode) | None | 128–256 |
| `SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT` | Bootstrap timeout (seconds) | 300 | 600 for high-TTFT workloads |
| `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` | Custom memory pool | None | `NVLINK` for NVL72 |
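Putting the table together, a hedged example of a decode-instance launch tuned for high-TTFT workloads; the model path and IB device name are placeholders:

```shell
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
python -m sglang.launch_server \
  --model-path <model> \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0 \
  --max-running-requests 256
```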
## Best Practices
- Use Mooncake for multi-node deployments with InfiniBand/RoCE
- Enable NVLink transport for NVL72 deployments
- Set appropriate timeouts based on your TTFT requirements
- Use router for load balancing across multiple instances
- Monitor transfer bandwidth to ensure optimal performance
- Profile instances separately using profiling flags
- Combine with DP attention (DPA) and EP for DeepSeek models
## Troubleshooting
### Transfer Timeout
**Symptom**: requests time out during the KV cache transfer.

**Solution**: increase the timeouts, e.g. `export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600` and `export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600`.

### Bootstrap Connection Failed
**Symptom**: the decode instance can't connect to the prefill bootstrap server.

**Solution**: check network connectivity between the nodes, and verify that the device passed to `--disaggregation-ib-device` is correct on both sides.

### Low Transfer Bandwidth
**Symptom**: slow KV cache transfers.

**Solution**: enable NVLink transport if available, e.g. `export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK`.

### Memory Cleanup Issues
**Symptom**: memory is not released after a decode instance disconnects.

**Solution**: lower the cleanup interval, e.g. `export SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL=60`.

## Performance Tips
- Use RDMA (InfiniBand/RoCE) for multi-node transfers
- Enable NVLink for intra-node high-bandwidth transfers
- Tune thread pool size based on available CPU cores
- Adjust queue size for concurrent transfer batches
- Monitor heartbeat failures to detect network issues early
- Use multiple decode instances with router for high availability
## Related Documentation
- SGLang Model Gateway - Router for PD disaggregation
- Data Parallelism - DPA for MLA models
- Expert Parallelism - EP for MoE models
- Pipeline Parallelism - PP with PD disaggregation
- Benchmark and Profiling - Profiling guide
