
Overview

Prefill-Decode (PD) Disaggregation separates LLM inference into two specialized instances:
  • Prefill instance: Handles computation-intensive prompt processing
  • Decode instance: Handles memory-intensive token generation
This separation eliminates interference between phases and enables tailored optimizations for each.

Why PD Disaggregation?

Traditional unified engines that process prefill and decode together suffer from two key inefficiencies:

Problem 1: Prefill Interruption

Incoming prefill batches frequently interrupt ongoing decode batches, causing substantial delays in token generation.
Unified Engine:
[Decode] [Decode] [Prefill!] ← interrupts → [Wait...] [Decode] [Decode]

              Decode latency spike

Problem 2: DP Attention Imbalance

In data-parallel attention, one DP worker may process prefill while another handles decode simultaneously, leading to increased decode latency.
Unified DP Workers:
Worker 0: [Prefill ----------------]  ← compute-bound
Worker 1: [Decode]  ← waits for Worker 0

Solution: Disaggregation

With PD disaggregation:
Prefill Instance:
[Prefill] [Prefill] [Prefill] [Prefill]  ← continuous prefill processing
    ↓          ↓          ↓          ↓
  Transfer KV cache to decode instance

Decode Instance:
[Decode] [Decode] [Decode] [Decode]  ← uninterrupted token generation
Benefits:
  • No prefill interruption of decode batches
  • Balanced DP attention workloads
  • Independent optimization per phase
  • Better resource utilization

Architecture

Request Flow

Client Request

   Router

Prefill Instance
      ↓ (KV Cache Transfer)
Decode Instance

Generated Tokens → Client

Prefill Instance Lifecycle

  1. Bootstrap Queue:
    • Initialize sender for each request
    • Handshake with decode instance
    • Pre-allocate KV cache on decode side
    • Move to Waiting Queue once complete
  2. Waiting Queue:
    • Pop requests for prefill forward pass
    • Process through model
    • Move to Inflight Queue
  3. Inflight Queue:
    • Non-blocking poll of transfer status
    • Return request once KV cache transfer completes
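The three-queue progression above can be sketched as a tiny state machine. Everything here (`PrefillRequest`, `step_prefill_scheduler`, the instant "success" of the handshake and transfer) is an illustrative Python sketch, not SGLang's actual classes:

```python
from collections import deque

# Hypothetical sketch of the prefill-side queue progression described above.
# Names and the synchronous "success" of each step are illustrative only.

class PrefillRequest:
    def __init__(self, rid):
        self.rid = rid
        self.handshake_done = False
        self.transfer_done = False

def step_prefill_scheduler(bootstrap, waiting, inflight, finished):
    # 1. Bootstrap queue: handshake with the decode side, which pre-allocates
    #    KV cache there; then promote the request to the waiting queue.
    for req in list(bootstrap):
        req.handshake_done = True          # pretend handshake + prealloc succeeded
        bootstrap.remove(req)
        waiting.append(req)
    # 2. Waiting queue: run the prefill forward pass and start the KV transfer.
    while waiting:
        inflight.append(waiting.popleft())  # KV cache transfer now in flight
    # 3. Inflight queue: non-blocking poll; release once the transfer completes.
    for req in list(inflight):
        req.transfer_done = True           # pretend the async transfer finished
        inflight.remove(req)
        finished.append(req)

bootstrap = deque(PrefillRequest(i) for i in range(3))
waiting, inflight, finished = deque(), deque(), deque()
step_prefill_scheduler(bootstrap, waiting, inflight, finished)
print([r.rid for r in finished])  # → [0, 1, 2]
```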

Decode Instance Lifecycle

  1. Prealloc Queue:
    • Initialize receiver for each request
    • Handshake with prefill instance
    • Pre-allocate KV cache slots
    • Move to Transfer Queue
  2. Transfer Queue:
    • Poll receiver for transfer status
    • Move to Waiting Queue once transfer completes
  3. Waiting Queue:
    • Construct PrebuiltExtendBatch
    • Populate metadata (skip prefill forward)
  4. Running Batch:
    • Merge resolved batch into running batch
    • Execute decode forward passes
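A matching sketch for the decode side, with the same caveat: `DecodeRequest`, `tick`, and the stage names are hypothetical stand-ins. The point it illustrates is that once the KV cache arrives, the request enters the running batch with its prefill forward pass skipped:

```python
# Illustrative sketch of the decode-side lifecycle above; not SGLang's real API.

class DecodeRequest:
    def __init__(self, rid):
        self.rid = rid
        self.stage = "prealloc"
        self.kv_received = False

def tick(requests, running_batch):
    for req in requests:
        if req.stage == "prealloc":
            # Handshake with the prefill instance and reserve KV cache slots.
            req.stage = "transfer"
        elif req.stage == "transfer":
            req.kv_received = True        # pretend the poll reported completion
            req.stage = "waiting"
        elif req.stage == "waiting":
            # Build a prebuilt-extend batch from the received KV cache; the
            # prefill forward is skipped because its results arrived over the wire.
            req.stage = "running"
            running_batch.append(req)

reqs = [DecodeRequest(i) for i in range(2)]
running = []
for _ in range(3):                        # three ticks move prealloc → running
    tick(reqs, running)
print([r.rid for r in running])           # → [0, 1]
```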

Transfer Backends

SGLang supports multiple KV cache transfer backends:
Backend    Description                             Best For
Mooncake   RDMA-based high-performance transfers   Multi-node, InfiniBand/RoCE
NIXL       UCX/libfabric plugin system             Flexible multi-node
Ascend     Huawei Ascend NPU transfers             Ascend NPU deployments
Fake       No actual transfer (testing)            Single-node debugging

Configuration

Basic Setup with Mooncake (Single Node)

Installation:
uv pip install mooncake-transfer-engine
Launch servers:
# Prefill instance
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-ib-device mlx5_roce0

# Decode instance
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-ib-device mlx5_roce0

# Router
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 \
  --port 8000
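Once the router is up, clients send requests to port 8000 as if it were a single server; routing between the prefill and decode instances is transparent. A minimal client sketch using only the standard library — the `/generate` endpoint and payload shape follow SGLang's native API, but verify both against the version you deploy:

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8000/generate"

def build_request(prompt, max_new_tokens=64):
    # SGLang's native /generate endpoint takes "text" plus "sampling_params".
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0.7, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires the router and both instances from the commands above to be running.
    with urllib.request.urlopen(build_request("What is PD disaggregation?")) as resp:
        print(json.loads(resp.read())["text"])
```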

Multi-Node Setup (DeepSeek-V3)

# Prefill Node 0
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-ib-device mlx5_roce0 \
  --disaggregation-mode prefill \
  --host 192.168.1.10 \
  --port 30000 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.10:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 16 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --mem-fraction-static 0.8

# Prefill Node 1
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-ib-device mlx5_roce0 \
  --disaggregation-mode prefill \
  --host 192.168.1.11 \
  --port 30000 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.10:5000 \
  --nnodes 2 \
  --node-rank 1 \
  --tp-size 16 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --mem-fraction-static 0.8

# Decode Node 0
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-ib-device mlx5_roce0 \
  --disaggregation-mode decode \
  --host 192.168.1.20 \
  --port 30001 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.20:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 16 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --mem-fraction-static 0.8 \
  --max-running-requests 128

# Decode Node 1
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-ib-device mlx5_roce0 \
  --disaggregation-mode decode \
  --host 192.168.1.21 \
  --port 30001 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.20:5000 \
  --nnodes 2 \
  --node-rank 1 \
  --tp-size 16 \
  --dp-size 8 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --mem-fraction-static 0.8 \
  --max-running-requests 128

Transfer Backend Details

Mooncake

Requirements:
uv pip install mooncake-transfer-engine
Features:
  • RDMA-based high-performance transfers
  • NVLink support (recommended for NVL72)
  • Custom memory pools for optimized transfers
NVLink Transport:
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK
export MC_FORCE_MNNVL=True

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --disaggregation-ib-device mlx5_roce0
Supported memory pools:
  • NVLINK (or True): NVLink transport
  • BAREX: BAR expansion
  • INTRA_NODE_NVLINK: Intra-node NVLink
Environment Variables

Prefill Server:
Variable                                                 Description                    Default
SGLANG_DISAGGREGATION_THREAD_POOL_SIZE                   Worker threads per TP rank     int(0.75 * cpu_count) // 8, clamped to 4-12
SGLANG_DISAGGREGATION_QUEUE_SIZE                         Parallel transfer queues       4
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT                  Bootstrap timeout (seconds)    300
SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL   Cleanup interval (seconds)     120
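One plausible reading of the thread-pool default above, with the clamping to the 4-12 range treated as an assumption rather than confirmed behavior:

```python
import os

# Hypothetical reconstruction of the SGLANG_DISAGGREGATION_THREAD_POOL_SIZE
# default: 0.75 * cpu_count split across 8 TP ranks, clamped to 4-12.
def default_thread_pool_size(cpu_count=None):
    cpus = cpu_count or os.cpu_count() or 8
    return min(max(int(0.75 * cpus) // 8, 4), 12)

print(default_thread_pool_size(64))   # → 6
print(default_thread_pool_size(256))  # → 12
```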
Decode Server:
Variable                                      Description                          Default
SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL      Heartbeat interval (seconds)         5.0
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE   Max consecutive heartbeat failures   2
SGLANG_DISAGGREGATION_WAITING_TIMEOUT         KV cache wait timeout (seconds)      300
Example (relaxed timeouts for high TTFT):
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600

NIXL

Installation:
pip install nixl
Or build from source (if UCX is pre-installed):
git clone https://github.com/ai-dynamo/nixl.git
cd nixl
pip install . --config-settings=setup-args="-Ducx_path=/path/to/ucx"
Single Node:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-transfer-backend nixl

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-transfer-backend nixl

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 --port 8000
Multi-Node: Same as the Mooncake setup, but replace --disaggregation-ib-device with --disaggregation-transfer-backend nixl.
Backend Selection:
export SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC
# Available: UCX (default), LIBFABRIC, or any installed NIXL plugin

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --port 30000

Ascend NPU

Requirements:
Option 1: Memfabric Hybrid
pip install memfabric-hybrid==1.0.0
export ASCEND_MF_STORE_URL="tcp://192.168.1.1:50000"
Option 2: Mooncake
export ENABLE_ASCEND_TRANSFER_WITH_MOONCAKE=true
Set NPU Physical ID (required in containers):
export ASCEND_NPU_PHY_ID=0
Single Node:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000 \
  --disaggregation-transfer-backend ascend

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001 \
  --base-gpu-id 1 \
  --disaggregation-transfer-backend ascend

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 --port 8000
Multi-Node (DeepSeek):
# Prefill Node 0
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-transfer-backend ascend \
  --disaggregation-mode prefill \
  --host 192.168.1.10 \
  --port 30000 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.10:5000 \
  --nnodes 1 \
  --node-rank 0 \
  --tp-size 16

# Decode Node 0
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-transfer-backend ascend \
  --disaggregation-mode decode \
  --host 192.168.1.20 \
  --port 30001 \
  --trust-remote-code \
  --dist-init-addr 192.168.1.20:5000 \
  --nnodes 1 \
  --node-rank 0 \
  --tp-size 16

Combining with Other Parallelism

PD + TP + DP + EP (Full Stack)

Recommended production setup for DeepSeek-V3:
# Prefill instance
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --disaggregation-mode prefill \
  --tp 16 --dp-size 8 --ep 16 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --disaggregation-ib-device mlx5_roce0

# Decode instance
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --disaggregation-mode decode \
  --tp 16 --dp-size 8 --ep 16 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --disaggregation-ib-device mlx5_roce0 \
  --max-running-requests 128

PD + Pipeline Parallelism

# Prefill instance with PP
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1 \
  --disaggregation-mode prefill \
  --tp 8 --pp-size 4 \
  --nnodes 4 --node-rank 0 \
  --dist-init-addr 192.168.1.10:29500 \
  --chunked-prefill-size 4096 \
  --disaggregation-ib-device mlx5_roce0

# Decode instance with PP
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.1 \
  --disaggregation-mode decode \
  --tp 8 --pp-size 4 \
  --nnodes 4 --node-rank 0 \
  --dist-init-addr 192.168.1.20:29500 \
  --disaggregation-ib-device mlx5_roce0
See Pipeline Parallelism for PP tuning details.

Router Integration

SGLang Model Gateway provides load balancing and fault tolerance for PD disaggregation.
Multiple prefill/decode instances:
# Launch prefill instances
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000 --host 0.0.0.0

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30001 --host 0.0.0.0

# Launch decode instances
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30010 --host 0.0.0.0

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30011 --host 0.0.0.0

# Launch router with multiple workers
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://localhost:30000 http://localhost:30001 \
  --decode http://localhost:30010 http://localhost:30011 \
  --host 0.0.0.0 --port 8000
See SGLang Model Gateway - PD Disaggregation for advanced routing policies.

Profiling

To profile prefill or decode workers separately:
# Profile prefill instance
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --profile-prefill  # or set SGLANG_PROFILE_PREFILL=1

# Profile decode instance
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --profile-decode  # or set SGLANG_PROFILE_DECODE=1
See Benchmark and Profiling Guide for details.

Configuration Summary

Parameter                                 Description                        Default    Recommended
--disaggregation-mode                     Instance mode                      None       prefill or decode
--disaggregation-transfer-backend         Transfer backend                   mooncake   mooncake or nixl
--disaggregation-ib-device                InfiniBand device                  None       Your IB device name
--max-running-requests                    Max concurrent requests (decode)   None       128-256
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT   Bootstrap timeout (seconds)        300        600 for high TTFT
SGLANG_MOONCAKE_CUSTOM_MEM_POOL           Custom memory pool                 None       NVLINK for NVL72

Best Practices

  1. Use Mooncake for multi-node deployments with InfiniBand/RoCE
  2. Enable NVLink transport for NVL72 deployments
  3. Set appropriate timeouts based on your TTFT requirements
  4. Use router for load balancing across multiple instances
  5. Monitor transfer bandwidth to ensure optimal performance
  6. Profile instances separately using profiling flags
  7. Combine with DP attention and expert parallelism (EP) for DeepSeek models

Troubleshooting

Transfer Timeout

Symptom: Requests timing out during KV cache transfer
Solution: Increase the bootstrap and waiting timeouts:
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600

Bootstrap Connection Failed

Symptom: Decode instance can't connect to the prefill bootstrap server
Solution: Check network connectivity and the IB device:
# Verify IB device
ibstat

# Check host/port accessibility
telnet <prefill_host> <bootstrap_port>

# Ensure correct device name
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --disaggregation-ib-device mlx5_roce0  # Match your device

Low Transfer Bandwidth

Symptom: Slow KV cache transfers
Solution: Enable NVLink transport (if available):
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=NVLINK
export MC_FORCE_MNNVL=True
Or increase thread pool size:
export SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=12

Memory Cleanup Issues

Symptom: Memory not released after a decode instance disconnects
Solution: Shorten the cleanup interval:
export SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL=60  # Clean up every 60s

Performance Tips

  1. Use RDMA (InfiniBand/RoCE) for multi-node transfers
  2. Enable NVLink for intra-node high-bandwidth transfers
  3. Tune thread pool size based on available CPU cores
  4. Adjust queue size for concurrent transfer batches
  5. Monitor heartbeat failures to detect network issues early
  6. Use multiple decode instances with router for high availability