Overview
Disaggregated serving separates the context (prefill) and generation (decode) phases of LLM inference onto different GPU pools. This architecture eliminates interference between phases and enables independent optimization of time-to-first-token (TTFT) and time-per-output-token (TPOT) metrics.
Motivation
LLM inference consists of two distinct phases with different compute characteristics:
- Context (Prefill): Computes the KV cache for all prompt tokens in parallel
- Generation (Decode): Generates tokens one by one using cached values
Aggregated vs Disaggregated Serving
Aggregated Serving (Traditional)
In aggregated serving, both phases share the same GPU resources and parallelism strategy. This leads to:
- Context processing delays token generation, increasing TPOT
- Reduced interactivity due to interference
- Single GPU type and parallelism configuration for both phases
- Optimizing one metric (TTFT) comes at the expense of another (TPOT)
Disaggregated Serving
Disaggregated serving addresses these issues by:
- Running phases on separate GPU pools with different parallelism strategies
- Removing interference between context and generation
- Enabling independent optimization of TTFT and TPOT
- Allowing different GPU types for each phase
Architecture
KV Cache Exchange
The KV cache exchange module is modularly decoupled from the KV cache manager and the underlying communication libraries. It handles:
- Efficient transmission and reception of cache blocks
- Prompt cache space release
- Cache layout conversions during exchange
- RDMA / NVLink communication

Multi-Backend Support
TensorRT-LLM supports multiple communication protocols:
- NIXL: Default backend with dynamic scaling support
- UCX: Recommended backend with dynamic node joining/leaving
- MPI: Traditional MPI-based communication
NIXL Backend Configuration
NIXL supports multiple underlying communication backends, configured via the TRTLLM_NIXL_KVCACHE_BACKEND environment variable:
- UCX (default)
- LIBFABRIC (available from v0.16.0)
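Selecting the transport is a matter of setting the environment variable before launching the servers; a minimal sketch (the variable name and values are from this document, everything else is illustrative):

```shell
# Choose the underlying NIXL communication backend.
# UCX is the default; LIBFABRIC is available from v0.16.0.
export TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC
```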
Optimizations
Overlap Optimization
TensorRT-LLM overlaps KV cache transmission with computation across multiple independent requests:
- While one request sends or receives KV cache blocks, other requests proceed with computation
- If instances use multiple GPUs, KV cache transmission between different GPU sets occurs in parallel
- Significantly reduces end-to-end latency

Cache Layout Transformation
Disaggregated serving supports different parallelism strategies for the context and generation phases:
- Direct device-to-device memory transfer minimizes latency
- Automatic KV cache block mapping between different parallel configurations
- Example: Context with TP2 → Generation with PP2
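To make the TP2 → PP2 example concrete, here is a small illustrative sketch of how shard ownership can be reasoned about: under TP2 each context GPU holds half the attention heads of every layer, while under PP2 each generation GPU holds all heads of a contiguous half of the layers. This is not the TensorRT-LLM implementation, just the mapping idea; all names and sizes are made up for illustration.

```python
# Hypothetical model shape for the example.
NUM_LAYERS = 4
NUM_HEADS = 8

def tp_shard(layer: int, head: int, tp: int = 2) -> int:
    """Context rank that owns (layer, head) under tensor parallelism:
    heads are split evenly across ranks, every rank has every layer."""
    return head // (NUM_HEADS // tp)

def pp_shard(layer: int, head: int, pp: int = 2) -> int:
    """Generation rank that owns (layer, head) under pipeline parallelism:
    layers are split evenly across ranks, every rank has every head."""
    return layer // (NUM_LAYERS // pp)

def transfer_plan():
    """Map each (layer, head) KV block to (source context rank,
    destination generation rank) for the TP2 -> PP2 exchange."""
    return {(l, h): (tp_shard(l, h), pp_shard(l, h))
            for l in range(NUM_LAYERS) for h in range(NUM_HEADS)}
```

Note that every context rank ends up sending blocks to every generation rank, which is why the exchange module must remap block layouts rather than copy shards one-to-one.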

Unique Global Request ID
For end-to-end request tracking, provide a unique global request ID. Use values larger than 1 << 42 = 4398046511104 to avoid collisions with worker-local or warm-up requests. Do not route context and generation requests with the same ID to the same worker.
Setup and Configuration
Using trtllm-serve
Step 1: Configure Context and Generation Servers
Create configuration files for each server type, for example context_config.yml. Set max_tokens_in_buffer to a value greater than or equal to the maximum input sequence length (ISL) across all requests for optimal performance.
Step 2: Launch Context and Generation Servers
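Launching the two server pools might look like the following sketch. The model name, ports, GPU assignments, and config file names are illustrative; check the flag names against the trtllm-serve CLI of your release.

```shell
# Context (prefill) server on GPU 0 -- model name and ports are examples
CUDA_VISIBLE_DEVICES=0 trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --host localhost --port 8001 \
  --extra_llm_api_options context_config.yml &

# Generation (decode) server on GPU 1
CUDA_VISIBLE_DEVICES=1 trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --host localhost --port 8002 \
  --extra_llm_api_options generation_config.yml &
```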
Step 3: Launch Disaggregated Orchestrator
Create the orchestrator configuration: disagg_config.yaml
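A sketch of what disagg_config.yaml might contain, assuming one context server and one generation server on localhost; key names follow the trtllm-serve disaggregated examples and should be verified against your release:

```yaml
# Orchestrator endpoint
hostname: localhost
port: 8000
# Backend server pools (ports match the launch step above)
context_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
```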
Step 4: Send Requests
The disaggregated server provides an OpenAI-compatible endpoint and orchestrates the request flow:
- The orchestrator routes requests to context servers (marked as “context-only”)
- The context server returns the prompt tokens, the first generated token, and ctx_params metadata
- The orchestrator forwards ctx_params to the generation servers
- The generation server retrieves the KV cache blocks and completes generation
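Since the endpoint is OpenAI-compatible, sending a request looks like any OpenAI-style completion call; the model name, prompt, and port below are illustrative:

```shell
# Client request to the orchestrator; KV cache handoff between
# context and generation servers happens transparently.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain disaggregated serving in one sentence.",
        "max_tokens": 64
      }'
```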
Using Dynamo
For production deployments, Dynamo provides:
- Data center-scale inference server for LLM workloads
- Decoupled pre- and post-processing workers for high concurrency
- Smart router for optimal decode worker selection
- Built-in Kubernetes deployment support
- Monitoring and metrics collection
- Dynamic instance scaling (in development)

SLURM Deployment
For SLURM cluster deployments, see the disaggregated inference benchmark scripts.
Multiple Orchestrator Instances
To increase maximum concurrency without additional GPU nodes, deploy multiple disaggregated server instances across different nodes, each managing the same context/generation servers.
Two-node example:
- Node A: Context servers at node-a:8001, generation servers at node-b:8002, orchestrator at node-a:8000
- Node B: Same context/generation servers, orchestrator at node-b:8000
- Clients: Send requests to both node-a:8000 and node-b:8000 (use a load balancer)
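Under this layout, Node B's orchestrator config would point at the same server pools as Node A's; a sketch with illustrative values, using the same assumed key names as the earlier disagg_config.yaml:

```yaml
# disagg_config.yaml on Node B -- same backend pools, local orchestrator
hostname: node-b
port: 8000
context_servers:
  num_instances: 1
  urls:
    - "node-a:8001"
generation_servers:
  num_instances: 1
  urls:
    - "node-b:8002"
```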
Environment Variables
Communication Backend
Performance Tuning
Memory Management
CUDA and Communication
UCX Configuration (Different NVLink Domains)
If servers run on different NVLink domains (check with nvidia-smi -q → Fabric.ClusterUUID):
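The domain check mentioned above can be done from the shell on each host; the field name follows the nvidia-smi -q output cited in this document:

```shell
# Print the NVLink fabric cluster UUID of every GPU on this host.
# Hosts whose UUIDs differ are in different NVLink domains.
nvidia-smi -q | grep -i "ClusterUUID"
```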
Troubleshooting
FAQs
What are the limitations?
- Only decoder-only models supported
- Beam width must be 1
- KV cache must be homogeneous (same data type and number of attention heads at each layer)
Can context and generation use different engines?
Yes. When using the TensorRT backend, context and generation instances can use different engines with different parallelism (TP, PP). TensorRT-LLM handles KV cache heterogeneity automatically.
Can one instance handle both context and generation?
Yes, but not recommended. TensorRT-LLM does not implement optimal scheduling for mixed workloads. Run context-only and generation-only requests on separate server sets.
Multi-GPU and multi-node support?
Yes. Different instances should use different GPUs (control with CUDA_VISIBLE_DEVICES). Context and generation servers can run on the same node or on different nodes.
Why is LIBFABRIC backend not working?
The TensorRT-LLM container doesn’t include the NIXL LIBFABRIC plugin by default. Either:
- Rebuild NIXL with libfabric and hwloc installed
- Set NIXL_PLUGINS_DIR to a directory containing a compatible libplugin_LIBFABRIC.so
Why low bandwidth during first requests?
Communication channels are established dynamically. Connection establishment incurs significant overhead during initial requests. Perform a warm-up phase before benchmarking.
Does TensorRT-LLM support GPU Direct RDMA?
Yes, TensorRT-LLM supports GPU Direct RDMA for inter-node KV cache transfer.