Core Optimizations Overview
In-Flight Batching
Continuous batching for higher GPU utilization. Process context and generation phases together.
Paged KV Cache
Dynamic memory allocation with cross-request reuse. Reduce memory waste and enable sharing.
Custom Attention Kernels
Fused attention operations for maximum efficiency. FlashAttention-2 and XQA optimizations.
CUDA Graphs
Capture and replay sequences of GPU operations. Minimize CPU overhead.
In-Flight Batching (Continuous Batching)
In-flight batching, also known as continuous batching or iteration-level batching, allows TensorRT-LLM to process requests in different stages (context vs. generation) within the same batch.
Traditional vs. In-Flight Batching
- Traditional Batching: all requests in a batch start together, and the batch holds its slots until the longest request finishes
- In-Flight Batching: requests join and leave the batch at iteration granularity, so finished requests free their slots immediately
How It Works
Mixed-Phase Batching
The scheduler can batch:
- Context phase requests: Processing all prompt tokens (first pass)
- Generation phase requests: Producing one token per step
Dynamic Request Management
- New requests can join at any time
- Completed requests immediately free resources
- No waiting for batch boundaries
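The iteration-level loop described above can be sketched as a small simulation. This is an illustrative toy, not the real Scheduler: the `Request` shape and the one-step-per-prompt context handling are simplifying assumptions. It does mirror two real properties: new requests are admitted at any iteration, and context-phase requests are ordered ahead of generation-phase requests.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    in_context_phase: bool = True

def schedule_step(waiting, active):
    """One iteration of a simplified in-flight batching scheduler (sketch)."""
    # Admit any waiting requests immediately -- no batch boundaries.
    active.extend(waiting)
    waiting.clear()
    # Context-phase requests are ordered before generation-phase requests.
    batch = sorted(active, key=lambda r: not r.in_context_phase)
    finished = []
    for r in batch:
        if r.in_context_phase:
            r.in_context_phase = False   # whole prompt processed this step (simplified)
        else:
            r.generated += 1             # one token per generation step
        if r.generated >= r.max_new_tokens:
            finished.append(r)
    for r in finished:                   # completed requests free resources right away
        active.remove(r)
    return batch, finished
```

Running a few steps shows a request joining mid-flight and being scheduled in the context-first position while an older request is still generating.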
Implementation Detail: Context phase requests must appear before generation phase requests in the input tensor. This constraint is enforced by the Scheduler component in tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py.
Scheduling Parameters
Two key parameters control in-flight batching behavior:
max_batch_size: Maximum number of requests that can be scheduled simultaneously
- Controls how many concurrent requests the engine handles
- Set high to maximize throughput
- Tune at runtime without rebuilding
max_num_tokens: Maximum number of tokens processed in a single forward pass (after removing padding)
- Balances throughput vs. latency
- Higher = better GPU utilization but higher TTFT (time to first token)
- Lower = faster individual requests but lower throughput
- Start with 8192 (default)
- Increase if GPU utilization is low
- Decrease if TTFT SLOs are missed
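As a rough illustration of how the two limits interact, here is a toy admission check (a pure simulation, not the real scheduler). Each entry in `requests` is the token count that request contributes to one step: its full prompt length in the context phase, or 1 in the generation phase.

```python
def fits_in_step(requests, max_batch_size, max_num_tokens):
    """Greedily schedule requests into one forward pass under both limits (sketch)."""
    scheduled, tokens = [], 0
    for i, need in enumerate(requests):
        if len(scheduled) >= max_batch_size:
            break                       # max_batch_size caps concurrency
        if tokens + need > max_num_tokens:
            break                       # max_num_tokens caps per-step work
        scheduled.append(i)
        tokens += need
    return scheduled, tokens
```

With two 4096-token prompts and a budget of 8192, a third request must wait for the next step, which is exactly the latency/throughput tension the tuning bullets above describe.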
Visual Example
The scheduler balances max_batch_size and max_num_tokens constraints.
See the Paged Attention, IFB, and Request Scheduling documentation for detailed scheduler visualizations.
Chunked Context (Chunked Prefill)
To further improve scheduling flexibility, TensorRT-LLM supports chunked context:
- Splits long prompts into multiple chunks
- Allows large prompts to be interleaved with generation phase requests
- Prevents long prompts from blocking the scheduler
- Enables more stable TTFT across requests
Recommendation: Always enable chunked context (enabled by default with paged KV cache). It improves scheduling efficiency with minimal performance impact.
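The chunking arithmetic itself is simple; a minimal sketch (the chunk size here is a stand-in for whatever per-step token budget the scheduler grants):

```python
def chunk_prompt(prompt_len, chunk_size):
    """Split a long prompt into chunk lengths so each scheduler step only
    consumes `chunk_size` tokens of the budget, leaving headroom for
    generation-phase requests in the same step (chunked prefill sketch)."""
    chunks, remaining = [], prompt_len
    while remaining > 0:
        take = min(chunk_size, remaining)
        chunks.append(take)
        remaining -= take
    return chunks
```

A 10,000-token prompt with 4096-token chunks spans three steps instead of monopolizing one large step, which is why TTFT across concurrent requests becomes more stable.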
Paged KV Cache
The paged KV cache is one of the most impactful optimizations in TensorRT-LLM, enabling:
- Dynamic memory allocation - no wasted memory on padding
- Cross-request reuse - share cached blocks between requests with common prefixes
- Memory offloading - keep blocks in CPU memory when GPU is full
Architecture
- Contiguous KV Cache (Inefficient): each request reserves one contiguous region sized for the maximum sequence length, leaving most of it unused
- Paged KV Cache (Efficient): each request holds a list of fixed-size blocks allocated on demand, so memory grows with the actual sequence length
Key Features
Block-Based Allocation
- KV cache divided into fixed-size blocks (configurable, must be power of 2)
- Blocks assigned to requests on-demand
- Typical block size: 16-128 tokens per block
- Multiple transformer layers packed into each block
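The per-request block count and a rough per-block footprint follow directly from the bullets above. The size formula below is a simplified estimate (it assumes K and V for every layer are packed into each block, with no alignment overhead):

```python
import math

def blocks_needed(seq_len, block_size=64):
    """Number of fixed-size KV cache blocks a sequence occupies."""
    return math.ceil(seq_len / block_size)

def kv_block_bytes(block_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Rough bytes per block; the leading 2 accounts for K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem
```

For example, a 100-token sequence with 64-token blocks needs 2 blocks, and a 32-layer GQA model with 8 KV heads of dimension 128 at FP16 uses about 8 MiB per 64-token block.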
Cross-Request Reuse
Radix Tree Storage: Filled blocks are stored in a radix tree structure, allowing later requests to reuse cached computations.
Benefits:
- Reduces redundant computation
- Saves GPU memory (blocks shared, not duplicated)
- Especially effective for:
- System prompts (reused across many requests)
- Common prefixes in multi-turn conversations
- RAG workloads (similar context documents)
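The reuse rule can be illustrated without the radix tree itself: only full, block-aligned prefix matches are reusable. This linear-scan sketch computes the same answer a prefix-tree lookup would (the real structure makes the lookup fast, not different):

```python
def reusable_blocks(cached_prompts, new_prompt, block_size=4):
    """Longest block-aligned shared prefix between a new prompt and any
    previously cached prompt, in whole blocks (radix-tree reuse sketch)."""
    best = 0
    for cached in cached_prompts:
        n = 0
        limit = min(len(cached), len(new_prompt)) // block_size
        while n < limit:
            lo, hi = n * block_size, (n + 1) * block_size
            if cached[lo:hi] != new_prompt[lo:hi]:
                break                   # reuse stops at the first differing block
            n += 1
        best = max(best, n)
    return best
```

A shared system prompt shows the win: if two requests agree on their first 8 tokens, the first two 4-token blocks of KV values are served from cache instead of being recomputed.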
Prioritized Eviction
When memory is full, blocks are evicted using prioritized LRU:
- Each block has a priority (0-100)
- Lower priority blocks evicted first
- Within same priority, least-recently-used evicted first
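The victim-selection rule reduces to a two-key comparison. A one-function sketch (the `blocks` mapping shape is hypothetical):

```python
def pick_eviction_victim(blocks):
    """Prioritized LRU: evict the lowest-priority block first; among equal
    priorities, evict the least recently used.

    `blocks` maps block_id -> (priority, last_used_step).
    """
    return min(blocks, key=lambda b: (blocks[b][0], blocks[b][1]))
```

Note that recency only breaks ties: a hot low-priority block is still evicted before a cold high-priority one.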
Memory Offloading
Before evicting blocks from GPU memory, optionally offload to CPU memory:
- Blocks remain reusable (copied back when needed)
- Controlled by the host_cache_size parameter
- Blocks below secondary_offload_min_priority are evicted directly
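The offload decision can be sketched as a small policy function. The default threshold value here is an arbitrary placeholder, not the library's default:

```python
def on_evict(priority, host_cache_free_bytes, block_bytes,
             secondary_offload_min_priority=35):
    """Decide the fate of a block evicted from GPU memory (sketch).

    Blocks below the priority threshold are dropped outright; others are
    offloaded to the CPU-side cache if it has room, keeping them reusable.
    """
    if priority < secondary_offload_min_priority:
        return "discard"
    if host_cache_free_bytes >= block_bytes:
        return "offload_to_host"
    return "discard"
```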
Multi-Pool Support
Separate pools for different attention configurations:
- Each unique combination of (attention_window_size, num_kv_heads) gets its own pool
- Supports models with variable attention window sizes
- Enables Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
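The pooling rule is essentially a dictionary keyed by the attention configuration. A minimal sketch (class and method names are illustrative, not the KVCacheManager API):

```python
class KvPools:
    """One cache pool per (attention_window_size, num_kv_heads) pair (sketch)."""

    def __init__(self):
        self._pools = {}

    def pool_for(self, attention_window_size, num_kv_heads):
        key = (attention_window_size, num_kv_heads)
        # Lazily create a pool per unique attention configuration, so layers
        # with different window sizes or KV head counts never share blocks.
        return self._pools.setdefault(key, [])
```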
Configuration
Source Code: tensorrt_llm/_torch/pyexecutor/resource_manager.py contains the KVCacheManager implementation.
Detailed Documentation: See KV Cache System
Custom Attention Kernels
TensorRT-LLM uses highly optimized custom attention kernels that significantly outperform naive implementations.
Attention Backends
- TrtllmAttention (Default)
- FlashInferAttention
- VanillaAttention
Hand-optimized CUDA kernels deliver maximum performance. Features of the default TrtllmAttention backend:
- Context Phase: FlashAttention-2 for long sequences, vanilla attention for short
- Generation Phase: Masked Multi-Head Attention with multi-block optimization
- XQA Optimization: Specialized kernels for MQA/GQA in generation
- Fused Operations: RoPE, QKV bias, quantization/dequantization fused into attention
- FP8 Support: FP8 attention for both context and generation
- Paged KV Cache: Native support for block-based cache
FlashAttention-2
For context phase with long sequences, TensorRT-LLM uses the FlashAttention-2 algorithm:
- Memory-efficient: O(N) memory instead of O(N²)
- IO-aware: Optimized for GPU memory hierarchy
- Fast: 2-4x speedup over standard attention
- Exact: Same result as standard attention (not approximate)
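The key numerical trick behind the exactness claim is the online (streaming) softmax: scores are consumed incrementally with a running max, so the full N×N score matrix never needs to be materialized, yet the result matches an ordinary softmax. A scalar-valued sketch of that trick (real kernels do this per tile over vector-valued rows):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over `values`, the core recurrence of
    FlashAttention-style kernels. Maintains a running max `m`, running
    normalizer `d`, and running accumulator `acc`, rescaling whenever a
    new maximum appears."""
    m = float("-inf")
    d = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        d = d * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / d
```

Comparing against the two-pass softmax confirms the streaming version is exact, not approximate.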
XQA Optimization
For Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) in the generation phase, TensorRT-LLM uses XQA (eXtended Query Attention) kernels:
- Optimized for cases where num_kv_heads < num_query_heads
- Supports FP16, BF16, FP8, INT8 KV cache
- Works with paged KV cache
- Automatic heuristic decides between XQA and standard masked MHA
Supported configurations:
- Compute dtype: FP16, BF16
- KV cache dtype: FP16, BF16, FP8, INT8
- Block sizes: 8, 16, 32, 64, 128 tokens per block
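The constraints above amount to an eligibility check that runs before the runtime heuristic chooses between XQA and standard masked MHA. A sketch of that check (function and parameter names are hypothetical, not the internal API):

```python
def can_use_xqa(num_query_heads, num_kv_heads, kv_cache_dtype, block_size):
    """Rough eligibility filter for the XQA kernel path (sketch)."""
    return (
        num_kv_heads < num_query_heads                        # MQA/GQA only
        and kv_cache_dtype in {"fp16", "bf16", "fp8", "int8"}  # supported caches
        and block_size in {8, 16, 32, 64, 128}                 # supported blocks
    )
```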
CUDA Graphs
CUDA Graphs capture sequences of GPU operations and replay them with a single API call, dramatically reducing CPU overhead.
The Problem: CPU Overhead
Every kernel launch carries CPU-side overhead. During the generation phase, many small kernels run per token, so launch overhead can dominate step time.
The Solution: CUDA Graphs
Record the full sequence of kernels once, then replay the captured graph with a single launch per step.
CUDA Graph Padding
Since CUDA Graphs require fixed shapes, TensorRT-LLM uses padding to maximize graph hit rate:
- Capture graphs for multiple batch sizes: [1, 2, 4, 8, 16, 32, …]
- If batch size doesn’t match exactly, pad to nearest larger captured size
- Some compute is “wasted” on padded tokens, but replay is still faster than eager mode
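The "pad to nearest larger captured size" rule is a simple ordered lookup. A sketch (the captured-size list is an example, not the library's actual capture set):

```python
import bisect

CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

def padded_batch_size(actual):
    """Pad the runtime batch size up to the nearest captured CUDA Graph
    size; return None (eager fallback) if it exceeds the largest graph."""
    i = bisect.bisect_left(CAPTURED_BATCH_SIZES, actual)
    return CAPTURED_BATCH_SIZES[i] if i < len(CAPTURED_BATCH_SIZES) else None
```

A batch of 3 requests replays the size-4 graph with one padded slot; a batch of 16 hits its graph exactly.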
Performance Impact: CUDA Graphs with padding can improve throughput by up to 22% on certain models and hardware configurations.
Automatic Management
CUDA Graphs are managed automatically by TensorRT-LLM:
- Enabled by default for generation phase
- Cannot be used for context phase (variable sequence lengths)
- Warmup phase captures common batch sizes
- Runtime automatically selects best captured graph
CUDA Graphs are applied to the model forward pass in the PyExecutor. See tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py for implementation details.
Overlap Scheduler
The Overlap Scheduler hides CPU-bound work behind GPU computation, maximizing throughput.
How It Works
- Without Overlap (Sequential): the CPU prepares step N+1 only after the GPU finishes step N
- With Overlap (Concurrent): the CPU prepares step N+1 while the GPU is still executing step N
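The benefit can be seen with an idealized timeline model: without overlap, every step pays CPU plus GPU time; with overlap, only the longer of (this step's GPU work, next step's CPU work) sits on the critical path. A sketch:

```python
def step_time(cpu_ms, gpu_ms, overlap):
    """Total wall time for a run of steps, each with CPU-side preparation
    and GPU execution (idealized model, perfect overlap assumed)."""
    if not overlap:
        return sum(c + g for c, g in zip(cpu_ms, gpu_ms))
    total = cpu_ms[0]                     # the first prep cannot be hidden
    for i, g in enumerate(gpu_ms):
        nxt = cpu_ms[i + 1] if i + 1 < len(cpu_ms) else 0
        total += max(g, nxt)              # prep for step i+1 hides behind GPU step i
    return total
```

With 2 ms of CPU prep and 5 ms of GPU work per step over three steps, the sequential schedule takes 21 ms while the overlapped one takes 17 ms.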
Implementation
Enabled by default. To disable:
LLM(disable_overlap_scheduler=True)
Additional Optimizations
Packed Tensors (Remove Input Padding)
All TensorRT-LLM operations use packed tensors with no padding:
- Reduced memory usage
- No wasted FLOPS on padding tokens
- Required for in-flight batching
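A packed layout concatenates variable-length sequences into one flat buffer and keeps cumulative-length offsets to find each sequence again, the usual companion structure of padding-free kernels. A sketch:

```python
def pack_sequences(seqs):
    """Concatenate variable-length sequences into one padding-free buffer,
    plus cumulative-length offsets (cu_seqlens)."""
    packed, cu_seqlens = [], [0]
    for s in seqs:
        packed.extend(s)
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens

def unpack(packed, cu_seqlens, i):
    """Recover sequence i from the packed buffer via its offsets."""
    return packed[cu_seqlens[i]:cu_seqlens[i + 1]]
```

Three sequences of lengths 3, 1, and 2 occupy exactly 6 slots, with no padding to the longest sequence.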
Quantization
Reduce memory and increase throughput with quantization:
- FP8: ~2x memory reduction, minimal accuracy loss
- INT8: ~4x memory reduction, careful calibration required
- INT4: ~8x memory reduction, for extreme memory constraints
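The memory arithmetic is straightforward bits-per-parameter accounting; a back-of-the-envelope helper (weights only, ignoring activations, KV cache, and quantization scales):

```python
def weight_memory_gb(num_params_b, bits):
    """Approximate weight memory in GB for `num_params_b` billion
    parameters stored at the given bit width."""
    return num_params_b * 1e9 * bits / 8 / 1e9

# For a 70B-parameter model:
fp16_gb = weight_memory_gb(70, 16)  # 140 GB
fp8_gb = weight_memory_gb(70, 8)    # 70 GB
int4_gb = weight_memory_gb(70, 4)   # 35 GB
```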
Speculative Decoding
Accelerate generation with draft models:
- Small draft model proposes multiple tokens
- Large target model verifies in parallel
- Accepted tokens have same quality as target model
- Speedup: 1.5-3x depending on acceptance rate
- Supported approaches: EAGLE, Medusa, n-gram, model-based drafting
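The accept/reject step can be sketched for the greedy-decoding case (a simplification; sampling-based verification uses an acceptance probability instead of exact match):

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy draft verification sketch: the target model's tokens for the
    same positions are compared left to right. The first mismatch truncates
    the draft and the target's own token is substituted, so the output is
    exactly what the target model alone would have produced."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)           # draft token accepted for free
        else:
            accepted.append(t)           # target's correction ends the run
            break
    else:
        # Every draft token accepted; the target also yields one bonus token.
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

When the draft guesses well, several tokens are emitted for a single target-model forward pass, which is where the 1.5-3x speedup comes from.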
Multi-GPU Optimizations
- Tensor Parallelism: Split layers across GPUs
- Pipeline Parallelism: Distribute layers across GPUs
- Disaggregated Serving: Separate prefill and decode on different GPU pools
  - Prefill: High computation, low memory
  - Decode: Low computation, high memory
  - Optimize each pool independently
Performance Tuning Checklist
Enable Core Optimizations
Ensure these are enabled (most are default):
- ✅ In-flight batching (automatic)
- ✅ Paged KV cache (enable_block_reuse=True)
- ✅ Overlap scheduler (default, or set disable_overlap_scheduler=False)
- ✅ CUDA Graphs (automatic for generation phase)
- ✅ Chunked context (automatic with paged KV cache)
Tune Scheduling Parameters
- Set max_batch_size high (e.g., 256+) to avoid it becoming the bottleneck
- Tune max_num_tokens (start with 8192; increase for throughput, decrease for latency)
- Consider workload characteristics (prompt length distribution)
Optimize KV Cache
- Set free_gpu_memory_fraction=0.9 to use most of GPU memory
- Enable host_cache_size if you have spare CPU memory
- Configure retention policies for workloads with common prefixes
Choose Best Attention Backend
- The default trtllm backend is recommended
- Try flashinfer if you need FP8 KV cache
- Benchmark both for your specific workload
System Architecture
Understand how components work together
Backend Selection
Choose the right backend for your use case