Attention Variants
MHA
Multi-Head Attention: one KV head per query head. The original Transformer design, with maximum expressiveness.
MQA
Multi-Query Attention: a single KV head shared across all query heads. Minimal KV cache memory.
GQA
Grouped-Query Attention: KV heads are divided into groups, each shared by several query heads. Balances memory and model quality.
All three attention variants are described in the Attention Is All You Need, Multi-Query Attention, and Grouped-Query Attention papers.
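The difference between the three variants comes down to how query heads map onto KV heads. A minimal pure-Python sketch (the helper name and head counts are illustrative, not part of any TensorRT-LLM API):

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Each group of num_q_heads // num_kv_heads consecutive query heads
    shares one KV head. MHA and MQA are the two extremes of this mapping."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# MHA: 8 query heads, 8 KV heads -> one KV head per query head
assert [kv_head_for_query_head(h, 8, 8) for h in range(8)] == [0, 1, 2, 3, 4, 5, 6, 7]
# MQA: 8 query heads, 1 KV head -> every query reads KV head 0
assert [kv_head_for_query_head(h, 8, 1) for h in range(8)] == [0] * 8
# GQA: 8 query heads, 2 KV heads -> two groups of 4
assert [kv_head_for_query_head(h, 8, 2) for h in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]
```

The KV cache size scales with the number of KV heads, which is why MQA and GQA shrink it by the grouping factor relative to MHA.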
Attention Backends
TensorRT-LLM provides three attention backends optimized for different use cases:
TRT-LLM Backend (Default)
The default and most optimized backend for production use:
- Flash Attention for the context phase
- Masked MHA with multi-block optimization for generation
- XQA kernels for MQA/GQA models
- FP8 input/output and KV cache quantization
- Fused QKV input support
- RoPE fusion
- Paged and contiguous KV cache
Recommended for all production deployments. Offers the best performance and supports all TensorRT-LLM features.
FlashInfer Backend
Performance-optimized backend built on the FlashInfer library:
- In-flight batching
- Paged KV cache
- FP8 quantization for inputs and KV cache
- RoPE fusion
Vanilla Backend
Reference implementation for debugging and baseline comparisons.
Context Phase Optimizations
Flash Attention
For the context (prefill) phase, TensorRT-LLM uses Flash Attention kernels:
- Short sequences: vanilla MHA implementation
- Long sequences: Flash Attention algorithm (reduces memory from O(N²) to O(N))
- Fused softmax and attention computation
- Minimal intermediate tensor materialization
Both Flash Attention 1 and Flash Attention 2 kernels are supported, based on FlashAttention: Fast and Memory-Efficient Exact Attention:
- Tiling-based algorithm
- IO-aware attention
- Reduces HBM reads/writes
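The tiling trick rests on an online (one-pass) softmax: a running max and running normalizer are rescaled as each tile of scores streams in, so the full N x N score matrix never has to be materialized. A scalar pure-Python sketch of the idea (function name is illustrative):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum of `values` -- the core recurrence
    behind Flash Attention's tiling. Each new score rescales the running
    max `m`, normalizer `l`, and accumulator `acc`."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l

# Matches the naive two-pass softmax:
scores = [0.1, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
naive = (sum(math.exp(s) * v for s, v in zip(scores, values))
         / sum(math.exp(s) for s in scores))
assert abs(online_softmax_weighted_sum(scores, values) - naive) < 1e-9
```

In the real kernels this recurrence runs per tile over vector-valued rows in on-chip SRAM, which is what reduces HBM traffic.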
FP8 Context FMHA
When FP8 quantization is enabled, context attention is further accelerated. FP8 Paged Context FMHA is supported on Ada, Hopper, and Blackwell GPUs.
Generation Phase Optimizations
Masked Multi-Head Attention
The generation (decode) phase uses a specialized masked MHA kernel:
- On-the-fly QKV bias addition
- Fused RoPE application
- Dequantization/quantization support
- Multi-block mode to recover GPU occupancy when batch size is small
Multi-Block Mode
When batch size and number of heads are small, multi-block mode distributes work across multiple CUDA thread blocks. Multi-block mode is triggered automatically by internal heuristics: it activates when sequences are long enough and GPU occupancy is low.
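Splitting one attention head's sequence across several thread blocks works because per-block softmax partials can be merged exactly in a final reduction step. A pure-Python sketch under that assumption (function names are illustrative):

```python
import math

def block_partial(scores, values):
    """One 'thread block' processes its slice of the KV sequence, producing a
    local max, local normalizer, and locally weighted value sum."""
    m = max(scores)
    l = sum(math.exp(s - m) for s in scores)
    acc = sum(math.exp(s - m) * v for s, v in zip(scores, values))
    return m, l, acc

def merge_blocks(partials):
    """The extra reduction kernel multi-block mode adds: rescale every
    partial to the global max, then combine."""
    m = max(p_m for p_m, _, _ in partials)
    l = sum(p_l * math.exp(p_m - m) for p_m, p_l, _ in partials)
    acc = sum(p_acc * math.exp(p_m - m) for p_m, _, p_acc in partials)
    return acc / l

# Splitting the sequence across two blocks matches single-block attention.
scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
whole = merge_blocks([block_partial(scores, values)])
split = merge_blocks([block_partial(scores[:3], values[:3]),
                      block_partial(scores[3:], values[3:])])
assert abs(whole - split) < 1e-9
```

Because the merge is exact, the heuristic can choose any split that fills the GPU without changing results.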
XQA Optimization
XQA (eXtended Query Attention) is a specialized kernel for MQA/GQA in the generation phase. It supports:
- FP16 / BF16 compute data type
- FP16 / BF16 / FP8 / INT8 KV cache data type
- Paged KV cache (8 / 16 / 32 / 64 / 128 tokens per block)
XQA is enabled by default with automatic heuristics. Set TRTLLM_FORCE_XQA=1 to always use XQA when the model configuration is supported.
In-Flight Batching
TensorRT-LLM supports continuous batching of requests:
- Context-phase sequences can be batched with generation-phase sequences
- Reduces latency and improves GPU utilization
- Requires packed (non-padded) input tensors
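A packed layout concatenates variable-length sequences into one flat buffer with cumulative-length offsets instead of padding each request to a common length. A small sketch of the idea (helper and variable names are illustrative, not TensorRT-LLM API):

```python
def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one flat buffer and
    record cumulative boundaries; no padding tokens are stored."""
    packed, cu_seqlens = [], [0]
    for s in seqs:
        packed.extend(s)
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens

# Three requests of different lengths share one packed buffer.
packed, cu = pack_sequences([[11, 12, 13], [21], [31, 32]])
assert packed == [11, 12, 13, 21, 31, 32]
assert cu == [0, 3, 4, 6]
# Request i occupies packed[cu[i]:cu[i + 1]].
assert packed[cu[1]:cu[2]] == [21]
```

This is why mixing a long context-phase request with many single-token generation-phase requests in one batch wastes no compute on padding.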
Chunked Context
Long contexts can be split into chunks for better batching:
- Context chunks batch with generation tokens
- Increases total throughput
- Removes constraints on input length
- Better GPU utilization
Chunk size (except the last chunk) must be an integer multiple of the KV cache block size.
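The chunking rule above is easy to picture with a sketch (function name is illustrative): every chunk is the configured size, itself a multiple of the KV cache block size, and only the final chunk may be a remainder.

```python
def split_context(input_len, chunk_size, kv_block_size):
    """Split a long context into chunk lengths. All chunks except possibly
    the last are exactly chunk_size tokens, and chunk_size must be an
    integer multiple of the KV cache block size."""
    assert chunk_size % kv_block_size == 0, \
        "chunk size must be a multiple of the KV cache block size"
    chunks, start = [], 0
    while start < input_len:
        end = min(start + chunk_size, input_len)
        chunks.append(end - start)
        start = end
    return chunks

# A 1000-token context with 256-token chunks and 64-token KV blocks:
assert split_context(1000, 256, 64) == [256, 256, 256, 232]
```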
Advanced Features
Rotary Positional Embedding (RoPE)
RoPE is fused into the attention operator. Two variants are supported:
- rope_gpt_neox: standard GPT-NeoX RoPE
- rope_gptj: GPT-J variant
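The operation being fused is a position-dependent rotation of feature pairs. A pure-Python sketch using GPT-J-style adjacent pairing (rope_gpt_neox instead pairs the two halves of the vector; the function name is illustrative):

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate adjacent feature pairs (x[0], x[1]), (x[2], x[3]), ... by
    position-dependent angles. Pairing convention here follows GPT-J."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

# Position 0 is the identity, and rotation preserves the vector norm.
x = [1.0, 0.0, 0.5, -0.5]
assert rope_rotate(x, 0) == x
assert abs(sum(v * v for v in rope_rotate(x, 7)) - sum(v * v for v in x)) < 1e-9
```

Fusing this rotation into the attention kernel avoids a separate elementwise pass over Q and K.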
ALiBi (Attention with Linear Biases)
ALiBi slopes are computed on-the-fly.
Cross Attention
Support for encoder-decoder models.
Sliding Window Attention
Limited attention windows with a cyclic KV cache. When the input length exceeds attention_window_size, sliding window attention is automatically activated in the context phase.
Beam Search
Beam search is supported via cache indirection: the cache_indirection tensor (shape [batch_size, beam_width, max_seqlen]) tracks which beam path to read the KV cache from at each token position.
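Indirection avoids physically copying KV entries when beams branch: each beam's cache reads follow a recorded per-position beam index. A pure-Python sketch of that lookup (names and the toy data are illustrative, not TensorRT-LLM internals):

```python
def read_kv_for_beam(kv_cache, cache_indirection, batch, beam, pos):
    """Fetch the KV entry for (batch, beam) at token position pos, following
    the beam row recorded in cache_indirection when that token was written."""
    src_beam = cache_indirection[batch][beam][pos]
    return kv_cache[batch][src_beam][pos]

# One request, beam_width 2, 3 generated tokens. Suppose at the last step
# beam 0 branched off beam 1, so its first two tokens live in beam row 1.
kv_cache = [[["a0", "a1", "a2"],    # physical beam row 0
             ["b0", "b1", "b2"]]]   # physical beam row 1
cache_indirection = [[[1, 1, 0],    # beam 0 reads rows 1, 1, then 0
                      [1, 1, 1]]]   # beam 1 reads rows 1, 1, 1
history = [read_kv_for_beam(kv_cache, cache_indirection, 0, 0, p) for p in range(3)]
assert history == ["b0", "b1", "a2"]
```

The generation kernel performs this gather per token position, so beam reordering costs an index update rather than a cache copy.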
Performance Tuning
Backend Selection
- TRT-LLM backend: Best overall performance, recommended for production
- FlashInfer backend: Good alternative, may be faster for specific workloads
- Vanilla backend: Only for debugging and validation
KV Cache Configuration
- Use paged KV cache for better memory efficiency
- Enable FP8 KV cache on Hopper+ GPUs (2x memory reduction)
- Configure max_attention_window for models with limited attention windows
- Enable host offloading for long-running sessions
Context Processing
- Enable chunked context for very long inputs
- Use FP8 quantization for context FMHA (Ada, Hopper, Blackwell)
- Batch context requests together when possible
Generation Phase
- XQA kernels automatically optimize MQA/GQA models
- Multi-block mode improves performance at low batch sizes
- Use in-flight batching to mix context and generation phases
Code Example: Custom Attention Configuration
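No code survived under this heading, so the following is a minimal, hypothetical sketch of what a custom attention configuration might look like with the TensorRT-LLM Python LLM API. The model name is a placeholder, and the parameter names shown (kv_cache_config, free_gpu_memory_fraction, enable_block_reuse) are assumptions to verify against the KV cache documentation before use; only TRTLLM_FORCE_XQA comes from this page.

```python
# Hypothetical sketch -- verify class and parameter names against the
# current TensorRT-LLM LLM API documentation before relying on them.
import os

# Force XQA kernels when the model configuration supports them (see above).
os.environ["TRTLLM_FORCE_XQA"] = "1"

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # share of free GPU memory for the paged KV cache
        enable_block_reuse=True,       # reuse KV blocks across requests with shared prefixes
    ),
)

outputs = llm.generate(["Explain grouped-query attention in one sentence."])
print(outputs[0].outputs[0].text)
```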
Additional Resources
Flash Attention Paper
Original Flash Attention algorithm
Flash Attention 2 Paper
Improved Flash Attention with better parallelism
GQA Paper
Grouped-Query Attention for efficient inference
KV Cache Documentation
Detailed KV cache configuration guide