## Automatic Backend Selection
If you don't specify `--attention-backend`, SGLang makes a best effort to automatically select the most performant backend for your hardware and model architecture.
### MHA Models (e.g., Llama, Qwen)
- **Hopper (H100, H200)**: defaults to `fa3` if using CUDA 12.3+ and the model configuration is supported
- **Blackwell (B200)**: defaults to `trtllm_mha`, unless using speculative decoding with `topk > 1`
- **Other architectures (Ampere, Ada)**: defaults to `flashinfer` if available; otherwise falls back to `triton`
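If you want to override auto-selection, pass `--attention-backend` explicitly at launch. A minimal sketch (the model path is illustrative; backend names follow the list above):

```shell
# Pin the FlashAttention 3 backend on a Hopper machine instead of
# relying on auto-selection (model path is a placeholder).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend fa3
```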
### MLA Models (e.g., DeepSeek V3)
- **Hopper**: defaults to `fa3` (requires CUDA 12.3+)
- **Blackwell**: defaults to `trtllm_mla`
- **Other architectures**: defaults to `triton`
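The same override works for MLA models. A sketch for a Blackwell deployment (model path and tensor-parallel size are placeholders):

```shell
# DeepSeek V3 is an MLA model; on B200 the auto-selected backend is
# trtllm_mla, but it can also be pinned explicitly.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --attention-backend trtllm_mla \
  --tp 8
```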
## Backend Support Matrix

### MHA (Multi-Head Attention) Backends
| Backend | Page Size > 1 (native) | FP8 KV Cache | FP4 KV Cache | Spec topk=1 | Spec topk>1 | Sliding Window | MultiModal |
|---|---|---|---|---|---|---|---|
| FlashInfer | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| FA4 (FlashAttention 4) | 128 | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Triton | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Torch Native (SDPA) | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| FlexAttention (PyTorch) | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| TRTLLM MHA | 16, 32, 64 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| Dual Chunk FlashAttention | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AITER (ROCm) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Wave (ROCm) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Ascend (NPU) | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ |
| Intel XPU | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Intel AMX (CPU) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
### MLA (Multi-Head Latent Attention) Backends
| Backend | Native Page Sizes | FP8 KV Cache | FP4 KV Cache | Chunked Prefix Cache | Spec topk=1 | Spec topk>1 |
|---|---|---|---|---|---|---|
| FlashInfer MLA | 1 | ❌ | ✅ | ✅ | ✅ | ❌ |
| FlashMLA | 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| Cutlass MLA | 128 | ✅ | ✅ | ✅ | ✅ | ❌ |
| TRTLLM MLA (Blackwell) | 32, 64 | ✅ | ✅ | ✅ | ✅ | ❌ |
| FA3 (FlashAttention 3) | n/a | ❌ | ❌ | ✅ | ✅ | ⚠️ (page_size=1 only) |
| Triton | n/a | ❌ | ❌ | ❌ | ✅ | ⚠️ (page_size=1 only) |
| FA4 | 1 | ❌ | ✅ | ❌ | ❌ | ❌ |
| Ascend MLA (NPU) | 128 | ❌ | ❌ | ❌ | ❌ | ❌ |
Multimodal attention is selected separately via `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.

**Page size and prefix cache:** page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the cached tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it never fills a complete page and cannot be matched in the prefix cache. Use `page_size = 1` for maximum prefix reuse (token-level matching).

## Backend Descriptions
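The prefix-cache rule above comes down to integer division: only tokens that fill complete pages are matchable. A small illustrative calculation (not SGLang code):

```shell
# Matchable prefix tokens = complete pages only (integer division).
matchable() { echo $(( ($1 / $2) * $2 )); }

matchable 32 64    # 32-token prompt, page_size=64 -> 0 (no prefix hit possible)
matchable 150 64   # 150-token prompt, page_size=64 -> 128 (two full pages reusable)
matchable 150 1    # page_size=1 -> 150 (token-level matching)
```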
### FlashInfer

**Best for:** general-purpose MHA models on non-Hopper GPUs (A100, A40)

High-performance attention implementation with broad feature support, including FP8 KV cache, speculative decoding, and sliding window attention.

### FlashAttention 3 (FA3)

**Best for:** Hopper GPUs (H100, H200, H20)

Default backend on Hopper machines. Optimized for the SM90 architecture, with excellent performance for both MHA and MLA models.

### FlashAttention 4 (FA4)

**Best for:** Blackwell GPUs (B200) and FP4 KV cache workloads

Supports both prefill and decode on SM90 (Hopper) and SM100 (Blackwell). On Hopper, requires `page_size = 128`.

### FlashMLA

**Best for:** MLA models with FP8 KV cache on Hopper

Specialized backend for the MLA architecture with native support for FP8 and FP4 KV cache. Requires `page_size = 64`.

### TRTLLM MLA

**Best for:** Blackwell (B200) with MLA models

Optimized for Blackwell GPUs, with excellent performance for MLA models. Supports FP8 and FP4 KV cache.

### TRTLLM MHA

**Best for:** Blackwell (B200) with MHA models

Optimized for Blackwell GPUs. Supports a `page_size` of 16, 32, or 64.

### Triton

**Best for:** development, debugging, and FP4 KV cache

Flexible Triton-based implementation supporting FP4 KV cache and various advanced features. A good fallback for otherwise unsupported configurations.

### Cutlass MLA

High-performance MLA backend using CUTLASS kernels. Requires `page_size = 128`.
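Backend, page size, and KV cache dtype are chosen together. A sketch pairing FlashMLA with an FP8 KV cache on Hopper (model path is illustrative; `fp8_e4m3` is one of the `--kv-cache-dtype` values):

```shell
# FlashMLA requires page_size = 64; pair it with an FP8 KV cache.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --attention-backend flashmla \
  --page-size 64 \
  --kv-cache-dtype fp8_e4m3
```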
## Platform-Specific Backends

### AMD ROCm

- **AITER**: recommended for ROCm platforms

### Ascend NPU

### Intel XPU

## Other Backends

- **Torch Native (SDPA)**: PyTorch's scaled dot-product attention

## GDN Attention Backends
GDN (Gated Delta Network) is a linear attention mechanism with O(n) complexity, used in hybrid models that alternate GDN linear attention layers with standard full attention layers (e.g., Qwen 3.5, Qwen 3 Next, Jet Nemotron, Jet VLM). GDN is not selected via `--attention-backend`; it is activated automatically when the model architecture requires it. The GDN linear attention layers have their own kernel backends, selected via `--linear-attn-backend` (default: `triton`).
| Backend | Decode | Prefill / Extend | Spec Decoding (Target Verify) |
|---|---|---|---|
| Triton (CUDA) | ✅ | ✅ | ✅ |
| Triton (AMD/ROCm) | ✅ | ✅ | ✅ |
| Triton (NPU) | ✅ | ✅ | ❌ |
| Triton (CPU) | ✅ | ✅ | ❌ |
| CuTe DSL (CUDA only) | ✅ | ❌ | ❌ |
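Because GDN layers are activated by the model architecture itself, only the linear-attention kernel choice is exposed. A sketch for a hybrid GDN model (model path is illustrative):

```shell
# Hybrid GDN model: full-attention layers still honor --attention-backend,
# while the GDN linear layers use --linear-attn-backend (default: triton).
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --linear-attn-backend triton
```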
## Hybrid Attention (Experimental)

You can mix and match attention backends for prefill and decode. This is useful when one backend excels at prefill and another excels at decode.

### Speculative Decoding with Hybrid Attention

The backend used for draft decoding and target verification depends on `--speculative-attention-mode`:

- `--speculative-attention-mode decode` (recommended): draft/verify use the decode backend
- `--speculative-attention-mode prefill` (default): draft/verify use the prefill backend

Constraints:

- If any attention backend is `trtllm_mha`, speculative decoding supports only `--speculative-eagle-topk 1`
- For paged MHA backends with `--page-size > 1` and `--speculative-eagle-topk > 1`, only `flashinfer` is supported
- CUDA graphs: the decode backend is always captured; the prefill backend is captured only when `--speculative-attention-mode prefill` is set
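A hedged sketch of a hybrid setup, assuming separate `--prefill-attention-backend` and `--decode-attention-backend` flags (flag names as exposed by recent SGLang releases; verify against your version, model path is a placeholder):

```shell
# Hybrid attention: one backend for prefill, another for decode,
# with speculative draft/verify following the decode backend.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashinfer \
  --speculative-attention-mode decode
```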
## Backend Selection Guide

### Hopper GPUs (H100/H200)

Use FA3 for both MHA and MLA models. Best overall performance on the SM90 architecture.

### Blackwell GPUs (B200)

Use TRTLLM MLA for MLA models and TRTLLM MHA for MHA models. Optimized for the SM100 architecture.

### Ampere/Ada GPUs (A100/A40)

Use FlashInfer for the best compatibility and performance on older architectures.

### FP4 KV Cache

Use FA4 on Blackwell, FlashMLA on Hopper for MLA, or Triton as a fallback.

### FP8 KV Cache

Use FlashMLA or FA3 on Hopper, TRTLLM on Blackwell, and FlashInfer on Ampere/Ada.

### Long Context

Use Dual Chunk FlashAttention for million-token contexts, or FA3/FlashInfer with sliding window attention.
## Best Practices

- **Let SGLang auto-select**: unless you have specific requirements, let SGLang choose the backend automatically
- **Match page size to backend**: check each backend's page-size requirements (e.g., FA4 requires 128 on Hopper)
- **Consider the KV cache format**: choose a backend that supports your desired KV cache dtype (FP8/FP4/BF16)
- **Test on your workload**: backends can perform differently depending on batch size, sequence length, and model size
- **Monitor for graph breaks**: some backends work better with CUDA graphs than others
