Mini-SGLang integrates multiple high-performance attention backends to optimize inference across different GPU architectures and workload phases. You can choose different backends for prefill and decode phases to maximize efficiency.
Supported Backends
Mini-SGLang supports three attention backends: FlashAttention (fa), FlashInfer (fi), and TensorRT-LLM (trtllm).
Configuration
Use the --attn or --attention-backend flag to specify which backend(s) to use:
# Auto-select optimal backends for your GPU
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn auto
# Use FlashAttention for both prefill and decode
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa
# Use FlashAttention for prefill, FlashInfer for decode
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa,fi
Hybrid Backend Configuration
When you specify two backends separated by a comma, the first backend is used for prefill and the second for decode:
--attn <prefill_backend>,<decode_backend>
Example:
python -m minisgl --model "meta-llama/Llama-3-8B" --attn fa,fi
This configuration uses:
- FlashAttention (fa) for the prefill phase
- FlashInfer (fi) for the decode phase
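The comma syntax can be sketched as a small parser that splits an `--attn` value into per-phase backends. This is a hypothetical helper for illustration, not the actual Mini-SGLang argument parser:

```python
def parse_attn_spec(spec: str) -> tuple[str, str]:
    """Split an --attn value into (prefill_backend, decode_backend).

    A single name is used for both phases; "a,b" uses a for prefill
    and b for decode. Illustrative sketch only.
    """
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected at most two backends, got: {spec!r}")
```

For example, `parse_attn_spec("fa")` applies FlashAttention to both phases, while `parse_attn_spec("fa,fi")` splits prefill and decode across the two backends.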
Backend Details
FlashAttention (fa)
FlashAttention provides highly optimized attention computation through IO-aware algorithms.
Key Features:
- Uses FlashAttention 3 on NVIDIA Hopper GPUs (SM90+)
- Falls back to FlashAttention 2 on older architectures
- Efficient memory usage through tiling
- Requires the sgl-kernel package
Implementation: minisgl.attention.fa.FlashAttentionBackend (source: ~/workspace/source/python/minisgl/attention/fa.py)
Installation:
If you encounter import errors, you may need to install system dependencies:
apt update && apt install libnuma1
FlashInfer (fi)
FlashInfer specializes in efficient decode-phase attention with optional tensor core usage.
Key Features:
- Optimized for decode phase with batched requests
- Configurable tensor core usage based on GQA ratio
- Currently only supports page size = 1
- Uses FlashAttention 2 backend internally
Implementation: minisgl.attention.fi.FlashInferBackend (source: ~/workspace/source/python/minisgl/attention/fi.py)
Tensor Core Usage:
By default, tensor cores are enabled when the GQA ratio (num_qo_heads / num_kv_heads) is at least 4. You can override this with the FLASHINFER_USE_TENSOR_CORES environment variable.
FLASHINFER_USE_TENSOR_CORES=1 python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fi
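The decision rule can be sketched as follows; `use_tensor_cores` is a hypothetical function mirroring the documented behavior, not Mini-SGLang's internal API:

```python
import os


def use_tensor_cores(num_qo_heads: int, num_kv_heads: int) -> bool:
    """Decide whether FlashInfer decode should use tensor cores.

    The FLASHINFER_USE_TENSOR_CORES env var wins if set; otherwise
    tensor cores are enabled when the GQA ratio is at least 4.
    Illustrative sketch of the documented rule.
    """
    override = os.environ.get("FLASHINFER_USE_TENSOR_CORES")
    if override is not None:
        return override == "1"
    return num_qo_heads / num_kv_heads >= 4
```

For instance, a model with 32 query heads and 8 KV heads has a GQA ratio of 4, so tensor cores would be enabled by default.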
TensorRT-LLM (trtllm)
The TensorRT-LLM FMHA backend provides optimized attention through NVIDIA's TensorRT-LLM library.
Key Features:
- Supports both prefill and decode phases
- Integrates with TensorRT-LLM optimizations
- Page size constraint: Only supports page sizes of 16, 32, or 64
Implementation: minisgl.attention.trtllm.TensorRTLLMBackend (source: ~/workspace/source/python/minisgl/attention/trtllm.py)
When using the trtllm backend, any page size other than 16, 32, or 64 will be overridden with a supported value.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn trtllm --page-size 16
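The override behavior can be sketched as snapping a requested page size to the nearest supported value. The exact replacement the backend picks is not specified here; this hypothetical sketch rounds up, capping at 64:

```python
def trtllm_page_size(requested: int) -> int:
    """Clamp a requested page size to the sizes trtllm supports.

    The backend only accepts 16, 32, or 64; this sketch rounds other
    values up to the next supported size (capped at 64). Illustrative
    only; the real override policy may differ.
    """
    supported = (16, 32, 64)
    if requested in supported:
        return requested
    for size in supported:
        if requested <= size:
            return size
    return supported[-1]
```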
Default Backend Selection
When you use --attn auto (the default), Mini-SGLang automatically selects optimal backends based on your GPU architecture:
- NVIDIA Hopper GPUs (SM90+): FlashAttention 3 for prefill, FlashInfer for decode
- Other GPUs: FlashAttention for both prefill and decode
The auto-selection considers:
- GPU compute capability
- Available installed kernels
- Model configuration (GQA ratio, head dimensions)
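The selection rule above can be sketched as a function keyed on compute capability. `auto_select_backends` is hypothetical; the real logic also weighs installed kernels and model configuration:

```python
def auto_select_backends(compute_capability: tuple[int, int]) -> tuple[str, str]:
    """Pick (prefill_backend, decode_backend) from GPU compute capability.

    Hopper-class GPUs (SM90 and newer) get FlashAttention for prefill and
    FlashInfer for decode; everything else uses FlashAttention for both
    phases. Illustrative sketch of the documented defaults.
    """
    major, _minor = compute_capability
    if major >= 9:  # Hopper (SM90) and newer
        return ("fa", "fi")
    return ("fa", "fa")
```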
Page Size Constraints
Different attention backends have different page size requirements:
| Backend | Page Size Support | Notes |
|---|---|---|
| fa (FlashAttention) | Any size | Recommended: 1 for flexibility |
| fi (FlashInfer) | 1 only | Hardcoded constraint |
| trtllm | 16, 32, 64 only | Will override user setting |
Specify page size with the --page-size flag:
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa --page-size 16
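The constraints in the table can be expressed as a simple lookup; this is a hypothetical validation helper, not part of the Mini-SGLang API:

```python
# Allowed page sizes per backend; None means any size is accepted.
PAGE_SIZES = {
    "fa": None,           # any size
    "fi": {1},            # hardcoded to 1
    "trtllm": {16, 32, 64},
}


def page_size_ok(backend: str, page_size: int) -> bool:
    """Check a requested page size against the backend's constraints."""
    allowed = PAGE_SIZES[backend]
    return allowed is None or page_size in allowed
```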
For High Throughput
Use hybrid FlashAttention + FlashInfer configuration:
python -m minisgl \
--model "meta-llama/Llama-3-8B" \
--attn fa,fi \
--page-size 1 \
--max-running-requests 512
For Long Context
FlashAttention is optimized for long sequences:
python -m minisgl \
--model "Qwen/Qwen3-0.6B" \
--attn fa \
--max-seq-len-override 32768
For Low Latency
FlashInfer with tensor cores for decode:
FLASHINFER_USE_TENSOR_CORES=1 python -m minisgl \
--model "Qwen/Qwen3-0.6B" \
--attn fi
Source Code Reference
All attention backends are implemented in ~/workspace/source/python/minisgl/attention/:
- FlashAttention: fa.py:36 (FlashAttentionBackend class)
- FlashInfer: fi.py:80 (FlashInferBackend class)
- TensorRT-LLM: trtllm.py:35 (TensorRTLLMBackend class)
- Backend registry: __init__.py:19 (SUPPORTED_ATTENTION_BACKENDS)
Troubleshooting
Import Errors with FlashAttention
If you see errors importing sgl_kernel.flash_attn, install system dependencies:
apt update && apt install libnuma1
pip install sgl-kernel
Page Size Conflicts
If using FlashInfer (fi), ensure page size is set to 1:
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fi --page-size 1
TensorRT-LLM Page Size
When using trtllm, use supported page sizes:
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn trtllm --page-size 16