Overview
During autoregressive generation, LLMs cache previously computed key-value pairs to avoid redundant computation. The KV cache typically consumes a significant portion of GPU memory, especially for long sequences. Quantized KV cache is a memory optimization that primarily benefits throughput by allowing more tokens to be cached, but may introduce minimal accuracy degradation depending on the quantization format used.

Supported Formats
FP8 Format
OCP (Open Compute Project) specifies two common 8-bit floating-point formats:

E5M2
5 exponent bits, 2 mantissa bits
- Larger dynamic range (±57344.0)
- Lower precision
- Better for values with wide range
E4M3
4 exponent bits, 3 mantissa bits
- Higher precision
- Smaller dynamic range (±448.0)
- Recommended for most use cases
FP4 Format (Experimental)
OCP (Open Compute Project) specifies MXFP4 (Microscaling FP4), a 4-bit floating-point format with E2M1 elements (1 sign bit, 2 exponent bits, 1 mantissa bit):
- Uses block-based microscaling, where tensors are divided into blocks of consecutive elements
- Each block shares a single 8-bit power-of-two (exponent-only) scaling factor
- The OCP spec uses blocks of 32 elements; SGLang currently uses blocks of 16 elements
- Scaling factors are computed dynamically on the fly (no pre-quantization required)
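The block-scaling scheme above can be sketched in plain Python. This is an illustrative model of E2M1 quantization with a shared power-of-two scale per block, not SGLang's actual kernel:

```python
import math

# E2M1 representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to E2M1 with a shared power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: power of two aligning the block max with E2M1's
    # largest magnitude (6.0 = 1.5 * 2**2)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = [
        math.copysign(min(E2M1_VALUES, key=lambda v: abs(v - abs(x) / scale)), x)
        for x in block
    ]
    return scale, quantized

def dequantize_block(scale, quantized):
    return [scale * q for q in quantized]
```

Values that fall between representable E2M1 points are rounded to the nearest one, so precision within a block depends on how widely its elements are spread.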
Usage
Enabling Quantized KV Cache
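A minimal launch sketch, assuming SGLang's `--kv-cache-dtype` server flag; the model path is a placeholder:

```shell
# Launch the server with an FP8 E4M3 quantized KV cache
# (model path is a placeholder; other flags follow SGLang's server CLI)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8_e4m3
```

Use `fp8_e5m2` or `fp4_e2m1` in place of `fp8_e4m3` to select the other formats described above.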
Scaling Factors
FP8 quantization requires scaling factors to properly quantize and dequantize the KV cache. Currently, only per-tensor (scalar) scaling factors are supported. They can be obtained in two ways:
- Loaded from checkpoints: Pre-quantized models (e.g., ModelOpt) may include `k_scale` and `v_scale` parameters that are loaded automatically
- Provided via JSON: Supply scaling factors through `--quantization-param-path`
Performance Considerations
Memory Savings
Quantized KV cache provides significant memory savings:

| Format | Tokens Supported (vs BF16) |
|---|---|
| BF16 | 1.00× (baseline) |
| FP8 | ~2.00× |
| FP4 | ~3.56× |
FP4 quantization requires additional memory for its block-based scaling factors (one 8-bit scale per 16-element block), which reduces the effective savings below the raw 4× bit-width reduction; FP8's per-tensor scales add negligible overhead. The ratios above account for this. These savings enable:
- Longer context lengths within the same memory budget
- More concurrent requests for improved throughput
- Better GPU utilization by reducing KV cache memory pressure
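The ratios in the table can be checked with back-of-the-envelope arithmetic, assuming SGLang's 16-element FP4 blocks with one 8-bit scale each and negligible per-tensor FP8 scale overhead:

```python
# Effective bits per cached element (assumption: 16-element FP4 blocks,
# each sharing one 8-bit scale; FP8 per-tensor scales amortize to ~0)
bf16_bits = 16.0
fp8_bits = 8.0
fp4_bits = 4.0 + 8.0 / 16.0  # 4-bit value + amortized block scale

print(round(bf16_bits / fp8_bits, 2))  # 2.0
print(round(bf16_bits / fp4_bits, 2))  # 3.56
```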
Accuracy Impact
FP8 Accuracy
FP8 E4M3 quantization typically introduces minimal accuracy degradation. The impact depends on:
- Model architecture
- Sequence length
- Quantization format (E4M3 generally has better accuracy than E5M2)
FP4 Accuracy
FP4 (MXFP4) quantization provides significant memory savings, with accuracy impact that varies by model size and dataset complexity.

Large Models (200B+ parameters)

On large-scale models, FP4 maintains accuracy close to FP8/BF16, especially on simpler datasets:

| Model | Dataset | BF16 | FP8 E4M3 | FP4 E2M1 |
|---|---|---|---|---|
| Qwen3-235B-A22B | gsm8k | 0.9168 | 0.9181 | 0.9186 |
| Qwen3-235B-A22B | aime25 | 0.7733 | 0.7333 | 0.6000 |
| Qwen3-235B-A22B | gpqa_diamond | 0.7010 | 0.6899 | 0.6778 |
| DeepSeek-R1-0528 | gsm8k | 0.9157 | 0.9154 | 0.9124 |
| DeepSeek-R1-0528 | aime25 | 0.5067 | 0.4934 | 0.4000 |
| DeepSeek-R1-0528 | gpqa_diamond | 0.7707 | 0.7697 | 0.7273 |
By contrast, on smaller models FP4 degrades more noticeably on complex reasoning datasets:

| Model | Dataset | BF16 | FP8 E4M3 | FP4 E2M1 |
|---|---|---|---|---|
| GPT-OSS-120B | gsm8k | 0.9161 | 0.9163 | 0.9152 |
| GPT-OSS-120B | aime25 | 0.7533 | 0.7667 | 0.3533 |
| GPT-OSS-120B | gpqa_diamond | 0.5081 | 0.5434 | 0.3202 |
Key observations:
- Simple datasets (e.g., gsm8k): FP4 maintains accuracy close to FP8/BF16 across model sizes
- Model size matters: Large models (200B+ parameters) generally tolerate FP4 quantization better than smaller models
- Context length: Accuracy degradation may be more pronounced in long-context scenarios due to accumulation of quantization error
Backend Compatibility
Not all attention backends support quantized KV cache. Refer to the support matrix below.

MHA Backends
| Backend | FP8 KV Cache | FP4 KV Cache |
|---|---|---|
| FlashInfer | ✅ | ❌ |
| FA3 (FlashAttention 3) | ✅ | ❌ |
| FA4 (FlashAttention 4) | ❌ | ✅ |
| Triton | ❌ | ✅ |
| Torch Native (SDPA) | ✅ | ✅ |
| TRTLLM MHA | ✅ | ✅ |
| AITER (ROCm) | ✅ | ❌ |
MLA Backends
| Backend | FP8 KV Cache | FP4 KV Cache |
|---|---|---|
| FlashInfer MLA | ❌ | ✅ |
| FlashMLA | ✅ | ✅ |
| Cutlass MLA | ✅ | ✅ |
| TRTLLM MLA (Blackwell) | ✅ | ✅ |
| FA3 (FlashAttention 3) | ❌ | ❌ |
| FA4 | ❌ | ✅ |
Examples
DeepSeek-R1 with FP8 KV Cache
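A launch sketch for DeepSeek-R1 with FP8 KV cache; the tensor-parallel size is an assumption for illustration:

```shell
# FP8 E4M3 KV cache on DeepSeek-R1 (MLA model); tp size is illustrative
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp-size 8 \
  --kv-cache-dtype fp8_e4m3
```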
Qwen3-235B with FP4 KV Cache
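A launch sketch for Qwen3-235B with FP4 KV cache. The Triton backend is chosen here because it supports FP4 in the MHA matrix above; the tp size is an assumption:

```shell
# Experimental FP4 E2M1 KV cache; backend must support FP4 (see matrix)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B \
  --tp-size 8 \
  --kv-cache-dtype fp4_e2m1 \
  --attention-backend triton
```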
Pre-quantized Model with Custom Scaling Factors
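A sketch of supplying custom scaling factors via `--quantization-param-path`; both paths are placeholders, and the JSON schema must match what your SGLang version expects:

```shell
# Supply per-tensor k_scale/v_scale values from a JSON file
# (paths are placeholders)
python -m sglang.launch_server \
  --model-path /path/to/model \
  --kv-cache-dtype fp8_e4m3 \
  --quantization-param-path /path/to/kv_cache_scales.json
```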
Best Practices
Use Pre-quantized Models
Prefer models quantized offline with scaling factors included in the checkpoint for best accuracy.
Choose the Right Format
Use `fp8_e4m3` for better accuracy (recommended), `fp8_e5m2` for larger dynamic range, or `fp4_e2m1` for maximum memory savings (experimental).

Check Backend Compatibility
Verify that your chosen attention backend supports quantized KV cache with fused dequantization.
Evaluate Accuracy
Test FP4/FP8 accuracy on your specific workload before production deployment, especially for complex reasoning tasks.
Troubleshooting
Performance Degradation
If quantized KV cache degrades performance:
- Check backend support: Verify your attention backend supports quantized KV cache with fused dequantization
- Try different formats: FP8 may perform better than FP4 on some backends
- Monitor memory bandwidth: Quantization reduces memory footprint and bandwidth but adds dequantization compute overhead
Accuracy Issues
If you observe accuracy degradation:
- Verify scaling factors: Ensure scaling factors are properly loaded or provided
- Try FP8 E4M3: Switch from FP4 or E5M2 to E4M3 for better accuracy
- Evaluate on your dataset: Test on representative samples before full deployment
- Consider model size: Smaller models may require higher precision
Missing Scaling Factors
If scaling factors default to 1.0:
- Check checkpoint: Verify the model includes `k_scale` and `v_scale` parameters
- Provide a JSON file: Use `--quantization-param-path` to supply custom scaling factors
- Use pre-quantized models: Download models from Unsloth, NVIDIA ModelOpt, or NeuralMagic collections
