Overview
KV cache quantization compresses the key-value attention cache from FP16/BF16 to Int8, significantly reducing memory usage during inference. This enables:

- Larger batch sizes - Process more requests simultaneously
- Longer sequences - Generate longer text without OOM errors
- Higher throughput - Serve more users with the same hardware
KV cache quantization is complementary to GPTQ weight quantization. You can use both together for maximum memory efficiency.
How It Works
During inference, Qwen stores past key-value pairs in the attention cache to avoid recomputing them. This cache grows with sequence length and batch size, consuming significant GPU memory. KV cache quantization:

- Converts FP16 keys/values to Int8
- Stores quantization parameters (scale, zero point)
- Dequantizes on-the-fly during attention computation
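The three steps above can be sketched in NumPy. This is an illustrative per-tensor asymmetric Int8 scheme, not Qwen's actual CUDA kernel; the names quantize_kv and dequantize_kv are hypothetical, and the real implementation runs per-layer on the GPU:

```python
import numpy as np

def quantize_kv(x):
    """Per-tensor asymmetric Int8 quantization (illustrative sketch).
    Returns the Int8 array plus the scale and zero point needed to invert it."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = qmin - round(x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_kv(q, scale, zero_point):
    """Recover an approximate float tensor on the fly during attention."""
    return (q.astype(np.float32) - zero_point) * scale

k = np.random.randn(8, 64).astype(np.float32)  # toy (seq_len, head_dim) key slice
q, s, z = quantize_kv(k)
round_trip_err = np.abs(k - dequantize_kv(q, s, z)).max()  # on the order of the scale
```

The Int8 tensor needs half the storage of FP16, at the cost of storing one scale and zero point per tensor and a small, bounded rounding error.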
Enabling KV Cache Quantization
Basic Usage
Enable KV cache quantization when loading the model.

Configuration Parameters
| Parameter | Description | Required |
|---|---|---|
| use_cache_quantization | Enable Int8 KV cache quantization | Yes |
| use_cache_kernel | Use optimized CUDA kernel for quantization | Recommended |
| use_flash_attn | Must be False when using KV cache quantization | Yes |
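Putting the three parameters together, a typical loading call looks like the sketch below. It follows the pattern used in the Qwen repository (trust_remote_code is required because KV cache quantization lives in the model's custom modeling code); treat it as a configuration example, since running it downloads the 7B checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,   # Int8 KV cache (required)
    use_cache_kernel=True,         # optimized CUDA kernel (recommended)
    use_flash_attn=False,          # must be disabled with KV cache quantization
).eval()
```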
Required Files
KV cache quantization requires CUDA kernel files:

- cache_autogptq_cuda_256.cpp
- cache_autogptq_cuda_kernel_256.cu
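A small helper can verify these files are present before loading the model. missing_kernel_files is a hypothetical utility for illustration, not part of the Qwen codebase:

```python
from pathlib import Path

REQUIRED_KERNEL_FILES = (
    "cache_autogptq_cuda_256.cpp",
    "cache_autogptq_cuda_kernel_256.cu",
)

def missing_kernel_files(model_dir):
    """Return the required kernel files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_KERNEL_FILES if not (root / name).is_file()]

# Example (hypothetical local checkpoint path):
# missing = missing_kernel_files("checkpoints/Qwen-7B-Chat")
# if missing:
#     raise FileNotFoundError(f"Download these into the model directory: {missing}")
```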
Performance Benefits
Memory Usage by Batch Size
KV cache quantization enables processing much larger batches. Qwen-7B on A100-80GB (generating 1024 tokens):

| Batch Size | Without KV Cache Quant | With KV Cache Quant | Memory Saved |
|---|---|---|---|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB (29%) |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB (30%) |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB (38%) |
| 64 | OOM | 48.2 GB | Enables 2x batch |
| 100 | OOM | 72.4 GB | Enables 3x batch |
Memory Usage by Sequence Length
KV cache quantization becomes more valuable with longer sequences. Qwen-7B on A100-80GB (batch size 1):

| Sequence Length | Without KV Cache Quant | With KV Cache Quant | Memory Saved |
|---|---|---|---|
| 512 tokens | 15.2 GB | 15.0 GB | 0.2 GB (1%) |
| 1024 tokens | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 2048 tokens | 17.6 GB | 15.8 GB | 1.8 GB (10%) |
| 4096 tokens | 19.5 GB | 16.6 GB | 2.9 GB (15%) |
| 8192 tokens | 23.2 GB | 17.6 GB | 5.6 GB (24%) |
Memory savings scale with sequence length. At 8K tokens, KV cache quantization reduces memory by 24%.
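This scaling can be reproduced to first order with a back-of-the-envelope estimate. The sketch below assumes Qwen-7B's public dimensions (32 layers, hidden size 4096) and counts only the K and V tensors; the measured totals in the tables above are higher because they also include model weights and activations:

```python
def kv_cache_bytes(batch, seq_len, n_layers=32, hidden=4096, bytes_per_elem=2):
    """First-order KV cache size: one K and one V tensor per layer,
    each of shape (batch, seq_len, hidden)."""
    return 2 * n_layers * batch * seq_len * hidden * bytes_per_elem

GIB = 2 ** 30
fp16_cache = kv_cache_bytes(1, 8192, bytes_per_elem=2) / GIB  # 4.0 GiB at FP16/BF16
int8_cache = kv_cache_bytes(1, 8192, bytes_per_elem=1) / GIB  # 2.0 GiB at Int8
```

At short sequences the cache is a small fraction of total memory, which is why the savings in the 512-token row are marginal; the cache term dominates as batch size and sequence length grow.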
Test Environment
Hardware: A100-SXM4-80G GPU
Software: PyTorch 2.0.1, CUDA 11.4
Model: Qwen-7B-Chat (BF16)
Quality Impact
KV cache quantization has minimal impact on model quality. Downstream evaluation shows no significant degradation across benchmarks:

| Benchmark | Without KV Cache Quant | With KV Cache Quant | Difference |
|---|---|---|---|
| MMLU | 55.8 | 55.7 | -0.1 |
| C-Eval | 59.7 | 59.6 | -0.1 |
| GSM8K | 50.3 | 50.2 | -0.1 |
| HumanEval | 37.2 | 37.0 | -0.2 |
Combining with GPTQ
For maximum memory efficiency, combine KV cache quantization with GPTQ weight quantization.

Combined Memory Usage

Qwen-7B generating 2048 tokens:

| Configuration | GPU Memory | vs BF16 |
|---|---|---|
| BF16 baseline | 16.99 GB | - |
| Int4 only | 8.21 GB | -52% |
| BF16 + KV cache quant | 15.5 GB | -9% |
| Int4 + KV cache quant | ~7.5 GB | -56% |
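The last row corresponds to loading the published Int4 GPTQ checkpoint with KV cache quantization enabled. A sketch of that combination, assuming the auto-gptq package is installed (running it downloads the checkpoint):

```python
from transformers import AutoModelForCausalLM

# Int4 GPTQ weights + Int8 KV cache: the same three KV cache flags
# apply to the quantized checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False,
).eval()
```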
Implementation Details
Quantization Process
The quantization process for each key/value tensor computes a scale and zero point, rounds the FP16 values to Int8, and stores the Int8 tensor alongside its quantization parameters.

CUDA Kernel Optimization

When use_cache_kernel=True, optimized CUDA kernels accelerate quantization and dequantization operations:
- Faster quantization: ~2x speedup vs PyTorch implementation
- Fused operations: Combine quantization with attention computation
- Reduced memory transfers: Minimize GPU-CPU data movement
The CUDA kernel provides significant speedup but requires the .cpp and .cu files mentioned earlier.

Use Cases
- Large Batch Processing
- Long-Context Generation
- Resource-Constrained Deployment
Scenario: Serving multiple users simultaneously

Benefits:
- 2-3x larger batch sizes
- Higher throughput per GPU
- Better hardware utilization
Troubleshooting
FileNotFoundError: cache_autogptq_cuda_256.cpp not found
Problem: CUDA kernel files are missing from the model directory.

Solution: Manually download the kernel files (cache_autogptq_cuda_256.cpp and cache_autogptq_cuda_kernel_256.cu) and place them in the model directory.
Model still using flash attention despite use_flash_attn=False
Problem: Configuration not being applied correctly.

Solution: Ensure all three parameters are set: use_cache_quantization=True, use_cache_kernel=True, and use_flash_attn=False.
RuntimeError: CUDA error during quantization
Problem: CUDA kernel compilation failed or incompatible CUDA version.

Solution:
- Check CUDA version compatibility (CUDA 11.4+ recommended)
- Try without the kernel: use_cache_kernel=False
- Verify kernel files are present and not corrupted
Slower inference than expected
Problem: Running without the optimized CUDA kernel.

Solution: Ensure use_cache_kernel=True and that the kernel files are present. Without the kernel, the PyTorch fallback is slower.

OOM error even with KV cache quantization
Problem: Batch size or sequence length still too large.

Solution:
- Combine with GPTQ quantization (Int4/Int8)
- Further reduce batch size
- Use gradient checkpointing during training
- Consider a smaller model variant
Best Practices
Download kernel files first
Before enabling KV cache quantization, ensure the CUDA kernel files are in your model directory.
Combine with GPTQ for maximum savings
Use Int4 GPTQ models with KV cache quantization for the best memory efficiency.
Enable for batch inference
KV cache quantization provides the most benefit when processing multiple requests simultaneously.
Profile your workload
Measure memory usage and throughput with and without KV cache quantization to quantify benefits for your specific use case.
Next Steps
GPTQ Quantization
Learn about GPTQ weight quantization to combine with KV cache
Performance Benchmarks
Detailed performance analysis and optimization strategies
Deployment Guide
Deploy quantized models in production environments
Quantization Overview
Return to quantization overview