
Overview

KV cache quantization compresses the key-value attention cache from FP16/BF16 to Int8, significantly reducing memory usage during inference. This enables:
  • Larger batch sizes - Process more requests simultaneously
  • Longer sequences - Generate longer text without OOM errors
  • Higher throughput - Serve more users with the same hardware
KV cache quantization is complementary to GPTQ weight quantization. You can use both together for maximum memory efficiency.

How It Works

During inference, Qwen stores past key-value pairs in the attention cache to avoid recomputing them. This cache grows with sequence length and batch size, consuming significant GPU memory. KV cache quantization:
  1. Converts FP16 keys/values to Int8
  2. Stores quantization parameters (scale, zero point)
  3. Dequantizes on-the-fly during attention computation
# Original format
layer_past = (key, value)  # FP16/BF16

# Quantized format
layer_past = (
    (q_key, key_scale, key_zero_point),
    (q_value, value_scale, value_zero_point)
)  # Int8 + metadata
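To make the scale and zero point concrete: asymmetric Int8 quantization maps a tensor's value range onto the integers 0-255 and records the two parameters needed to invert the mapping. The pure-Python sketch below is illustrative only (the actual implementation operates on GPU tensors); the function names are made up for this example:

```python
def quantize_int8(values):
    # Asymmetric per-tensor quantization: map [min, max] onto [0, 255]
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi != lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Invert the mapping; the result differs from the original by at most ~scale/2
    return [(x - zero_point) * scale for x in q]

q, scale, zp = quantize_int8([-1.5, 0.0, 0.75, 2.3])
restored = dequantize_int8(q, scale, zp)
```

The reconstruction error is bounded by the scale, which is why the quality impact reported below is so small.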

Enabling KV Cache Quantization

Basic Usage

Enable KV cache quantization when loading the model:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Enable KV cache quantization
    use_cache_kernel=True,         # Use optimized CUDA kernel
    use_flash_attn=False           # Must disable flash attention
).eval()

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
**Important**: KV cache quantization is incompatible with Flash Attention. When you enable `use_cache_quantization=True`, Flash Attention is automatically disabled even if `use_flash_attn=True`.

Configuration Parameters

| Parameter | Description | Required |
|---|---|---|
| `use_cache_quantization` | Enable Int8 KV cache quantization | Yes |
| `use_cache_kernel` | Use the optimized CUDA kernel for quantization | Recommended |
| `use_flash_attn` | Must be `False` when using KV cache quantization | Yes |

Required Files

KV cache quantization requires CUDA kernel files:
  • cache_autogptq_cuda_256.cpp
  • cache_autogptq_cuda_kernel_256.cu
**HuggingFace Limitation**: Due to HuggingFace's internal mechanisms, these files may be missing from downloaded models.

**Solution**: Manually download them from the model's HuggingFace repository and place them in the same directory as the other model files.
# Example: Download kernel files for Qwen-7B-Chat
cd /path/to/model/directory
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_kernel_256.cu

Performance Benefits

Memory Usage by Batch Size

KV cache quantization enables processing much larger batches. Measurements below are for Qwen-7B on an A100-80GB GPU, generating 1024 tokens:

| Batch Size | Without Quantization | With Quantization | Memory Saved |
|---|---|---|---|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB (29%) |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB (30%) |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB (38%) |
| 64 | OOM | 48.2 GB | Enables 2x batch |
| 100 | OOM | 72.4 GB | Enables 3x batch |

**Key Insight**: Memory savings grow with batch size, enabling 2-3x larger batches without OOM errors.

Memory Usage by Sequence Length

KV cache quantization becomes more valuable with longer sequences. Measurements below are for Qwen-7B on an A100-80GB GPU at batch size 1:

| Sequence Length | Without Quantization | With Quantization | Memory Saved |
|---|---|---|---|
| 512 tokens | 15.2 GB | 15.0 GB | 0.2 GB (1%) |
| 1024 tokens | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 2048 tokens | 17.6 GB | 15.8 GB | 1.8 GB (10%) |
| 4096 tokens | 19.5 GB | 16.6 GB | 2.9 GB (15%) |
| 8192 tokens | 23.2 GB | 17.6 GB | 5.6 GB (24%) |

Memory savings scale with sequence length. At 8K tokens, KV cache quantization reduces memory usage by 24%.
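The scaling is easy to see from a back-of-envelope estimate: the cache stores one key and one value vector per layer per token, so its size is 2 × layers × hidden size × sequence length × batch size × bytes per element. A sketch assuming Qwen-7B's published dimensions (32 layers, hidden size 4096); real memory usage also includes weights and activations, so these figures differ from the table above:

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem):
    # One key tensor and one value tensor per layer, each [batch, seq_len, hidden_size]
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

GIB = 1024 ** 3
fp16_gib = kv_cache_bytes(32, 4096, 8192, 1, 2) / GIB  # FP16/BF16: 2 bytes per element
int8_gib = kv_cache_bytes(32, 4096, 8192, 1, 1) / GIB  # Int8: 1 byte (plus small scale/zero-point overhead)
print(f"FP16 cache: {fp16_gib:.1f} GiB, Int8 cache: {int8_gib:.1f} GiB")
```

At 8K tokens the cache itself roughly halves, which dominates the savings as sequences grow.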

Test Environment

Hardware: A100-SXM4-80G GPU
Software: PyTorch 2.0.1, CUDA 11.4
Model: Qwen-7B-Chat (BF16)

Quality Impact

KV cache quantization has minimal impact on model quality. Downstream evaluation shows no significant performance degradation across benchmarks:
| Benchmark | Without Quantization | With Quantization | Difference |
|---|---|---|---|
| MMLU | 55.8 | 55.7 | -0.1 |
| C-Eval | 59.7 | 59.6 | -0.1 |
| GSM8K | 50.3 | 50.2 | -0.1 |
| HumanEval | 37.2 | 37.0 | -0.2 |

Quality degradation is negligible (under 0.2 points on every benchmark), making KV cache quantization safe for production use.

Combining with GPTQ

For maximum memory efficiency, combine KV cache quantization with GPTQ weight quantization:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",  # Use Int4 quantized model
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Also quantize KV cache
    use_cache_kernel=True,
    use_flash_attn=False
).eval()

Combined Memory Usage

Qwen-7B generating 2048 tokens:
| Configuration | GPU Memory | vs BF16 |
|---|---|---|
| BF16 baseline | 16.99 GB | - |
| Int4 only | 8.21 GB | -52% |
| BF16 + KV cache quant | 15.5 GB | -9% |
| Int4 + KV cache quant | ~7.5 GB | -56% |
Combining Int4 GPTQ with KV cache quantization can reduce memory usage by over 55% compared to BF16.
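The headline figure follows directly from the table's numbers (the ~7.5 GB combined entry is an approximation):

```python
baseline_gb = 16.99  # BF16 baseline from the table above
combined_gb = 7.5    # Int4 weights + Int8 KV cache (approximate)
savings_pct = (baseline_gb - combined_gb) / baseline_gb * 100
print(f"~{savings_pct:.0f}% less memory than BF16")
```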

Implementation Details

Quantization Process

The quantization process for each key/value tensor:
# 1. Quantize keys and values (FP16 → Int8), recording scale and zero point
q_key, key_scale, key_zero_point = quantize_cache_v(key)
q_value, value_scale, value_zero_point = quantize_cache_v(value)

# 2. Store the quantized cache together with its metadata
layer_past = (
    (q_key, key_scale, key_zero_point),
    (q_value, value_scale, value_zero_point)
)

# 3. Dequantize on-the-fly during attention (Int8 → FP16)
value = dequantize_cache_torch(q_value, value_scale, value_zero_point)

CUDA Kernel Optimization

When use_cache_kernel=True, optimized CUDA kernels accelerate quantization and dequantization operations:
  • Faster quantization: ~2x speedup vs PyTorch implementation
  • Fused operations: Combine quantization with attention computation
  • Reduced memory transfers: Minimize GPU-CPU data movement
The CUDA kernel provides significant speedup but requires the .cpp and .cu files mentioned earlier.

Use Cases

Scenario: Serving multiple users simultaneously
# Process 64 requests in a single batch
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False
).eval()

# batch_prompts = [prompt1, prompt2, ..., prompt64]
# Tokenize the batch and generate (tokenizer loaded as in the earlier examples)
inputs = tokenizer(batch_prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
Benefits:
  • 2-3x larger batch sizes
  • Higher throughput per GPU
  • Better hardware utilization

Troubleshooting

**Problem**: CUDA kernel files are missing from the model directory.

**Solution**: Manually download the kernel files:
cd /path/to/model
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_kernel_256.cu
**Problem**: The configuration is not being applied correctly.

**Solution**: Ensure all three parameters are set:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # Explicitly disable
)
**Problem**: CUDA kernel compilation fails or the CUDA version is incompatible.

**Solution**:
  1. Check CUDA version compatibility (CUDA 11.4+ recommended)
  2. Try without the kernel: use_cache_kernel=False
  3. Verify kernel files are present and not corrupted
**Problem**: Quantization runs on the slow PyTorch fallback instead of the optimized CUDA kernel.

**Solution**: Ensure `use_cache_kernel=True` and that the kernel files are present. Without the optimized kernel, the PyTorch fallback is noticeably slower.
**Problem**: The batch size or sequence length is still too large for available memory.

**Solution**:
  1. Combine with GPTQ quantization (Int4/Int8)
  2. Further reduce batch size
  3. Use gradient checkpointing during training
  4. Consider a smaller model variant

Best Practices

1. **Download kernel files first.** Before enabling KV cache quantization, ensure the CUDA kernel files are in your model directory.
2. **Combine with GPTQ for maximum savings.** Use Int4 GPTQ models with KV cache quantization for the best memory efficiency.
3. **Enable for batch inference.** KV cache quantization provides the most benefit when processing multiple requests simultaneously.
4. **Profile your workload.** Measure memory usage and throughput with and without KV cache quantization to quantify the benefit for your specific use case.
5. **Use the optimized kernel.** Always set `use_cache_kernel=True` for best performance (requires the CUDA kernel files).

Limitations

Current Limitations:
  1. Incompatible with Flash Attention - Cannot use both simultaneously
  2. Requires CUDA kernel files - May need manual download from HuggingFace
  3. CUDA only - CPU inference not supported
  4. Compilation required - First run compiles CUDA kernels (may take a few minutes)

Next Steps

GPTQ Quantization

Learn about GPTQ weight quantization to combine with KV cache

Performance Benchmarks

Detailed performance analysis and optimization strategies

Deployment Guide

Deploy quantized models in production environments

Quantization Overview

Return to quantization overview
