
Overview

KV cache quantization compresses the key-value attention cache from FP16/BF16 to Int8, significantly reducing memory usage during inference. This enables:
  • Larger batch sizes - Process more requests simultaneously
  • Longer sequences - Generate longer text without OOM errors
  • Higher throughput - Serve more users with the same hardware
KV cache quantization is complementary to GPTQ weight quantization. You can use both together for maximum memory efficiency.

How It Works

During inference, Qwen stores past key-value pairs in the attention cache to avoid recomputing them. This cache grows with sequence length and batch size, consuming significant GPU memory. KV cache quantization:
  1. Converts FP16 keys/values to Int8
  2. Stores quantization parameters (scale, zero point)
  3. Dequantizes on-the-fly during attention computation
# Original format
layer_past = (key, value)  # FP16/BF16

# Quantized format
layer_past = (
    (q_key, key_scale, key_zero_point),
    (q_value, value_scale, value_zero_point)
)  # Int8 + metadata
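To make the scale and zero point concrete: asymmetric Int8 quantization maps a tensor's value range onto the integers 0-255 and records the two parameters needed to invert the mapping. The pure-Python sketch below is illustrative only (the actual implementation operates on GPU tensors); the function names are made up for this example:

```python
def quantize_int8(values):
    # Asymmetric per-tensor quantization: map [min, max] onto [0, 255]
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi != lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Invert the mapping; the result differs from the original by at most ~scale/2
    return [(x - zero_point) * scale for x in q]

q, scale, zp = quantize_int8([-1.5, 0.0, 0.75, 2.3])
restored = dequantize_int8(q, scale, zp)
```

The reconstruction error is bounded by the scale, which is why the quality impact reported below is so small.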

Enabling KV Cache Quantization

Basic Usage

Enable KV cache quantization when loading the model:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Enable KV cache quantization
    use_cache_kernel=True,         # Use optimized CUDA kernel
    use_flash_attn=False           # Must disable flash attention
).eval()

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
**Important**: KV cache quantization is incompatible with Flash Attention. When you enable `use_cache_quantization=True`, Flash Attention is automatically disabled even if `use_flash_attn=True`.

Configuration Parameters

| Parameter | Description | Required |
|---|---|---|
| `use_cache_quantization` | Enable Int8 KV cache quantization | Yes |
| `use_cache_kernel` | Use the optimized CUDA kernel for quantization | Recommended |
| `use_flash_attn` | Must be `False` when using KV cache quantization | Yes |

Required Files

KV cache quantization requires CUDA kernel files:
  • cache_autogptq_cuda_256.cpp
  • cache_autogptq_cuda_kernel_256.cu
**HuggingFace Limitation**: Due to HuggingFace's internal mechanisms, these files may be missing from downloaded models.

**Solution**: Manually download them from the model's HuggingFace repository and place them in the same directory as the other model files.
# Example: Download kernel files for Qwen-7B-Chat
cd /path/to/model/directory
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_kernel_256.cu

Performance Benefits

Memory Usage by Batch Size

KV cache quantization enables processing much larger batches. Measurements below are for Qwen-7B on an A100-80GB GPU, generating 1024 tokens:

| Batch Size | Without Quantization | With Quantization | Memory Saved |
|---|---|---|---|
| 1 | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 4 | 24.1 GB | 17.2 GB | 6.9 GB (29%) |
| 16 | 31.7 GB | 22.3 GB | 9.4 GB (30%) |
| 32 | 48.7 GB | 30.2 GB | 18.5 GB (38%) |
| 64 | OOM | 48.2 GB | Enables 2x batch |
| 100 | OOM | 72.4 GB | Enables 3x batch |

**Key Insight**: Memory savings grow with batch size, enabling 2-3x larger batches without OOM errors.

Memory Usage by Sequence Length

KV cache quantization becomes more valuable with longer sequences. Measurements below are for Qwen-7B on an A100-80GB GPU at batch size 1:

| Sequence Length | Without Quantization | With Quantization | Memory Saved |
|---|---|---|---|
| 512 tokens | 15.2 GB | 15.0 GB | 0.2 GB (1%) |
| 1024 tokens | 16.3 GB | 15.5 GB | 0.8 GB (5%) |
| 2048 tokens | 17.6 GB | 15.8 GB | 1.8 GB (10%) |
| 4096 tokens | 19.5 GB | 16.6 GB | 2.9 GB (15%) |
| 8192 tokens | 23.2 GB | 17.6 GB | 5.6 GB (24%) |

Memory savings scale with sequence length. At 8K tokens, KV cache quantization reduces memory usage by 24%.
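The scaling is easy to see from a back-of-envelope estimate: the cache stores one key and one value vector per layer per token, so its size is 2 × layers × hidden size × sequence length × batch size × bytes per element. A sketch assuming Qwen-7B's published dimensions (32 layers, hidden size 4096); real memory usage also includes weights and activations, so these figures differ from the table above:

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem):
    # One key tensor and one value tensor per layer, each [batch, seq_len, hidden_size]
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

GIB = 1024 ** 3
fp16_gib = kv_cache_bytes(32, 4096, 8192, 1, 2) / GIB  # FP16/BF16: 2 bytes per element
int8_gib = kv_cache_bytes(32, 4096, 8192, 1, 1) / GIB  # Int8: 1 byte (plus small scale/zero-point overhead)
print(f"FP16 cache: {fp16_gib:.1f} GiB, Int8 cache: {int8_gib:.1f} GiB")
```

At 8K tokens the cache itself roughly halves, which dominates the savings as sequences grow.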

Test Environment

Hardware: A100-SXM4-80G GPU
Software: PyTorch 2.0.1, CUDA 11.4
Model: Qwen-7B-Chat (BF16)

Quality Impact

KV cache quantization has minimal impact on model quality. Downstream evaluation shows no significant performance degradation across benchmarks:
| Benchmark | Without Quantization | With Quantization | Difference |
|---|---|---|---|
| MMLU | 55.8 | 55.7 | -0.1 |
| C-Eval | 59.7 | 59.6 | -0.1 |
| GSM8K | 50.3 | 50.2 | -0.1 |
| HumanEval | 37.2 | 37.0 | -0.2 |

Quality degradation is negligible (under 0.2 points on every benchmark), making KV cache quantization safe for production use.

Combining with GPTQ

For maximum memory efficiency, combine KV cache quantization with GPTQ weight quantization:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",  # Use Int4 quantized model
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,  # Also quantize KV cache
    use_cache_kernel=True,
    use_flash_attn=False
).eval()

Combined Memory Usage

Qwen-7B generating 2048 tokens:
| Configuration | GPU Memory | vs BF16 |
|---|---|---|
| BF16 baseline | 16.99 GB | - |
| Int4 only | 8.21 GB | -52% |
| BF16 + KV cache quant | 15.5 GB | -9% |
| Int4 + KV cache quant | ~7.5 GB | -56% |
Combining Int4 GPTQ with KV cache quantization can reduce memory usage by over 55% compared to BF16.
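The headline figure follows directly from the table's numbers (the ~7.5 GB combined entry is an approximation):

```python
baseline_gb = 16.99  # BF16 baseline from the table above
combined_gb = 7.5    # Int4 weights + Int8 KV cache (approximate)
savings_pct = (baseline_gb - combined_gb) / baseline_gb * 100
print(f"~{savings_pct:.0f}% less memory than BF16")
```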

Implementation Details

Quantization Process

The quantization process for each key/value tensor:
# 1. Quantize keys and values (FP16 → Int8), recording scale and zero point
q_key, key_scale, key_zero_point = quantize_cache_v(key)
q_value, value_scale, value_zero_point = quantize_cache_v(value)

# 2. Store the quantized cache together with its metadata
layer_past = (
    (q_key, key_scale, key_zero_point),
    (q_value, value_scale, value_zero_point)
)

# 3. Dequantize on-the-fly during attention (Int8 → FP16)
value = dequantize_cache_torch(q_value, value_scale, value_zero_point)

CUDA Kernel Optimization

When use_cache_kernel=True, optimized CUDA kernels accelerate quantization and dequantization operations:
  • Faster quantization: ~2x speedup vs PyTorch implementation
  • Fused operations: Combine quantization with attention computation
  • Reduced memory transfers: Minimize GPU-CPU data movement
The CUDA kernel provides significant speedup but requires the .cpp and .cu files mentioned earlier.

Use Cases

Scenario: Serving multiple users simultaneously
# Process 64 requests in a single batch
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False
).eval()

# batch_prompts = [prompt1, prompt2, ..., prompt64]
# Tokenize the batch and generate (tokenizer loaded as in the earlier examples)
inputs = tokenizer(batch_prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
Benefits:
  • 2-3x larger batch sizes
  • Higher throughput per GPU
  • Better hardware utilization

Troubleshooting

**Problem**: CUDA kernel files are missing from the model directory.

**Solution**: Manually download the kernel files:
cd /path/to/model
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_256.cpp
wget https://huggingface.co/Qwen/Qwen-7B-Chat/raw/main/cache_autogptq_cuda_kernel_256.cu
**Problem**: The configuration is not being applied correctly.

**Solution**: Ensure all three parameters are set:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # Explicitly disable
)
**Problem**: CUDA kernel compilation fails or the CUDA version is incompatible.

**Solution**:
  1. Check CUDA version compatibility (CUDA 11.4+ recommended)
  2. Try without the kernel: use_cache_kernel=False
  3. Verify kernel files are present and not corrupted
**Problem**: Quantization runs on the slow PyTorch fallback instead of the optimized CUDA kernel.

**Solution**: Ensure `use_cache_kernel=True` and that the kernel files are present. Without the optimized kernel, the PyTorch fallback is noticeably slower.
**Problem**: The batch size or sequence length is still too large for available memory.

**Solution**:
  1. Combine with GPTQ quantization (Int4/Int8)
  2. Further reduce batch size
  3. Use gradient checkpointing during training
  4. Consider a smaller model variant

Best Practices

1. **Download kernel files first.** Before enabling KV cache quantization, ensure the CUDA kernel files are in your model directory.
2. **Combine with GPTQ for maximum savings.** Use Int4 GPTQ models with KV cache quantization for the best memory efficiency.
3. **Enable for batch inference.** KV cache quantization provides the most benefit when processing multiple requests simultaneously.
4. **Profile your workload.** Measure memory usage and throughput with and without KV cache quantization to quantify the benefit for your specific use case.
5. **Use the optimized kernel.** Always set `use_cache_kernel=True` for best performance (requires the CUDA kernel files).

Limitations

Current Limitations:
  1. Incompatible with Flash Attention - Cannot use both simultaneously
  2. Requires CUDA kernel files - May need manual download from HuggingFace
  3. CUDA only - CPU inference not supported
  4. Compilation required - First run compiles CUDA kernels (may take a few minutes)

Next Steps

GPTQ Quantization

Learn about GPTQ weight quantization to combine with KV cache

Performance Benchmarks

Detailed performance analysis and optimization strategies

Deployment Guide

Deploy quantized models in production environments

Quantization Overview

Return to quantization overview
