Prefix caching is a key optimization in SGLang that lets multiple requests share Key-Value (KV) cache entries for common token sequences. This dramatically reduces memory usage and computation time, especially for workloads with repeated context.

How Prefix Caching Works

When processing LLM requests, SGLang automatically:
  1. Detects shared prefixes between incoming requests and cached sequences
  2. Reuses cached KV states instead of recomputing attention
  3. Stores completed requests in the cache for future reuse
  4. Evicts old entries when memory is full
Prefix caching happens automatically - no application changes needed. Just send your requests and SGLang handles the rest.
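The detect-and-reuse steps above can be sketched with a toy trie-based lookup. This is a hypothetical `PrefixCache`, not SGLang's actual radix tree, but it shows the core idea: insert finished sequences, then measure how many leading tokens of a new request are already cached.
```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token id -> TrieNode

class PrefixCache:
    """Toy prefix cache: stores token sequences in a trie and reports
    how many leading tokens of a new request are already cached."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, token_ids):
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])              # first request's tokens cached
hit = cache.match_prefix([1, 2, 3, 9])  # shares the first 3 tokens
```
Here `hit` is 3: the fourth token differs, so only the first three positions skip recomputation.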

Cache Lifecycle

1. Request Arrival

When a request arrives, the scheduler searches for matching prefixes:
# From schedule_policy.py
def _compute_prefix_matches(self, waiting_queue, policy):
    for r in waiting_queue:
        prefix_ids = r.origin_input_ids + r.output_ids
        
        # Find longest matching prefix in cache
        match_result = self.tree_cache.match_prefix(
            MatchPrefixParams(
                key=RadixKey(token_ids=prefix_ids, extra_key=r.extra_key)
            )
        )
        
        # Store matched indices and tree node
        r.prefix_indices = match_result.device_indices
        r.last_node = match_result.last_device_node
Reference: python/sglang/srt/managers/schedule_policy.py:182-211

2. KV Cache Allocation

For uncached portions, new KV cache is allocated:
# Two-level memory pool
class ReqToTokenPool:
    def alloc(self, reqs: List[Req]) -> Optional[List[int]]:
        """Allocate request slots in the pool."""
        need_size = len(reqs)
        if need_size > len(self.free_slots):
            return None
        
        select_index = self.free_slots[:need_size]
        self.free_slots = self.free_slots[need_size:]
        
        for offset, r in enumerate(reqs):
            r.req_pool_idx = select_index[offset]
        
        return [r.req_pool_idx for r in reqs]
Reference: python/sglang/srt/mem_cache/memory_pool.py:155-177

3. Attention Computation

During forward pass, attention uses both cached and new KV states:
class RadixAttention(nn.Module):
    def forward(self, q, k, v, forward_batch, save_kv_cache=True, **kwargs):
        # k, v contain only NEW tokens
        # Cached tokens accessed via forward_batch.req_to_token_pool
        
        return forward_batch.attn_backend.forward(
            q, k, v, self, forward_batch, save_kv_cache, **kwargs
        )
The attention backend (e.g., FlashInfer) receives:
  • New KV states: For uncached tokens
  • Cache indices: Pointers to cached KV states
  • Sequence lengths: Total length including cached portion
Reference: python/sglang/srt/layers/radix_attention.py:99-135

4. Request Completion

When a request finishes, its KV cache is inserted into the tree:
def cache_finished_req(self, req: Req, is_insert: bool = True):
    """Cache request when it finishes."""
    kv_committed_len = req.pop_committed_kv_cache()
    
    # Get token IDs and KV indices
    token_ids = (req.origin_input_ids + req.output_ids)[:kv_committed_len]
    kv_indices = self.req_to_token_pool.req_to_token[
        req.req_pool_idx, :len(token_ids)
    ]
    
    # Convert to bigram keys for EAGLE if needed
    keys = convert_to_bigram_key(token_ids) if self.is_eagle else token_ids
    keys = page_align_keys(keys, self.page_size)
    
    # Insert into radix tree
    if is_insert:
        result = self.insert(
            InsertParams(
                key=RadixKey(keys, req.extra_key, is_bigram=self.is_eagle),
                value=kv_indices[:len(keys)],
                priority=req.priority
            )
        )
        
        # Free duplicate KV cache
        self.token_to_kv_pool_allocator.free(
            kv_indices[req.cache_protected_len:result.prefix_len]
        )
Reference: python/sglang/srt/mem_cache/radix_cache.py:459-504
Already-cached portions are freed to avoid duplication. The tree maintains a single copy of each unique prefix.

Use Cases

System Prompts

Many applications use the same system prompt for every request:
system_prompt = """You are a helpful assistant that answers questions 
about Python programming. Always provide code examples."""

# First request: Full prefill (800 tokens)
request1 = system_prompt + "\nUser: How do I read a file?"

# Second request: Only processes new part (30 tokens)
request2 = system_prompt + "\nUser: How do I write a file?"

# Third request: Only processes new part (35 tokens)  
request3 = system_prompt + "\nUser: How do I delete a file?"
With prefix caching, the 800-token system prompt is processed once and reused, reducing prefill time by ~96% for subsequent requests.

Few-Shot Examples

Few-shot prompting with examples benefits greatly from caching:
few_shot_prefix = """Translate English to French:

English: Hello
French: Bonjour

English: Goodbye  
French: Au revoir

English: Thank you
French: Merci

"""  # ~100 tokens, cached

# Only "Good morning" needs processing
request = few_shot_prefix + "English: Good morning\nFrench:"

Multi-Turn Conversations

Conversation history is incrementally cached:
# Turn 1
conv1 = "User: What's the weather?\nAssistant: It's sunny today.\n"
# Turn 2: Reuses conv1 prefix
conv2 = conv1 + "User: Should I bring an umbrella?\nAssistant: No need, it's clear.\n" 
# Turn 3: Reuses conv2 prefix
conv3 = conv2 + "User: What about tomorrow?\nAssistant: "

Document-Based QA

Long documents can be cached and queried multiple times:
# First request: Document is processed (5000 tokens)
request1 = f"{long_document}\n\nQuestion: What is X?\nAnswer:"

# Subsequent requests: Only question is processed (~20 tokens each)
request2 = f"{long_document}\n\nQuestion: What is Y?\nAnswer:"
request3 = f"{long_document}\n\nQuestion: What is Z?\nAnswer:"
This pattern is common in RAG (Retrieval-Augmented Generation) applications where the same document is queried multiple times.

Memory Management

Protected vs Evictable Memory

Cache memory is divided into two categories:
class RadixCache:
    def inc_lock_ref(self, node: TreeNode):
        """Lock nodes used by active requests."""
        while node != self.root_node:
            if node.lock_ref == 0:
                # Move from evictable to protected
                self.evictable_size_ -= len(node.key)
                self.protected_size_ += len(node.key)
            node.lock_ref += 1
            node = node.parent
    
    def dec_lock_ref(self, node: TreeNode):
        """Unlock nodes when requests finish."""
        while node != self.root_node:
            if node.lock_ref == 1:
                # Move from protected to evictable
                self.evictable_size_ += len(node.key)
                self.protected_size_ -= len(node.key)
            node.lock_ref -= 1
            node = node.parent
Reference: python/sglang/srt/mem_cache/radix_cache.py:607-639
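The accounting above can be exercised with a toy two-node chain. This sketch uses a simplified `TreeNode` and cache, not the real classes, but mirrors the inc/dec logic: pinning a leaf protects every node on its path to the root.
```python
class TreeNode:
    def __init__(self, key, parent=None):
        self.key = key          # token ids stored at this node
        self.parent = parent
        self.lock_ref = 0

class ToyCache:
    """Simplified protected/evictable accounting, mirroring inc/dec_lock_ref."""

    def __init__(self, root):
        self.root_node = root
        self.evictable_size_ = 0
        self.protected_size_ = 0

    def inc_lock_ref(self, node):
        while node is not self.root_node:
            if node.lock_ref == 0:
                self.evictable_size_ -= len(node.key)
                self.protected_size_ += len(node.key)
            node.lock_ref += 1
            node = node.parent

    def dec_lock_ref(self, node):
        while node is not self.root_node:
            if node.lock_ref == 1:
                self.evictable_size_ += len(node.key)
                self.protected_size_ -= len(node.key)
            node.lock_ref -= 1
            node = node.parent

root = TreeNode([])
prefix = TreeNode([1, 2, 3], parent=root)  # 3 cached tokens
leaf = TreeNode([4, 5], parent=prefix)     # 2 more tokens

cache = ToyCache(root)
cache.evictable_size_ = 5                  # both nodes start evictable

cache.inc_lock_ref(leaf)   # active request pins the whole path: 5 tokens protected
cache.dec_lock_ref(leaf)   # request finishes: all 5 tokens evictable again
```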

Eviction Triggers

Eviction occurs when:
  1. Memory pressure: Not enough space for new requests
  2. Scheduled eviction: Periodic cleanup of old entries
  3. Explicit flush: Manual cache clearing
def evict(self, params: EvictParams) -> EvictResult:
    """Evict num_tokens from cache."""
    num_tokens = params.num_tokens
    leaves = list(self.evictable_leaves)  # Only leaves can be evicted
    
    # Build priority heap
    eviction_heap = [
        (self.eviction_strategy.get_priority(node), node)
        for node in leaves
    ]
    heapq.heapify(eviction_heap)
    
    # Evict until target reached
    num_evicted = 0
    while num_evicted < num_tokens and len(eviction_heap):
        _priority, x = heapq.heappop(eviction_heap)
        
        self.token_to_kv_pool_allocator.free(x.value)
        num_evicted += len(x.value)
        self._delete_leaf(x)
        
        # Parent might become leaf now
        if len(x.parent.children) == 0 and x.parent.lock_ref == 0:
            heapq.heappush(
                eviction_heap,
                (self.eviction_strategy.get_priority(x.parent), x.parent)
            )
Reference: python/sglang/srt/mem_cache/radix_cache.py:578-605
Only leaf nodes (endpoints of cached sequences) can be evicted initially. As leaves are removed, their parents may become new leaves and become evictable.

Eviction Policies

Choose the policy that best fits your workload:

LRU (Least Recently Used)

Best for: General-purpose caching with temporal locality
class LRUStrategy:
    def get_priority(self, node: TreeNode) -> float:
        return node.last_access_time  # Lower = evict first
Use when recent requests are likely to be repeated soon.

LFU (Least Frequently Used)

Best for: Caching popular prefixes
class LFUStrategy:
    def get_priority(self, node: TreeNode) -> float:
        return node.hit_count  # Lower = evict first
Keeps frequently-accessed prefixes even if not accessed recently.

FIFO (First In First Out)

Best for: Simple, predictable eviction
Evicts the oldest inserted nodes first, regardless of usage.

Priority-Based

Best for: Multi-tenant or QoS-aware systems
class PriorityStrategy:
    def get_priority(self, node: TreeNode) -> float:
        return -node.priority  # Higher priority = lower eviction chance
Requests can specify priority levels to influence cache retention.
For most use cases, LRU provides the best balance of hit rate and simplicity. Use priority-based eviction for production systems with different SLA tiers.
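The difference between the strategies can be seen in a small self-contained demo. The node records and names here are hypothetical; the heap-based ordering matches the priority heap used in evict().
```python
import heapq

# Hypothetical cached nodes with the attributes the strategies read.
nodes = [
    {"name": "system_prompt", "last_access_time": 1.0, "hit_count": 50},
    {"name": "old_chat",      "last_access_time": 2.0, "hit_count": 1},
    {"name": "recent_query",  "last_access_time": 9.0, "hit_count": 2},
]

def eviction_order(nodes, priority_fn):
    """Pop nodes lowest-priority-first, as the eviction heap does."""
    heap = [(priority_fn(n), i, n["name"]) for i, n in enumerate(nodes)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# LRU evicts the oldest-touched node; LFU evicts the least-hit node.
lru_order = eviction_order(nodes, lambda n: n["last_access_time"])
lfu_order = eviction_order(nodes, lambda n: n["hit_count"])
```
LRU would evict the old-but-popular system prompt first, while LFU keeps it and evicts the rarely-hit chat instead, which is why LFU suits workloads dominated by a few hot prefixes.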

Advanced Features

In-Batch Prefix Caching

SGLang detects shared prefixes even within a single batch:
# These requests share "What is the capital of"
requests = [
    "What is the capital of France?",
    "What is the capital of Germany?", 
    "What is the capital of Italy?",
]

# Scheduler prioritizes the first request
# Others wait for the prefix to be cached
# Then execute with the cached prefix
Reference: python/sglang/srt/managers/schedule_policy.py:213-240
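A rough sketch of the idea, using longest-common-prefix comparison over hypothetical token ids (not the scheduler's actual logic):
```python
def common_prefix_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical token ids for the three requests above.
batch = [
    [10, 20, 30, 40, 1],   # "... France?"
    [10, 20, 30, 40, 2],   # "... Germany?"
    [10, 20, 30, 40, 3],   # "... Italy?"
]

# Run batch[0] first; once its prefix is cached, the others
# only need to prefill their unshared suffix.
shared = min(common_prefix_len(batch[0], r) for r in batch[1:])
```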

Extra Key Namespace

Separate cache namespaces using extra_key:
class RadixKey:
    def __init__(
        self,
        token_ids: List[int],
        extra_key: Optional[str] = None,  # e.g., "lora_id:customer1"
        is_bigram: bool = False,
    ):
        self.token_ids = token_ids
        self.extra_key = extra_key
Use cases:
  • LoRA adapters: Separate cache per adapter
  • Multi-tenancy: Isolate cache by customer
  • Cache versioning: Invalidate by changing version
Reference: python/sglang/srt/mem_cache/radix_cache.py:67-79
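The isolation effect can be illustrated with a toy lookup keyed on (extra_key, token prefix). This is a sketch with a flat dict, not the real tree, but the principle is the same: identical token sequences under different keys never match each other.
```python
# Toy namespaced cache: extra_key prefixes every lookup key, so the
# same token sequence under different adapters occupies separate entries.
cache = {}

def cache_key(token_ids, extra_key=None):
    return (extra_key, tuple(token_ids))

cache[cache_key([1, 2, 3], "lora:customer1")] = "kv_a"
cache[cache_key([1, 2, 3], "lora:customer2")] = "kv_b"

hit = cache.get(cache_key([1, 2, 3], "lora:customer1"))   # shared within namespace
miss = cache.get(cache_key([1, 2, 3], "lora:customer3"))  # isolated: no cross-tenant hit
```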

Page Alignment

For memory efficiency, cache can be aligned to page boundaries:
# With page_size=16
sequence = [1, 2, 3, ..., 35]  # 35 tokens

# Only first 32 tokens cached (2 complete pages)
# Last 3 tokens not cached (incomplete page)
cached_portion = sequence[:32]
This ensures:
  • Efficient memory allocation: No fragmented pages
  • Better sharing: Only complete pages are shared
  • Aligned access: Hardware-friendly memory access patterns
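The truncation rule above amounts to rounding down to a multiple of the page size, which can be sketched as:
```python
def page_align_len(num_tokens: int, page_size: int) -> int:
    """Number of tokens filling complete pages; the remainder is not cached."""
    return (num_tokens // page_size) * page_size

sequence = list(range(35))                             # 35 tokens
cached = sequence[:page_align_len(len(sequence), 16)]  # first 32 tokens (2 pages)
```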
Set --page-size based on your hardware and model. Typical values: 1 (no alignment), 8, 16, or 32.

Performance Metrics

Cache Hit Rate

Monitor cache effectiveness:
hit_rate = total_cached_tokens / total_tokens_processed
Typical hit rates:
  • RAG workloads: 80-95%
  • Few-shot prompting: 70-90%
  • Multi-turn chat: 60-80%
  • Diverse queries: 20-40%
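A minimal running tracker for this metric might look like the following. The helper is hypothetical, not an SGLang API; it just accumulates the two counters from the formula above.
```python
class HitRateTracker:
    """Accumulates cached vs. total tokens across requests."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_tokens = 0

    def record(self, num_cached: int, num_total: int):
        self.cached_tokens += num_cached
        self.total_tokens += num_total

    @property
    def hit_rate(self) -> float:
        return self.cached_tokens / self.total_tokens if self.total_tokens else 0.0

tracker = HitRateTracker()
tracker.record(800, 830)  # system-prompt request: 800 of 830 tokens cached
tracker.record(0, 830)    # cold request: nothing cached
rate = tracker.hit_rate   # 800 / 1660
```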

Memory Usage

Track cache memory:
total_cache = tree_cache.protected_size() + tree_cache.evictable_size()
active_requests = tree_cache.protected_size()
available_for_reuse = tree_cache.evictable_size()

Time Savings

Measure prefill time reduction:
# Without caching
ttft_no_cache = num_prefix_tokens * time_per_token

# With caching
ttft_cached = num_uncached_tokens * time_per_token

# Speedup
speedup = ttft_no_cache / ttft_cached
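Plugging in the system-prompt example from earlier (an 800-token cached prefix plus a 30-token question) gives a concrete number:
```python
time_per_token = 1.0       # arbitrary units; cancels out in the ratio
num_prefix_tokens = 830    # 800-token system prompt + 30-token question
num_uncached_tokens = 30   # only the question needs prefill

ttft_no_cache = num_prefix_tokens * time_per_token
ttft_cached = num_uncached_tokens * time_per_token
speedup = ttft_no_cache / ttft_cached  # ~27.7x faster time-to-first-token
```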

Configuration

Key configuration options:
# Enable/disable prefix caching
--disable-radix-cache  # Disable (not recommended)

# Set eviction policy  
--radix-eviction-policy lru  # Options: lru, lfu, fifo, mru, filo, priority

# Memory allocation
--mem-fraction-static 0.9  # GPU memory for KV cache

# Page alignment
--page-size 16  # Align cache to 16-token pages

# Chunked prefill (for large prefixes)
--chunked-prefill-size 2048  # Split large prefills

Best Practices

  1. Design for reuse: Structure prompts with common prefixes
  2. Monitor hit rates: Low hit rates indicate poor prompt design
  3. Use appropriate eviction: Match policy to workload characteristics
  4. Set sufficient memory: Undersized cache leads to thrashing
  5. Leverage extra_key: Isolate cache when needed
For maximum benefit, organize your application to send requests with shared context consecutively or in batches.

Limitations

  1. Exact token match required: Even one different token breaks the prefix
  2. Special tokens matter: System tokens must match exactly
  3. Tokenizer dependence: Same text with different tokenizers won’t match
  4. Memory overhead: Tree structure adds ~10-20% overhead
  5. Eviction cost: Cache lookup and eviction have CPU overhead