Mini-SGLang provides two KV cache management strategies: Radix Cache for automatic prefix sharing across requests, and Naive Cache for simple per-request isolation. The cache strategy is controlled by the --cache-type flag.

Overview

KV (Key-Value) cache stores intermediate attention states during LLM inference. Effective cache management can dramatically reduce computation by reusing cached states for shared prompt prefixes.

Cache Strategies

Radix Cache (Default)

Radix cache organizes KV cache entries in a radix tree data structure to automatically detect and reuse shared prefixes across requests. This is the default cache management strategy.
Key Features:
  • Automatic prefix matching and reuse
  • LRU-based eviction for cache management
  • Supports dynamic node splitting and merging
  • Reduces redundant computation for common prefixes
Implementation: minisgl.kvcache.radix_cache.RadixPrefixCache (source: python/minisgl/kvcache/radix_cache.py:101)
When to Use:
  • Multi-turn conversations with shared history
  • Batch processing with common system prompts
  • Few-shot prompting with repeated examples
  • RAG (Retrieval-Augmented Generation) with shared context

Naive Cache

Naive cache provides simple per-request cache isolation without any prefix sharing.
Key Features:
  • No prefix matching or sharing
  • Minimal memory overhead
  • Simpler implementation for debugging
  • Each request maintains independent cache
Implementation: minisgl.kvcache.naive_cache.NaivePrefixCache (source: python/minisgl/kvcache/naive_cache.py:16)
When to Use:
  • Unique requests with no shared prefixes
  • Testing and debugging
  • Benchmarking without cache effects
  • Simple single-request scenarios

Configuration

Use the --cache-type flag to select the cache strategy:
# Use radix cache (default)
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type radix

# Use naive cache
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type naive
--cache-type (string, default: "radix")
KV cache management strategy. Choices: radix, naive
  • radix: Enables automatic prefix sharing across requests
  • naive: Disables prefix sharing, each request has independent cache
Source: minisgl/server/args.py:206

Radix Cache Details

How It Works

Radix cache organizes KV cache entries in a radix tree structure:
  1. Tree Structure: Each node represents a prefix of token IDs
  2. Prefix Matching: Incoming requests traverse the tree to find matching prefixes
  3. Cache Reuse: Matched nodes provide cached KV states
  4. Dynamic Splitting: Nodes split when partial matches occur
  5. LRU Eviction: Least recently used nodes are evicted when memory is needed
[Figure: radix attention illustration, from the LMSYS blog]
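The matching, splitting, and reuse steps above can be sketched as follows. This is a simplified, self-contained illustration, not the actual RadixPrefixCache: the real implementation also tracks page indices, reference counts, and LRU timestamps, and the names here (`Node`, `match_prefix`, `insert_prefix`) are chosen for the sketch.

```python
class Node:
    """A radix-tree node; the edge into the node carries a run of token IDs."""
    def __init__(self, key=()):
        self.key = tuple(key)   # token IDs on the edge into this node
        self.children = {}      # first token ID -> child Node

def _common_prefix_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_prefix(root, tokens):
    """Walk the tree and return the longest cached prefix of `tokens`."""
    matched, node = [], root
    while tokens:
        child = node.children.get(tokens[0])
        if child is None:
            break
        k = _common_prefix_len(child.key, tokens)
        matched.extend(child.key[:k])
        if k < len(child.key):      # partial match: stops inside this edge
            break
        node, tokens = child, tokens[k:]
    return matched

def insert_prefix(root, tokens):
    """Insert `tokens`, splitting a node when only part of its edge matches."""
    node, tokens = root, tuple(tokens)
    while tokens:
        child = node.children.get(tokens[0])
        if child is None:
            node.children[tokens[0]] = Node(tokens)
            return
        k = _common_prefix_len(child.key, tokens)
        if k < len(child.key):
            # Split: the shared part becomes a new parent of `child`.
            mid = Node(child.key[:k])
            node.children[tokens[0]] = mid
            child.key = child.key[k:]
            mid.children[child.key[0]] = child
            node = mid
        else:
            node = child
        tokens = tokens[k:]

# Demo: the second insert splits the first entry at the shared [1, 2].
root = Node()
insert_prefix(root, [1, 2, 3, 4])
insert_prefix(root, [1, 2, 9])
hit = match_prefix(root, [1, 2, 3, 5])   # longest cached prefix: [1, 2, 3]
```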

Tree Operations

Prefix Matching

Source: radix_cache.py:132
def match_prefix(self, input_ids: torch.Tensor) -> MatchResult:
    """Walk the radix tree to find the longest matching prefix.

    Returns a handle with the cached indices.
    """

Insertion

Source: radix_cache.py:136
def insert_prefix(self, input_ids: torch.Tensor, indices: torch.Tensor) -> InsertResult:
    """Insert a new prefix into the tree (aligned to page_size).

    May create new nodes or split existing ones.
    """

Eviction

Source: radix_cache.py:148
def evict(self, size: int) -> torch.Tensor:
    """Evict LRU leaf nodes to free the requested size.

    Returns the indices of the evicted pages.
    """
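The eviction policy can be sketched as a heap over unreferenced leaves, popped in oldest-timestamp order until enough pages are freed. This is a hedged illustration, not the actual implementation; the leaf fields (`timestamp`, `ref_count`, `num_pages`) mirror the node structure described below but their exact names and types here are assumptions.

```python
import heapq
from types import SimpleNamespace

def evict_lru(leaves, size_needed):
    """Evict unreferenced leaves in LRU order until `size_needed` pages are freed."""
    # Only leaves with ref_count == 0 are evictable; referenced leaves are protected.
    heap = [(leaf.timestamp, i, leaf) for i, leaf in enumerate(leaves)
            if leaf.ref_count == 0]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < size_needed:
        _, _, leaf = heapq.heappop(heap)
        freed += leaf.num_pages
        evicted.append(leaf)
    return evicted

# Hypothetical leaves: the ref_count == 1 node is protected from eviction.
leaves = [
    SimpleNamespace(timestamp=3, ref_count=0, num_pages=2),
    SimpleNamespace(timestamp=1, ref_count=0, num_pages=2),
    SimpleNamespace(timestamp=2, ref_count=1, num_pages=2),
]
evicted = evict_lru(leaves, size_needed=3)   # frees the two oldest unreferenced leaves
```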

Node Structure

Source: radix_cache.py:17
Each RadixTreeNode contains:
  • _key: Token IDs for this node
  • _value: Physical page indices in KV cache
  • _length: Number of tokens
  • children: Dictionary of child nodes
  • ref_count: Reference count for protection from eviction
  • timestamp: LRU timestamp
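A field-by-field sketch of this node as a dataclass, assuming plausible types (the field names come from the list above; the types and defaults are assumptions, not the actual source):

```python
from dataclasses import dataclass, field

@dataclass
class RadixTreeNodeSketch:
    """Sketch of the RadixTreeNode fields described above."""
    _key: tuple = ()        # token IDs for this node
    _value: tuple = ()      # physical page indices in the KV cache
    _length: int = 0        # number of tokens
    children: dict = field(default_factory=dict)  # first token ID -> child node
    ref_count: int = 0      # > 0 protects the node from eviction
    timestamp: float = 0.0  # refreshed on access, used for LRU ordering

node = RadixTreeNodeSketch(_key=(1, 2), _value=(40, 41), _length=2)
```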

Memory Management

Radix cache tracks two types of memory:
  1. Protected Size: Nodes currently in use (ref_count > 0)
  2. Evictable Size: Unused nodes available for eviction (ref_count == 0)
Source: radix_cache.py:181
@property
def size_info(self) -> SizeInfo:
    return SizeInfo(
        evictable_size=self.evictable_size,
        protected_size=self.protected_size,
    )

Page Alignment

Radix cache aligns all operations to page boundaries:
Source: radix_cache.py:137
insert_len = align_down(len(input_ids), self.page_size)
This ensures efficient memory management and compatibility with paged attention backends.
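A plausible definition of `align_down` is simple integer rounding; this sketch assumes that behavior rather than quoting the actual helper:

```python
def align_down(length: int, page_size: int) -> int:
    """Round `length` down to a whole number of pages."""
    return (length // page_size) * page_size

# With page_size=16, only the first 496 tokens of a 500-token prompt are
# inserted into the cache; with page_size=1 every token is cacheable.
cacheable = align_down(500, 16)
```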

Naive Cache Details

Implementation

Naive cache provides a minimal implementation:
Source: naive_cache.py:16
class NaivePrefixCache(BasePrefixCache):
    def match_prefix(self, input_ids: torch.Tensor) -> MatchResult:
        # Always returns empty match (no cache hit)
        return MatchResult(NaiveCacheHandle())
    
    def insert_prefix(self, input_ids: torch.Tensor, indices: torch.Tensor) -> InsertResult:
        # Does nothing (no caching)
        return InsertResult(0, NaiveCacheHandle())
    
    def evict(self, size: int) -> torch.Tensor:
        # Cannot evict (no cache)
        raise NotImplementedError("NaiveCacheManager does not support eviction.")

Memory Overhead

Naive cache has minimal memory overhead:
  • No tree structure
  • No metadata tracking
  • Simple cache handle objects

Performance Comparison

Radix Cache Benefits

Scenario: 100 requests with shared 500-token system prompt
  • Without radix cache: 100 × 500 = 50,000 tokens computed
  • With radix cache: 500 (first request) + 100 × unique tokens
  • Speedup: approaches 100× for the shared-prefix portion (50,000 / 500); per-request unique tokens reduce the overall gain
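The arithmetic above can be made concrete. In this sketch the number of unique tokens per request is an assumed parameter, which is why the end-to-end speedup is lower than the shared-prefix-only figure:

```python
requests, shared = 100, 500   # 100 requests sharing a 500-token system prompt

def prompt_tokens(unique: int, cached: bool) -> int:
    """Total prompt tokens computed across all requests."""
    if cached:
        return shared + requests * unique   # shared prefix computed once
    return requests * (shared + unique)     # shared prefix recomputed per request

# Speedup for a few assumed per-request unique-token counts.
speedups = {u: prompt_tokens(u, cached=False) / prompt_tokens(u, cached=True)
            for u in (0, 20, 100)}
# The benefit shrinks as the unique portion of each prompt grows.
```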

When Naive Outperforms Radix

  • No shared prefixes: Radix overhead without benefits
  • Highly variable requests: Rare cache hits
  • Small batch sizes: Overhead dominates

Usage Examples

Multi-Turn Chat with Radix Cache

python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type radix \
  --page-size 1 \
  --memory-ratio 0.9
Benefits:
  • Reuses chat history across turns
  • Reduces latency for follow-up questions
  • Improves throughput for concurrent conversations

Batch Processing with Shared Prompts

python -m minisgl \
  --model "meta-llama/Llama-3-8B" \
  --cache-type radix \
  --max-running-requests 256
Benefits:
  • Shares system prompt across all requests
  • Reuses few-shot examples
  • Dramatically reduces prefill time

Simple Testing with Naive Cache

python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type naive \
  --max-running-requests 1
Benefits:
  • Simpler debugging
  • Predictable memory usage
  • No cache-related side effects

Shell Mode (Auto Radix Cache)

python -m minisgl --model "Qwen/Qwen3-0.6B" --shell-mode
In shell mode, radix cache automatically reuses conversation history across turns. Use /reset to clear the cache and start a new session.

Page Size Interaction

Cache management interacts with page size (--page-size):
  • Radix cache: Aligns all cache entries to page boundaries
  • Smaller page size: More granular cache reuse, higher overhead
  • Larger page size: Less granular reuse, lower overhead
Recommendation: Use --page-size 1 with radix cache for maximum reuse granularity.
python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type radix \
  --page-size 1
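Concretely, because cache entries are page-aligned, the page size caps how much of a shared prefix is actually reusable. A small worked example (the 500-token prefix length is illustrative):

```python
def reusable_tokens(shared_len: int, page_size: int) -> int:
    """Tokens of a shared prefix that land on whole pages and can be reused."""
    return (shared_len // page_size) * page_size

# Reusable portion of a 500-token shared prefix at different page sizes.
granularity = {ps: reusable_tokens(500, ps) for ps in (1, 16, 64)}
```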

Memory Ratio Tuning

The --memory-ratio flag controls how much GPU memory is allocated for KV cache:
# Allocate 95% of GPU memory for KV cache
python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type radix \
  --memory-ratio 0.95
With radix cache:
  • Higher memory ratio → More cache reuse opportunities
  • Lower memory ratio → More frequent evictions

Design Origin

Radix cache is adopted from the original SGLang design, which introduced this optimization for efficient KV cache management in LLM serving. Reference: SGLang Blog Post on Radix Attention

Source Code Reference

Cache management implementations:
  • Registry: kvcache/__init__.py:24 (SUPPORTED_CACHE_MANAGER)
  • Radix Cache: kvcache/radix_cache.py:101 (RadixPrefixCache)
  • Naive Cache: kvcache/naive_cache.py:16 (NaivePrefixCache)
  • Base Interface: kvcache/base.py (BasePrefixCache)
Server argument parsing:
  • CLI Flag: server/args.py:206 (--cache-type)

Troubleshooting

High Memory Usage with Radix Cache

Reduce memory ratio or number of pages:
python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type radix \
  --memory-ratio 0.85

Poor Cache Hit Rate

Check if requests actually share prefixes:
# Try naive cache to compare performance
python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --cache-type naive
If performance is similar, your workload may not benefit from radix cache.

Eviction Errors

Naive cache does not support eviction. If you see eviction errors, ensure you’re using radix cache:
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type radix
