Chunked Prefill
Chunked Prefill is a technique that splits long prompts into smaller chunks during the prefill phase, significantly reducing peak memory usage and preventing Out-Of-Memory (OOM) errors in long-context serving.

Overview

Introduced by Sarathi-Serve, Chunked Prefill is enabled by default in Mini-SGLang. This feature addresses the memory challenges of processing very long input sequences by breaking them into manageable pieces.
Chunked Prefill is particularly important for long-context models where a single prefill batch could consume excessive GPU memory, starving the decode phase and reducing overall throughput.

The Problem: Memory Spikes

During the prefill phase, the system must:
  1. Compute attention over the entire input sequence
  2. Store KV cache for all input tokens
  3. Store intermediate activations for the forward pass
For long sequences (e.g., 32K or 128K tokens), this can cause:
  • Memory spikes that exceed GPU capacity
  • OOM errors that crash the serving system
  • Starvation of decode requests as memory is monopolized by prefill
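To get a sense of scale, the KV cache alone for a very long prompt can run into tens of gigabytes. The back-of-envelope sketch below illustrates this; the layer and head dimensions are illustrative assumptions, not taken from any specific Mini-SGLang model:

```python
# Back-of-envelope KV-cache size for a single long prompt.
# Model dimensions here are hypothetical, chosen only for illustration.
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer per token.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * dtype_bytes

# A 128K-token prompt on a hypothetical 32-layer model with 8 KV heads
# and fp16 cache entries:
full = kv_cache_bytes(131_072, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{full / 2**30:.1f} GiB of KV cache for one request")  # → 16.0 GiB
```

Chunking does not reduce the final KV-cache footprint (every token's KV entry is still stored), but it bounds the transient activation memory of each forward pass.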

Traditional Approach

Request: [32,768 tokens] → Process all at once → OOM!

Chunked Prefill Approach

Request: [32,768 tokens]
  Chunk 1: [0:8192] → Process → Cache KV
  Chunk 2: [8192:16384] → Process → Cache KV  
  Chunk 3: [16384:24576] → Process → Cache KV
  Chunk 4: [24576:32768] → Process → Cache KV
  Decode: [32768] → Generate tokens
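The chunk boundaries above follow directly from a fixed chunk size. A minimal sketch of the arithmetic:

```python
# Sketch: split a prompt into prefill chunks of at most chunk_size tokens.
def chunk_ranges(input_len: int, chunk_size: int) -> list:
    return [(start, min(start + chunk_size, input_len))
            for start in range(0, input_len, chunk_size)]

print(chunk_ranges(32_768, 8_192))
# → [(0, 8192), (8192, 16384), (16384, 24576), (24576, 32768)]
```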

How It Works

Prefill Adder

The PrefillAdder class implements the chunking logic in python/minisgl/scheduler/prefill.py:116:
class PrefillAdder:
    def _add_one_req(
        self,
        pending_req: PendingReq,
        cache_handle: BaseCacheHandle,
        table_idx: int,
        cached_len: int,
    ) -> Req:
        remain_len = pending_req.input_len - cached_len
        chunk_size = min(self.token_budget, remain_len)
        is_chunked = chunk_size < remain_len
        CLS = ChunkedReq if is_chunked else Req
        
        # Process only chunk_size tokens
        self.token_budget -= chunk_size
        _slice = slice(cached_len, cached_len + chunk_size)
        device_ids = self.table_manager.token_pool[table_idx, _slice]
        device_ids.copy_(pending_req.input_ids[_slice].pin_memory(), non_blocking=True)
        
        return CLS(
            input_ids=pending_req.input_ids[: cached_len + chunk_size],
            table_idx=table_idx,
            cached_len=cached_len,
            output_len=pending_req.output_len,
            uid=pending_req.uid,
            cache_handle=cache_handle,
            sampling_params=pending_req.sampling_params,
        )
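Distilled to its core, the chunking decision above is a min() against the remaining token budget. A standalone sketch of just that decision, independent of the real PendingReq type:

```python
# Distilled sketch of the chunking decision in PrefillAdder._add_one_req:
# process min(token_budget, remaining) tokens, and flag the request as
# chunked when the budget cannot cover the remainder.
def plan_chunk(token_budget: int, input_len: int, cached_len: int):
    remain_len = input_len - cached_len
    chunk_size = min(token_budget, remain_len)
    is_chunked = chunk_size < remain_len
    return chunk_size, is_chunked

# A 20K-token prompt with an 8K budget:
print(plan_chunk(8192, 20_000, 0))       # → (8192, True): becomes a ChunkedReq
print(plan_chunk(8192, 20_000, 16_384))  # → (3616, False): final chunk, plain Req
```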

Chunked Request Class

Chunked requests are marked with a special class to prevent premature completion:
class ChunkedReq(Req):
    def append_host(self, next_token: torch.Tensor) -> None:
        raise NotImplementedError("ChunkedReq should not be sampled")
    
    @property
    def can_decode(self) -> bool:
        return False  # avoid being added to decode manager
This ensures chunked prefill requests:
  • Are not sampled for output tokens
  • Stay in prefill queue until all chunks are processed
  • Don’t enter decode phase prematurely
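A toy illustration of this gating, using minimal stand-in classes rather than the real Req hierarchy (which carries far more state):

```python
# Toy stand-ins showing how the can_decode flag keeps chunked requests
# out of the decode manager.
class Req:
    @property
    def can_decode(self) -> bool:
        return True

class ChunkedReq(Req):
    def append_host(self, next_token) -> None:
        raise NotImplementedError("ChunkedReq should not be sampled")

    @property
    def can_decode(self) -> bool:
        return False  # never promoted to the decode phase

batch = [Req(), ChunkedReq(), Req()]
decodable = [r for r in batch if r.can_decode]
print(len(decodable))  # → 2: the chunked request is filtered out
```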

Scheduling Logic

The scheduler processes chunks across multiple iterations:
def schedule_next_batch(self, prefill_budget: int) -> Batch | None:
    adder = PrefillAdder(
        token_budget=prefill_budget,
        reserved_size=self.decode_manager.inflight_tokens,
        cache_manager=self.cache_manager,
        table_manager=self.table_manager,
    )
    
    reqs: List[Req] = []
    chunked_list: List[PendingReq] = []
    
    for pending_req in self.pending_list:
        if req := adder.try_add_one(pending_req):
            pending_req.chunked_req = None
            if isinstance(req, ChunkedReq):
                # Keep in pending list for next chunk
                pending_req.chunked_req = req
                chunked_list.append(pending_req)
            reqs.append(req)
        else:
            break
    
    # Chunked requests stay at front of queue
    self.pending_list = chunked_list + self.pending_list[len(reqs):]
    return Batch(reqs=reqs, phase="prefill")
Chunked requests are prioritized in the pending list to ensure all chunks of a request are processed before moving to new requests. This prevents head-of-line blocking.
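Across iterations, the net effect is that a long request resumes from its cached prefix each time until prefill completes. A simplified simulation, assuming a fixed per-iteration budget and no competing requests:

```python
# Sketch: how a long request advances through prefill over successive
# scheduler iterations, assuming a fixed per-iteration token budget.
def simulate_prefill(input_len: int, budget: int) -> list:
    """Return the (start, end) token range processed in each iteration."""
    schedule, cached_len = [], 0
    while cached_len < input_len:
        chunk = min(budget, input_len - cached_len)
        schedule.append((cached_len, cached_len + chunk))
        cached_len += chunk  # next iteration resumes from the cached prefix
    return schedule

for i, (lo, hi) in enumerate(simulate_prefill(20_000, 8_192), start=1):
    print(f"iteration {i}: tokens [{lo}:{hi}]")
# → iteration 1: tokens [0:8192]
#   iteration 2: tokens [8192:16384]
#   iteration 3: tokens [16384:20000]
```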

Configuration

Max Prefill Length

Control the chunk size with --max-prefill-length:
# Default: 8192 tokens per chunk
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 8192

# Larger chunks (more memory, faster prefill)
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 16384

# Smaller chunks (less memory, slower prefill)
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 4096
The --max-prefill-length value is exposed internally as SchedulerConfig.max_extend_tokens.
The token_budget in the prefill adder is initialized to max_extend_tokens and shared across all requests in a batch, so the total number of tokens processed in a single iteration stays within the limit.
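A sketch of how one shared budget is consumed greedily across the pending list; the last request admitted may get only part of its tokens (and would become the chunked one):

```python
# Sketch: one shared token budget per batch, consumed greedily across
# pending requests, mirroring how the prefill adder walks the queue.
def fill_batch(pending_lens: list, budget: int) -> list:
    """Return the number of tokens admitted per request until the budget runs out."""
    admitted = []
    for remain in pending_lens:
        if budget == 0:
            break
        take = min(budget, remain)
        admitted.append(take)
        budget -= take
    return admitted

# With max_extend_tokens=8192 and three prompts of 5000/2000/4000 tokens,
# the third request is only partially admitted (i.e. chunked):
print(fill_batch([5000, 2000, 4000], 8192))  # → [5000, 2000, 1192]
```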

Choosing the Right Value

  • Too large: May cause OOM on long sequences
  • Too small: Excessive overhead from many small batches
  • Recommended: Start with default (8192) and adjust based on:
    • Available GPU memory
    • Typical sequence lengths
    • Model size and architecture
Setting --max-prefill-length to a very small value (e.g., 128) is not recommended: it can significantly degrade performance due to excessive chunking overhead. Values between 4096 and 16384 work well for most use cases.
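The overhead argument is easy to quantify: each chunk costs one scheduler iteration and one forward-pass launch. For a 32K-token prompt:

```python
# Each chunk costs a scheduler iteration and a kernel launch, so very
# small chunk sizes multiply fixed per-iteration overhead.
import math

def num_iterations(input_len: int, chunk_size: int) -> int:
    return math.ceil(input_len / chunk_size)

for size in (128, 4096, 8192, 16384):
    print(f"chunk={size:>5}: {num_iterations(32_768, size):>3} prefill iterations")
# → chunk=  128: 256 prefill iterations
#   chunk= 4096:   8 prefill iterations
#   chunk= 8192:   4 prefill iterations
#   chunk=16384:   2 prefill iterations
```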

Benefits

Prevents OOM Errors

By limiting the maximum tokens processed in a single forward pass, chunked prefill prevents memory spikes that could crash the server. Without chunked prefill:
[Long request: 64K tokens] → OOM → Server crash
With chunked prefill:
[Long request: 64K tokens]
  → Chunk 1: 8K tokens ✓
  → Chunk 2: 8K tokens ✓
  → ... (8 chunks total)
  → Success!

Enables Long-Context Serving

Models with large context windows (32K, 128K, or even 1M tokens) can be served effectively:
# Serve Qwen3-14B with 32K context
python -m minisgl --model "Qwen/Qwen3-14B" --max-prefill-length 8192

Better Resource Utilization

By controlling prefill memory usage, more memory remains available for:
  • Decode requests: Maintains responsiveness
  • KV cache: Supports more concurrent requests
  • Batch processing: Higher throughput

Memory Estimation

The prefill adder estimates memory requirements before processing:
def _try_allocate_one(self, req: PendingReq) -> Tuple[BaseCacheHandle, int] | None:
    handle = self.cache_manager.match_req(req).cuda_handle
    cached_len = handle.cached_len
    extend_len = req.input_len - cached_len
    estimated_len = extend_len + req.output_len
    
    # Check if we have enough memory (including reserved space for decode)
    if estimated_len + self.reserved_size > self.cache_manager.available_size:
        return None
    
    self.cache_manager.lock(handle)
    # Double-check after locking
    if estimated_len + self.reserved_size > self.cache_manager.available_size:
        return self.cache_manager.unlock(handle)
    
    table_idx = self.table_manager.allocate()
    return handle, table_idx
The reserved_size accounts for in-flight decode requests to prevent starvation:
adder = PrefillAdder(
    token_budget=prefill_budget,
    reserved_size=self.decode_manager.inflight_tokens,  # Reserve for decode
    cache_manager=self.cache_manager,
    table_manager=self.table_manager,
)
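The admission condition reduces to a simple inequality. A standalone sketch follows; the parameter names mirror the snippet above, but this is an illustration under assumed numbers, not the real cache manager API:

```python
# Sketch of the admission check: a prefill request is admitted only if
# its estimated footprint (tokens to extend + tokens to generate) fits
# alongside the space reserved for in-flight decode requests.
def can_admit(extend_len: int, output_len: int,
              reserved: int, available: int) -> bool:
    estimated = extend_len + output_len
    return estimated + reserved <= available

# Hypothetical pool: 100K token slots free, 20K reserved for decode.
print(can_admit(75_000, 1_000, reserved=20_000, available=100_000))  # → True
print(can_admit(85_000, 1_000, reserved=20_000, available=100_000))  # → False
```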

Integration with Radix Cache

Chunked Prefill works seamlessly with Radix Cache:
  1. First chunk: Matches prefix in cache, only processes uncached portion
  2. Subsequent chunks: Continue from cached state
  3. After completion: Full sequence is cached for future reuse
# After each chunk is processed (for non-ChunkedReq)
if batch.is_prefill:
    self.cache_manager.cache_req(req, finished=False)
This means even partially processed requests can benefit from and contribute to the cache.
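One practical consequence: a prefix hit in the radix cache shrinks the number of chunks a request needs. A sketch, assuming an 8K chunk budget and a hypothetical 12K-token cached prefix:

```python
# Sketch: a radix-cache prefix hit reduces the uncached portion of the
# prompt, and therefore the number of prefill chunks required.
import math

def chunks_needed(input_len: int, cached_len: int, budget: int) -> int:
    return math.ceil((input_len - cached_len) / budget)

print(chunks_needed(40_000, 0, 8_192))       # → 5 chunks on a cold cache
print(chunks_needed(40_000, 12_288, 8_192))  # → 4 chunks after a prefix hit
```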

Performance Considerations

Trade-offs

Aspect               Large Chunks                Small Chunks
Memory usage         Higher peak                 Lower peak
Prefill speed        Faster (fewer iterations)   Slower (more iterations)
OOM risk             Higher                      Lower
Decode starvation    Higher                      Lower

Benchmarking

For performance testing without chunked prefill:
# Set very large max-prefill-length to effectively disable chunking
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 100000
Compare throughput and latency with different --max-prefill-length values to find the optimal setting for your workload.

Example: Processing a Long Document

# Request with 40K token document
request = {
    "messages": [{"role": "user", "content": long_document}],
    "max_tokens": 1000
}

# System behavior with max-prefill-length=8192:
# Iteration 1: Process tokens 0-8192 (chunk 1)
# Iteration 2: Process tokens 8192-16384 (chunk 2)
# Iteration 3: Process tokens 16384-24576 (chunk 3)
# Iteration 4: Process tokens 24576-32768 (chunk 4)
# Iteration 5: Process tokens 32768-40000 (chunk 5)
# Iteration 6+: Decode phase, generate 1000 tokens
Each chunk is processed in a separate forward pass, keeping memory usage stable.

Radix Cache

Learn how KV cache reuse works with chunked prefill

Architecture

Understand the scheduler’s role in chunked prefill

Overlap Scheduling

See how chunking interacts with scheduling optimizations
