Chunked Prefill is a technique that splits long prompts into smaller chunks during the prefill phase, significantly reducing peak memory usage and preventing Out-Of-Memory (OOM) errors in long-context serving.
Introduced by Sarathi-Serve, Chunked Prefill is enabled by default in Mini-SGLang. This feature addresses the memory challenges of processing very long input sequences by breaking them into manageable pieces.
Chunked Prefill is particularly important for long-context models where a single prefill batch could consume excessive GPU memory, starving the decode phase and reducing overall throughput.
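The core idea can be sketched in a few lines: instead of feeding the entire prompt into one forward pass, the token ids are split into fixed-size chunks that are processed across several iterations. This is a minimal illustration, not the Mini-SGLang implementation; the helper name `split_into_chunks` is invented for this example.

```python
def split_into_chunks(token_ids: list[int], chunk_size: int) -> list[list[int]]:
    """Split a prompt's token ids into chunks of at most `chunk_size` tokens."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

prompt = list(range(20_000))            # a 20k-token prompt
chunks = split_into_chunks(prompt, 8192)
print([len(c) for c in chunks])         # [8192, 8192, 3616]
```

Each chunk fits within the per-iteration token budget, so peak activation memory is bounded by the chunk size rather than the full prompt length.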
Chunked requests are marked with a special class to prevent premature completion:
```python
class ChunkedReq(Req):
    def append_host(self, next_token: torch.Tensor) -> None:
        raise NotImplementedError("ChunkedReq should not be sampled")

    @property
    def can_decode(self) -> bool:
        return False  # avoid being added to the decode manager
```
This ensures that chunked prefill requests:

- Are not sampled for output tokens
- Stay in the prefill queue until all chunks are processed
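The effect of the `can_decode` guard can be demonstrated with toy stand-ins for the real classes (these simplified definitions are for illustration only, not the actual `Req` API): any request whose prefill is still chunked reports `can_decode == False`, so a decode loop filtering on that property skips it until its final chunk has run.

```python
class Req:
    @property
    def can_decode(self) -> bool:
        return True  # a fully prefilled request may enter the decode phase

class ChunkedReq(Req):
    def append_host(self, next_token) -> None:
        # Sampling a partially prefilled request would be a logic error.
        raise NotImplementedError("ChunkedReq should not be sampled")

    @property
    def can_decode(self) -> bool:
        return False  # still has prefill chunks pending

reqs = [Req(), ChunkedReq(), Req()]
decode_batch = [r for r in reqs if r.can_decode]
print(len(decode_batch))  # 2 -- the chunked request is excluded
```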
The scheduler processes chunks across multiple iterations:
```python
def schedule_next_batch(self, prefill_budget: int) -> Batch | None:
    adder = PrefillAdder(
        token_budget=prefill_budget,
        reserved_size=self.decode_manager.inflight_tokens,
        cache_manager=self.cache_manager,
        table_manager=self.table_manager,
    )
    reqs: List[Req] = []
    chunked_list: List[PendingReq] = []
    for pending_req in self.pending_list:
        if req := adder.try_add_one(pending_req):
            pending_req.chunked_req = None
            if isinstance(req, ChunkedReq):
                # Keep in pending list for next chunk
                pending_req.chunked_req = req
                chunked_list.append(pending_req)
            reqs.append(req)
        else:
            break
    # Chunked requests stay at front of queue
    self.pending_list = chunked_list + self.pending_list[len(reqs):]
    return Batch(reqs=reqs, phase="prefill")
```
Chunked requests are prioritized in the pending list to ensure all chunks of a request are processed before moving to new requests. This prevents head-of-line blocking.
The `--max-prefill-length` value is also accessible via `SchedulerConfig.max_extend_tokens`.
The token_budget in the prefill adder is set to max_extend_tokens and shared across all requests in a batch. This ensures the total number of tokens processed in a single iteration stays within limits.
- Too small: excessive overhead from many small batches
- Too large: memory spikes and starvation of the decode phase (see above)
- Recommended: start with the default (8192) and adjust based on:
  - Available GPU memory
  - Typical sequence lengths
  - Model size and architecture
Setting --max-prefill-length to a very small value (e.g., 128) is not recommended as it may significantly degrade performance due to excessive chunking overhead. Values between 4096-16384 work well for most use cases.
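A typical launch might set the flag explicitly (the entrypoint and model flag below are illustrative and may differ in your Mini-SGLang version; only `--max-prefill-length` is the flag discussed above):

```shell
# Hypothetical launch command -- adjust the entrypoint for your install.
python -m mini_sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-prefill-length 8192
```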
By limiting the maximum number of tokens processed in a single forward pass, chunked prefill prevents memory spikes that could crash the server. Without chunked prefill, the entire prompt must fit into a single prefill batch, so peak memory grows with the full prompt length.
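The memory bound can be made concrete with some illustrative arithmetic (the numbers are examples, not measurements): with a budget of 8192 tokens, a 32k-token prompt is processed in four passes, and the per-pass peak is capped at the budget instead of the prompt length.

```python
import math

prompt_len = 32_768
max_extend_tokens = 8_192

peak_without = prompt_len                       # whole prompt in one forward pass
peak_with = min(prompt_len, max_extend_tokens)  # bounded by the token budget
num_chunks = math.ceil(prompt_len / max_extend_tokens)

print(peak_without, peak_with, num_chunks)  # 32768 8192 4
```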