Hide CPU overhead by overlapping scheduling with GPU computation
Overlap Scheduling is a performance optimization technique that overlaps CPU scheduling overhead with GPU computation, significantly improving overall system throughput and GPU utilization.
Proposed in the NanoFlow paper, overlap scheduling addresses a key bottleneck in LLM serving: the CPU overhead of scheduling, memory management, and batch preparation can leave the GPU idle between forward passes. Mini-SGLang employs overlap scheduling by default to maximize GPU utilization and minimize latency.

Illustration of Overlap Scheduling from the LMSYS Blog.
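To see why overlapping pays off, consider a toy latency model (the numbers and helper names below are illustrative, not taken from NanoFlow or Mini-SGLang): each batch needs some CPU scheduling time and some GPU compute time. Run sequentially, every batch pays both costs; run as a two-stage pipeline, the CPU work for batch i+1 hides under the GPU time for batch i.

```python
def sequential_time(n: int, cpu: float, gpu: float) -> float:
    """Every batch pays the full CPU scheduling cost before its GPU pass."""
    return n * (cpu + gpu)

def overlapped_time(n: int, cpu: float, gpu: float) -> float:
    """Two-stage pipeline: after filling the pipeline, a batch
    completes every max(cpu, gpu) time units."""
    return cpu + gpu + (n - 1) * max(cpu, gpu)

# With 1 ms of CPU overhead hidden under 4 ms GPU passes,
# total time approaches pure GPU time:
print(sequential_time(100, 1.0, 4.0))  # 500.0
print(overlapped_time(100, 1.0, 4.0))  # 401.0
```

When GPU time dominates (gpu >= cpu), the CPU overhead is almost entirely hidden; this is the regime overlap scheduling targets.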
The main event loop implements the overlap pattern:
```python
def overlap_loop(self, last_data: ForwardData | None) -> ForwardData | None:
    """
    The main loop of overlapping scheduling and execution.

    It will overlap the execution of the current batch with processing of
    the last batch's results, which can effectively hide CPU latency and
    improve GPU utilization.
    """
    # Step 1: Receive new messages (non-blocking if we have work to do)
    blocking = not (
        last_data is not None
        or self.prefill_manager.runnable
        or self.decode_manager.runnable
    )
    for msg in self.receive_msg(blocking=blocking):
        self._process_one_msg(msg)

    # Step 2: Schedule next batch on main stream
    forward_input = self._schedule_next_batch()

    ongoing_data = None
    if forward_input is not None:
        # Step 3: Launch GPU computation on engine stream
        with self.engine_stream_ctx:
            self.engine.stream.wait_stream(self.stream)
            ongoing_data = (forward_input, self._forward(forward_input))

    # Step 4: Process last batch's results on main stream (overlapped!)
    self._process_last_data(last_data)
    return ongoing_data
```
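The returned ForwardData is fed back in on the next iteration, so the output for batch i is processed while batch i+1 occupies the GPU. A toy stand-in (ToyScheduler is hypothetical, not Mini-SGLang code) makes this one-iteration delay visible:

```python
class ToyScheduler:
    """Toy stand-in for the scheduler (hypothetical) showing
    overlap_loop's hand-off contract: the output for batch i is only
    consumed on iteration i+1, which in the real system is when batch
    i+1 is already running on the GPU."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.processed = []

    def overlap_loop(self, last_data):
        # Schedule (and "launch") the next batch first...
        forward_input = self.batches.pop(0) if self.batches else None
        ongoing = None
        if forward_input is not None:
            ongoing = (forward_input, f"output-{forward_input}")
        # ...then process the previous batch's results, which the real
        # loop does while the GPU is busy with the new batch.
        if last_data is not None:
            self.processed.append(last_data[1])
        return ongoing

sched = ToyScheduler([1, 2, 3])
last = None
while last is not None or sched.batches:
    last = sched.overlap_loop(last)
print(sched.processed)  # ['output-1', 'output-2', 'output-3']
```

Note the extra final iteration: even after the last batch is scheduled, one more call is needed to drain its results, mirroring how the real loop passes `last_data` forward.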
The stream hand-off at the heart of the loop, annotated:

```python
# Schedule batch on main stream
forward_input = self._schedule_next_batch()  # On self.stream

if forward_input is not None:
    with self.engine_stream_ctx:  # Switch to engine stream
        # Wait for main stream to finish scheduling
        self.engine.stream.wait_stream(self.stream)
        # Now safe to execute on engine stream
        ongoing_data = (forward_input, self._forward(forward_input))

# Back on main stream, process last batch (overlapped with GPU)
self._process_last_data(last_data)
```
The copy_done event ensures GPU-to-CPU result copies have completed before the CPU reads them:
```python
# In the forward pass (on the engine stream)
forward_output = self.engine.forward_batch(batch, sample_args)
# This includes creating and recording an event after the async copy:
copy_done = torch.cuda.Event()
copy_done.record()

# In _process_last_data (on the main stream)
copy_done.synchronize()  # Wait for the CPU tensor to be ready
next_token = next_tokens_cpu[i]  # Now safe to access
```
Stream synchronization is critical for correctness. The wait_stream call ensures all scheduling work completes before GPU execution begins, while copy_done.synchronize() ensures CPU can safely read GPU results.
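The same ordering contract can be sketched without a GPU (a CPU-only analogy using the standard library; all names here are illustrative): a worker thread plays the engine stream's async device-to-host copy, and threading.Event stands in for torch.cuda.Event, with set() playing the role of the recorded event completing and wait() that of copy_done.synchronize().

```python
import threading
import time

results = {}
copy_done = threading.Event()

def async_copy():
    # Plays the engine stream: produce the "device" result, copy it
    # into host-visible memory, then signal completion (analogous to
    # the recorded CUDA event reaching the front of the stream).
    time.sleep(0.01)  # pretend GPU compute + D2H copy
    results["next_tokens_cpu"] = [42]
    copy_done.set()

threading.Thread(target=async_copy).start()

# Main "scheduler" thread: free to do other CPU work here, and blocks
# only at the point the result is actually needed.
copy_done.wait()
print(results["next_tokens_cpu"][0])  # 42
```

The key property in both worlds is the same: the consumer blocks as late as possible, so the wait overlaps with useful work instead of serializing it.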
For debugging or comparison purposes, overlap scheduling can be disabled:
```shell
# Set environment variable
MINISGL_DISABLE_OVERLAP_SCHEDULING=1 python -m minisgl --model "Qwen/Qwen3-0.6B"
```
With overlap disabled, the system uses the normal loop:
```python
def normal_loop(self) -> None:
    blocking = not (self.prefill_manager.runnable or self.decode_manager.runnable)
    for msg in self.receive_msg(blocking=blocking):
        self._process_one_msg(msg)

    forward_input = self._schedule_next_batch()
    ongoing_data = None
    if forward_input is not None:
        ongoing_data = (forward_input, self._forward(forward_input))

    # Process immediately, no overlap
    self._process_last_data(ongoing_data)
```
This processes each batch completely before moving to the next, running everything sequentially.
Disabling overlap scheduling can be useful for debugging race conditions or isolating performance issues, but it will reduce throughput in production.
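The wiring for the toggle could look like the sketch below (the environment-variable name comes from the docs above; the dispatch function and its shape are an assumption, not the actual Mini-SGLang source):

```python
import os

# Hypothetical dispatch: choose the loop once at startup based on the
# documented environment variable.
DISABLE_OVERLAP = os.environ.get("MINISGL_DISABLE_OVERLAP_SCHEDULING", "0") == "1"

def run_event_loop(scheduler) -> None:
    """Sketch: run the sequential loop when overlap is disabled,
    otherwise chain ForwardData through overlap_loop."""
    if DISABLE_OVERLAP:
        while True:
            scheduler.normal_loop()
    else:
        last_data = None
        while True:
            last_data = scheduler.overlap_loop(last_data)
```

Doing the check once at startup (rather than per iteration) keeps the hot loop branch-free, which matters when each iteration is on the order of a single forward pass.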