Continuous batching is a core scheduling technique in SGLang that allows requests to dynamically enter and exit batches without waiting for the entire batch to complete. This dramatically improves GPU utilization and reduces latency compared to traditional static batching.
The Problem with Static Batching
Traditional inference systems use static batching:
```python
# Static batching - ALL requests must complete together
batch = [req1, req2, req3, req4]  # All start together
while not all_finished(batch):
    batch = forward(batch)  # Process the whole batch
# Only after ALL requests finish can a new batch start
```
Problems:
- Head-of-line blocking: Fast requests wait for slow ones
- GPU underutilization: Batch shrinks as requests finish
- Increased latency: Requests queue while waiting for batch slots
Continuous Batching Solution
SGLang implements continuous batching:
```python
# Continuous batching - requests join/leave dynamically
running_batch = [req1, req2, req3, req4]
while True:
    # Remove finished requests
    running_batch = [r for r in running_batch if not r.finished()]
    # Add new requests if space is available
    if has_available_memory():
        new_reqs = get_new_requests()
        running_batch.extend(new_reqs)
    # Forward with the current batch
    if running_batch:
        running_batch = forward(running_batch)
```
Benefits:
- No head-of-line blocking: Requests finish independently
- Higher GPU utilization: Batch stays fuller longer
- Lower latency: New requests start immediately
Continuous batching is sometimes called “iteration-level scheduling” or “dynamic batching”.
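The gap between the two approaches can be illustrated with a toy step-count simulation (a sketch with made-up numbers, not SGLang code; assume each request needs a fixed number of forward steps and the GPU fits four requests per step):

```python
# Toy comparison: static vs. continuous batching, measured in forward steps.
# Each job value is the number of decode steps that request needs.
def static_steps(jobs, batch_size=4):
    steps = 0
    for i in range(0, len(jobs), batch_size):
        # The whole batch waits for its slowest member (head-of-line blocking)
        steps += max(jobs[i:i + batch_size])
    return steps

def continuous_steps(jobs, batch_size=4):
    running, waiting, steps = [], list(jobs), 0
    while running or waiting:
        # Admit waiting requests as soon as slots free up
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One forward step: each job advances; finished jobs leave immediately
        running = [j - 1 for j in running if j > 1]
        steps += 1
    return steps

jobs = [10, 1, 1, 1, 10, 1, 1, 1]
print(static_steps(jobs), continuous_steps(jobs))  # 20 11
```

With two long jobs mixed among short ones, continuous batching nearly halves the total step count because short requests no longer wait for the long ones.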
Implementation in SGLang
Scheduler Event Loop
The scheduler runs a continuous loop in python/sglang/srt/managers/scheduler.py:
```python
@DynamicGradMode()
def event_loop_normal(self):
    """A normal scheduler loop."""
    while True:
        # 1. Receive new requests
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)
        if self._engine_paused:
            continue

        # 2. Get the next batch to run
        batch = self.get_next_batch_to_run()
        self.cur_batch = batch

        # 3. Launch the current batch
        if batch:
            result = self.run_batch(batch)
            self.process_batch_result(batch, result)
        else:
            # Idle - do self-check
            self.self_check_during_idle()

        # 4. Update last_batch
        self.last_batch = batch
```
Reference: python/sglang/srt/managers/scheduler.py:1110-1135
Batch Composition
Every iteration, the scheduler rebuilds the batch:
```python
def get_next_batch_to_run(self):
    """Get the next batch of requests to run."""
    # Continue processing the running batch, if any
    if not self.running_batch.is_empty():
        self.update_running_batch()

    # Try to add new requests from the waiting queue
    num_new_requests = self.add_new_requests_from_waiting_queue()

    # Return the batch for execution
    if self.running_batch.is_empty():
        return None
    return self.prepare_batch_for_execution(self.running_batch)
```
The key data structures:
```python
class Scheduler:
    def init_running_status(self):
        # Waiting queue for incoming requests
        self.waiting_queue: List[Req] = []
        # The running decoding batch for continuous batching
        self.running_batch: ScheduleBatch = ScheduleBatch(
            reqs=[], batch_is_full=False
        )
        # The current forward batch
        self.cur_batch: Optional[ScheduleBatch] = None
        # The last forward batch
        self.last_batch: Optional[ScheduleBatch] = None
```
Reference: python/sglang/srt/managers/scheduler.py:747-761
Request Phases
Requests go through distinct phases:
1. Prefill (Extend) Phase
Purpose: Process input tokens and generate KV cache
```python
class ForwardMode(Enum):
    EXTEND = auto()  # Prefill phase
    DECODE = auto()  # Decode phase

if batch.forward_mode == ForwardMode.EXTEND:
    # Process multiple input tokens per request
    # Generate KV cache for the input sequence
    # High computation, high memory allocation
    ...
```
Characteristics:
- Processes all input tokens in parallel
- Allocates KV cache memory
- Compute-intensive (matrix multiplications)
- One-time cost per request
2. Decode Phase
Purpose: Generate output tokens one at a time
```python
if batch.forward_mode == ForwardMode.DECODE:
    # Process one token per request
    # Append to the existing KV cache
    # Lower computation, incremental memory
    ...
```
Characteristics:
- Generates one token at a time
- Incrementally extends KV cache
- Memory-bandwidth-bound
- Repeated until stopping condition
Decode is typically memory-bandwidth-bound rather than compute-bound, so batching helps amortize memory access costs.
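A back-of-the-envelope arithmetic-intensity estimate shows why (the model size and dtype below are assumed for illustration, not taken from SGLang):

```python
# Rough arithmetic-intensity sketch for decode, assuming a 7B-parameter
# model with fp16 weights. Each decode step reads every weight once,
# so FLOPs grow with batch size while bytes read stay roughly constant.
PARAMS = 7e9
BYTES_PER_WEIGHT = 2
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per generated token

def arithmetic_intensity(batch_size: int) -> float:
    flops = FLOPS_PER_TOKEN * batch_size
    bytes_read = PARAMS * BYTES_PER_WEIGHT  # weight traffic dominates
    return flops / bytes_read

print(arithmetic_intensity(1))   # 1.0 FLOP/byte - deeply memory-bound
print(arithmetic_intensity(64))  # 64.0 - much better bandwidth amortization
```

Modern GPUs need on the order of hundreds of FLOPs per byte to saturate their compute units, which is why larger decode batches translate almost directly into higher throughput.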
Batch Addition Logic
The scheduler intelligently adds requests to the batch:
```python
class PrefillAdder:
    def add_req(
        self,
        prefix_len: int,
        extend_num_tokens: int,
        max_new_tokens: int,
    ) -> AddReqResult:
        """Check whether a request can be added to the batch."""
        # Check if the batch is already full
        if self.batch_is_full:
            return AddReqResult.NO_SPACE

        # Check the total token budget
        if extend_num_tokens > self.rem_total_tokens:
            return AddReqResult.NO_TOKEN_BUDGET

        # Check the input token budget
        if self.rem_input_tokens < extend_num_tokens:
            return AddReqResult.NO_INPUT_BUDGET

        # Allocate KV cache slots
        indices = self.alloc_req_slots(extend_num_tokens)
        if indices is None:
            return AddReqResult.OUT_OF_MEMORY

        # Successfully added
        self.update_budgets(extend_num_tokens, max_new_tokens)
        return AddReqResult.ACCEPTED
```
Reference: python/sglang/srt/managers/schedule_policy.py
Resource Constraints
Requests are admitted based on multiple constraints:
- Token budget: Total tokens in batch
- Memory budget: Available KV cache slots
- Request budget: Maximum concurrent requests
- Batch size: Configured limits
```python
# Check token capacity
if total_tokens + new_request_tokens > self.max_total_num_tokens:
    return False  # Cannot add

# Check request capacity
if num_requests >= self.max_running_requests:
    return False  # Batch full

# Check memory availability
if available_kv_cache < required_kv_cache:
    return False  # Out of memory
```
Request Completion
Requests complete independently:
```python
def process_batch_result(self, batch, result):
    """Process the output of a forward batch."""
    for i, req in enumerate(batch.reqs):
        # Get the generated token
        next_token_id = result.next_token_ids[i]

        # Check stopping conditions
        if self.check_stop_condition(req, next_token_id):
            req.to_finish = FINISH_MATCHED_TOKEN(next_token_id)
        elif len(req.output_ids) >= req.sampling_params.max_new_tokens:
            req.to_finish = FINISH_LENGTH(len(req.output_ids))

        # Remove finished requests from the batch
        if req.finished():
            self.running_batch.remove(req)
            self.cache_finished_req(req)
            self.send_response(req)
```
Reference: python/sglang/srt/managers/scheduler.py
Each request completes as soon as it reaches its stopping condition, freeing up resources for new requests immediately.
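The two stopping conditions above can be sketched as a standalone check (the function and field names here are illustrative, not SGLang's actual API):

```python
# Minimal stop-condition sketch: a request finishes on EOS or on
# reaching its output-length budget, independently of its batch mates.
def check_stop(output_ids, next_token_id, eos_token_id, max_new_tokens):
    if next_token_id == eos_token_id:
        return "eos"
    if len(output_ids) + 1 >= max_new_tokens:
        return "length"
    return None  # keep decoding

print(check_stop([5, 7], 2, eos_token_id=2, max_new_tokens=128))            # eos
print(check_stop(list(range(127)), 9, eos_token_id=2, max_new_tokens=128))  # length
```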
Scheduling Policies
SGLang supports multiple policies for choosing which requests to add:
FCFS (First-Come-First-Served)
Simple: Process requests in arrival order
```python
if self.policy == CacheAgnosticPolicy.FCFS:
    # waiting_queue is already in arrival order
    pass
```
Best for: Fair resource allocation, predictable latency
LPM (Longest Prefix Match)
Smart: Prioritize requests with cached prefixes
```python
if policy == CacheAwarePolicy.LPM:
    # Sort by cached prefix length (longest first)
    waiting_queue.sort(key=lambda r: -len(r.prefix_indices))
```
Best for: Maximizing cache hits, RAG applications
Reference: python/sglang/srt/managers/schedule_policy.py:242-253
LOF (Longest Output First)
Throughput-focused: Schedule long jobs first
```python
if policy == CacheAgnosticPolicy.LOF:
    waiting_queue.sort(key=lambda x: -x.sampling_params.max_new_tokens)
```
Best for: Maximizing throughput, batch jobs
Priority Scheduling
QoS-aware: Honor request priorities
```python
if self.enable_priority_scheduling:
    waiting_queue.sort(
        key=lambda x: (x.priority * priority_sign, x.arrival_time)
    )
```
Best for: Multi-tenant systems, SLA requirements
For most workloads, LPM provides the best balance of throughput and latency when prefix caching is enabled.
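The sort keys above can be compared on a toy waiting queue (the Req fields below are illustrative stand-ins for SGLang's request attributes):

```python
from dataclasses import dataclass

@dataclass
class Req:
    name: str
    prefix_len: int      # length of the cached prefix (drives LPM)
    max_new_tokens: int  # requested output budget (drives LOF)

queue = [
    Req("a", prefix_len=0, max_new_tokens=512),
    Req("b", prefix_len=900, max_new_tokens=16),
    Req("c", prefix_len=120, max_new_tokens=2048),
]

# LPM: longest cached prefix first
lpm = sorted(queue, key=lambda r: -r.prefix_len)
# LOF: largest output budget first
lof = sorted(queue, key=lambda r: -r.max_new_tokens)

print([r.name for r in lpm])  # ['b', 'c', 'a']
print([r.name for r in lof])  # ['c', 'a', 'b']
```

Note that the same queue is ordered very differently by each policy, which is why the right choice depends on whether cache reuse or throughput matters more for the workload.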
Advanced Features
Chunked Prefill
Large prefill requests can be split into chunks:
```python
# Without chunking: prefill the entire 8K tokens at once
request = "[8000 tokens]" + " Generate response:"
# Blocks other requests for ~2 seconds

# With chunking: split into 4 chunks of up to 2K tokens
# Chunk 1: tokens[0:2048]    - 0.5s
# Chunk 2: tokens[2048:4096] - 0.5s (other requests can run)
# Chunk 3: tokens[4096:6144] - 0.5s (other requests can run)
# Chunk 4: tokens[6144:8000] - 0.5s (other requests can run)
```
Configuration:
```shell
--chunked-prefill-size 2048  # Max tokens per prefill chunk
--enable-mixed-chunk         # Mix prefill and decode in the same batch
```
Benefits:
- Reduces TTFT (Time to First Token) spikes
- Improves fairness between long and short requests
- Better interleaving of prefill and decode
Reference: python/sglang/srt/managers/scheduler.py:763-787
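The chunk boundaries from the example above can be computed in a few lines (a sketch; the function name is made up):

```python
# Split a prefill of num_tokens into [start, end) chunks of at most chunk_size.
def chunk_ranges(num_tokens: int, chunk_size: int = 2048):
    return [(s, min(s + chunk_size, num_tokens))
            for s in range(0, num_tokens, chunk_size)]

print(chunk_ranges(8000))  # [(0, 2048), (2048, 4096), (4096, 6144), (6144, 8000)]
```

Between any two chunks, the scheduler can run decode steps for other requests, which is what keeps their inter-token latency stable.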
Overlapped Scheduling
Overlap CPU processing with GPU execution:
```python
@DynamicGradMode()
def event_loop_overlap(self):
    """Overlap CPU processing with GPU computation."""
    self.result_queue: Deque = deque()
    while True:
        # Receive new requests (CPU)
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)

        # Get the next batch (CPU)
        batch = self.get_next_batch_to_run()

        # Launch the current batch (GPU)
        if batch:
            batch_result = self.run_batch(batch)
            self.result_queue.append((batch.copy(), batch_result))

        # Process the LAST batch's result (CPU) while the GPU runs the current batch
        if self.result_queue:
            tmp_batch, tmp_result = self.result_queue.popleft()
            self.process_batch_result(tmp_batch, tmp_result)
```
Reference: python/sglang/srt/managers/scheduler.py:1137-1188
Overlapped scheduling can improve throughput by 10-20% by hiding CPU overhead behind GPU computation.
Preemption
High-priority requests can preempt low-priority ones:
```python
if self.try_preemption and memory_pressure_high:
    # Find low-priority requests to evict
    preempt_candidates = [
        r for r in self.running_batch.reqs
        if r.priority < threshold
    ]
    # Save state and remove from the batch
    for req in preempt_candidates:
        self.save_request_state(req)
        self.running_batch.remove(req)
        self.waiting_queue.insert(0, req)  # Re-queue
```
Configuration:
```shell
--enable-priority-scheduling
--priority-scheduling-preemption-threshold 0.5
```
Memory Estimation
The scheduler estimates future memory needs:
```python
def estimate_memory_usage(self, req):
    """Estimate memory for the request's full lifetime."""
    # Tokens already materialized
    current_tokens = len(req.prefix_indices) + len(req.output_ids)

    # Estimated future tokens (with clipping)
    estimated_new = min(
        req.sampling_params.max_new_tokens,
        CLIP_MAX_NEW_TOKENS,  # Prevent over-reservation
    )

    # Total estimate
    total_estimated = current_tokens + estimated_new
    return total_estimated * self.kv_cache_bytes_per_token
```
The scheduler uses conservative estimation to avoid OOM, controlled by --schedule-conservativeness (default 1.0).
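Plugging in concrete numbers makes the reservation tangible (the model geometry below is an assumed Llama-7B-like shape with grouped-query attention, not read from SGLang):

```python
# KV bytes per token: 2 (K and V) * layers * kv_heads * head_dim * fp16 bytes
kv_bytes_per_token = 2 * 32 * 8 * 128 * 2  # 131072 bytes = 128 KiB

prefix_tokens = 1000   # already-processed input
output_tokens = 200    # generated so far
max_new_tokens = 4096  # user-requested budget
CLIP_MAX_NEW_TOKENS = 512

estimated_tokens = prefix_tokens + output_tokens + min(max_new_tokens,
                                                       CLIP_MAX_NEW_TOKENS)
estimated_bytes = estimated_tokens * kv_bytes_per_token
print(estimated_bytes / 1e6)  # ~224.4 MB reserved for this one request
```

Clipping max_new_tokens keeps a single request with a huge output budget from reserving gigabytes of KV cache it may never use.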
Token Ratio Tuning
Balance between prefill and decode:
```python
# Adaptive new-token ratio
self.new_token_ratio = self.init_new_token_ratio  # e.g., 0.4
# Allows up to 40% of capacity for new prefill tokens;
# the remaining 60% is reserved for ongoing decode
max_prefill_tokens = self.max_total_num_tokens * self.new_token_ratio
```
Configuration:
```shell
# Set via environment variables
export SGLANG_INIT_NEW_TOKEN_RATIO=0.4
export SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR=0.5  # Min ratio = 0.4 * 0.5 = 0.2
```
Batch Size Limits
Control batch size:
```shell
--max-running-requests 256    # Maximum concurrent requests
--max-total-num-tokens 16384  # Total token capacity
--max-prefill-tokens 4096     # Max tokens in the prefill phase
```
Monitoring and Metrics
Key metrics to track:
```python
# Throughput
tokens_per_second = total_output_tokens / wall_clock_time
requests_per_second = total_requests / wall_clock_time

# Latency
ttft = time_to_first_token    # Prefill latency
tpot = time_per_output_token  # Decode latency
e2e_latency = request_completion_time - request_arrival_time

# Utilization
batch_occupancy = len(running_batch.reqs) / max_running_requests
memory_utilization = used_kv_cache / total_kv_cache
queue_depth = len(waiting_queue)
```
Monitor queue depth closely - sustained queue growth indicates capacity issues.
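The latency metrics above can be derived from three per-request timestamps (a sketch; the function name is illustrative):

```python
# Derive TTFT, TPOT, and end-to-end latency from request timestamps (seconds).
def latency_metrics(arrival, first_token, completion, num_output_tokens):
    ttft = first_token - arrival
    # TPOT averages over the gaps between consecutive output tokens
    tpot = (completion - first_token) / max(num_output_tokens - 1, 1)
    e2e = completion - arrival
    return ttft, tpot, e2e

print(latency_metrics(0.0, 0.5, 10.5, 101))  # (0.5, 0.1, 10.5)
```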
Best Practices
1. Right-size batch limits: balance latency and throughput
   - Smaller batches: lower latency, lower throughput
   - Larger batches: higher latency, higher throughput
2. Use an appropriate scheduling policy:
   - FCFS for fairness
   - LPM for cache-heavy workloads
   - Priority for multi-tenant systems
3. Enable chunked prefill for mixed workloads:
   - Prevents long prefills from blocking short requests
   - Set the chunk size to ~2048 tokens
4. Configure memory conservatively:
   - Leave 10-20% headroom for scheduling flexibility
   - Avoid OOM, which degrades performance severely
5. Monitor and tune:
   - Watch queue depth and batch utilization
   - Adjust token ratios based on workload
   - Profile to identify bottlenecks
Common Issues
Request Starvation
Symptom: Some requests wait very long in queue
Solution:
- Use FCFS or priority scheduling
- Enable preemption for high-priority requests
- Reduce max_new_tokens limits
Low GPU Utilization
Symptom: GPU not fully utilized
Solution:
- Increase max_running_requests
- Increase max_total_num_tokens
- Enable overlapped scheduling
- Check for CPU bottlenecks
High Memory Pressure
Symptom: Frequent eviction, OOM errors
Solution:
- Increase mem_fraction_static
- Reduce max_running_requests
- Enable KV cache quantization
- Use more aggressive eviction policy