Continuous batching is a core scheduling technique in SGLang that allows requests to dynamically enter and exit batches without waiting for the entire batch to complete. This dramatically improves GPU utilization and reduces latency compared to traditional static batching.
The Problem with Static Batching
Traditional inference systems use static batching:
```python
# Static batching - ALL requests must complete together
batch = [req1, req2, req3, req4]  # All start together
while not all_finished(batch):
    batch = forward(batch)  # Process the whole batch
# Only after ALL requests finish can a new batch start
```
Problems:
- Head-of-line blocking: Fast requests wait for slow ones
- GPU underutilization: Batch shrinks as requests finish
- Increased latency: Requests queue while waiting for batch slots
Continuous Batching Solution
SGLang implements continuous batching:
```python
# Continuous batching - requests join/leave dynamically
running_batch = [req1, req2, req3, req4]
while True:
    # Remove finished requests
    running_batch = [r for r in running_batch if not r.finished()]
    # Add new requests if space is available
    if has_available_memory():
        new_reqs = get_new_requests()
        running_batch.extend(new_reqs)
    # Forward with the current batch
    if running_batch:
        running_batch = forward(running_batch)
```
Benefits:
- No head-of-line blocking: Requests finish independently
- Higher GPU utilization: Batch stays fuller longer
- Lower latency: New requests start immediately
Continuous batching is sometimes called “iteration-level scheduling” or “dynamic batching”.
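The gap between the two approaches can be illustrated with a toy step-count simulation (a sketch with made-up numbers, not SGLang code; assume each request needs a fixed number of forward steps and the GPU fits four requests per step):

```python
# Toy comparison: static vs. continuous batching, measured in forward steps.
# Each job value is the number of decode steps that request needs.
def static_steps(jobs, batch_size=4):
    steps = 0
    for i in range(0, len(jobs), batch_size):
        # The whole batch waits for its slowest member (head-of-line blocking)
        steps += max(jobs[i:i + batch_size])
    return steps

def continuous_steps(jobs, batch_size=4):
    running, waiting, steps = [], list(jobs), 0
    while running or waiting:
        # Admit waiting requests as soon as slots free up
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One forward step: each job advances; finished jobs leave immediately
        running = [j - 1 for j in running if j > 1]
        steps += 1
    return steps

jobs = [10, 1, 1, 1, 10, 1, 1, 1]
print(static_steps(jobs), continuous_steps(jobs))  # 20 11
```

With two long jobs mixed among short ones, continuous batching nearly halves the total step count because short requests no longer wait for the long ones.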
Implementation in SGLang
Scheduler Event Loop
The scheduler runs a continuous loop in python/sglang/srt/managers/scheduler.py:
```python
@DynamicGradMode()
def event_loop_normal(self):
    """A normal scheduler loop."""
    while True:
        # 1. Receive new requests
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)
        if self._engine_paused:
            continue

        # 2. Get the next batch to run
        batch = self.get_next_batch_to_run()
        self.cur_batch = batch

        # 3. Launch the current batch
        if batch:
            result = self.run_batch(batch)
            self.process_batch_result(batch, result)
        else:
            # Idle - do self-check
            self.self_check_during_idle()

        # 4. Update last_batch
        self.last_batch = batch
```
Reference: python/sglang/srt/managers/scheduler.py:1110-1135
Batch Composition
Every iteration, the scheduler rebuilds the batch:
```python
def get_next_batch_to_run(self):
    """Get the next batch of requests to run."""
    # Continue processing the running batch, if any
    if not self.running_batch.is_empty():
        self.update_running_batch()

    # Try to add new requests from the waiting queue
    num_new_requests = self.add_new_requests_from_waiting_queue()

    # Return the batch for execution
    if self.running_batch.is_empty():
        return None
    return self.prepare_batch_for_execution(self.running_batch)
```
The key data structures:
```python
class Scheduler:
    def init_running_status(self):
        # Waiting queue for incoming requests
        self.waiting_queue: List[Req] = []
        # The running decoding batch for continuous batching
        self.running_batch: ScheduleBatch = ScheduleBatch(
            reqs=[], batch_is_full=False
        )
        # The current forward batch
        self.cur_batch: Optional[ScheduleBatch] = None
        # The last forward batch
        self.last_batch: Optional[ScheduleBatch] = None
```
Reference: python/sglang/srt/managers/scheduler.py:747-761
Request Phases
Requests go through distinct phases:
1. Prefill (Extend) Phase
Purpose: Process input tokens and generate KV cache
```python
class ForwardMode(Enum):
    EXTEND = auto()  # Prefill phase
    DECODE = auto()  # Decode phase

if batch.forward_mode == ForwardMode.EXTEND:
    # Process multiple input tokens per request
    # Generate KV cache for the input sequence
    # High computation, high memory allocation
    ...
```
Characteristics:
- Processes all input tokens in parallel
- Allocates KV cache memory
- Compute-intensive (matrix multiplications)
- One-time cost per request
2. Decode Phase
Purpose: Generate output tokens one at a time
```python
if batch.forward_mode == ForwardMode.DECODE:
    # Process one token per request
    # Append to the existing KV cache
    # Lower computation, incremental memory
    ...
```
Characteristics:
- Generates one token at a time
- Incrementally extends KV cache
- Memory-bandwidth-bound
- Repeated until stopping condition
Decode is typically memory-bandwidth-bound rather than compute-bound, so batching helps amortize memory access costs.
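A back-of-the-envelope arithmetic-intensity estimate shows why (the model size and dtype below are assumed for illustration, not taken from SGLang):

```python
# Rough arithmetic-intensity sketch for decode, assuming a 7B-parameter
# model with fp16 weights. Each decode step reads every weight once,
# so FLOPs grow with batch size while bytes read stay roughly constant.
PARAMS = 7e9
BYTES_PER_WEIGHT = 2
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per generated token

def arithmetic_intensity(batch_size: int) -> float:
    flops = FLOPS_PER_TOKEN * batch_size
    bytes_read = PARAMS * BYTES_PER_WEIGHT  # weight traffic dominates
    return flops / bytes_read

print(arithmetic_intensity(1))   # 1.0 FLOP/byte - deeply memory-bound
print(arithmetic_intensity(64))  # 64.0 - much better bandwidth amortization
```

Modern GPUs need on the order of hundreds of FLOPs per byte to saturate their compute units, which is why larger decode batches translate almost directly into higher throughput.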
Batch Addition Logic
The scheduler intelligently adds requests to the batch:
```python
class PrefillAdder:
    def add_req(
        self,
        prefix_len: int,
        extend_num_tokens: int,
        max_new_tokens: int,
    ) -> AddReqResult:
        """Check whether a request can be added to the batch."""
        # Check if the batch is already full
        if self.batch_is_full:
            return AddReqResult.NO_SPACE

        # Check the total token budget
        if extend_num_tokens > self.rem_total_tokens:
            return AddReqResult.NO_TOKEN_BUDGET

        # Check the input token budget
        if self.rem_input_tokens < extend_num_tokens:
            return AddReqResult.NO_INPUT_BUDGET

        # Allocate KV cache slots
        indices = self.alloc_req_slots(extend_num_tokens)
        if indices is None:
            return AddReqResult.OUT_OF_MEMORY

        # Successfully added
        self.update_budgets(extend_num_tokens, max_new_tokens)
        return AddReqResult.ACCEPTED
```
Reference: python/sglang/srt/managers/schedule_policy.py
Resource Constraints
Requests are admitted based on multiple constraints:
- Token budget: Total tokens in batch
- Memory budget: Available KV cache slots
- Request budget: Maximum concurrent requests
- Batch size: Configured limits
```python
# Check token capacity
if total_tokens + new_request_tokens > self.max_total_num_tokens:
    return False  # Cannot add

# Check request capacity
if num_requests >= self.max_running_requests:
    return False  # Batch full

# Check memory availability
if available_kv_cache < required_kv_cache:
    return False  # Out of memory
```
Request Completion
Requests complete independently:
```python
def process_batch_result(self, batch, result):
    """Process the output of a forward batch."""
    for i, req in enumerate(batch.reqs):
        # Get the generated token
        next_token_id = result.next_token_ids[i]

        # Check stopping conditions
        if self.check_stop_condition(req, next_token_id):
            req.to_finish = FINISH_MATCHED_TOKEN(next_token_id)
        elif len(req.output_ids) >= req.sampling_params.max_new_tokens:
            req.to_finish = FINISH_LENGTH(len(req.output_ids))

        # Remove finished requests from the batch
        if req.finished():
            self.running_batch.remove(req)
            self.cache_finished_req(req)
            self.send_response(req)
```
Reference: python/sglang/srt/managers/scheduler.py
Each request completes as soon as it reaches its stopping condition, freeing up resources for new requests immediately.
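The two stopping conditions above can be sketched as a standalone check (the function and field names here are illustrative, not SGLang's actual API):

```python
# Minimal stop-condition sketch: a request finishes on EOS or on
# reaching its output-length budget, independently of its batch mates.
def check_stop(output_ids, next_token_id, eos_token_id, max_new_tokens):
    if next_token_id == eos_token_id:
        return "eos"
    if len(output_ids) + 1 >= max_new_tokens:
        return "length"
    return None  # keep decoding

print(check_stop([5, 7], 2, eos_token_id=2, max_new_tokens=128))            # eos
print(check_stop(list(range(127)), 9, eos_token_id=2, max_new_tokens=128))  # length
```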
Scheduling Policies
SGLang supports multiple policies for choosing which requests to add:
FCFS (First-Come-First-Served)
Simple: Process requests in arrival order
```python
if self.policy == CacheAgnosticPolicy.FCFS:
    # waiting_queue is already in arrival order
    pass
```
Best for: Fair resource allocation, predictable latency
LPM (Longest Prefix Match)
Smart: Prioritize requests with cached prefixes
```python
if policy == CacheAwarePolicy.LPM:
    # Sort by cached prefix length (longest first)
    waiting_queue.sort(key=lambda r: -len(r.prefix_indices))
```
Best for: Maximizing cache hits, RAG applications
Reference: python/sglang/srt/managers/schedule_policy.py:242-253
LOF (Longest Output First)
Throughput-focused: Schedule long jobs first
```python
if policy == CacheAgnosticPolicy.LOF:
    waiting_queue.sort(key=lambda x: -x.sampling_params.max_new_tokens)
```
Best for: Maximizing throughput, batch jobs
Priority Scheduling
QoS-aware: Honor request priorities
```python
if self.enable_priority_scheduling:
    waiting_queue.sort(
        key=lambda x: (x.priority * priority_sign, x.arrival_time)
    )
```
Best for: Multi-tenant systems, SLA requirements
For most workloads, LPM provides the best balance of throughput and latency when prefix caching is enabled.
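The sort keys above can be compared on a toy waiting queue (the Req fields below are illustrative stand-ins for SGLang's request attributes):

```python
from dataclasses import dataclass

@dataclass
class Req:
    name: str
    prefix_len: int      # length of the cached prefix (drives LPM)
    max_new_tokens: int  # requested output budget (drives LOF)

queue = [
    Req("a", prefix_len=0, max_new_tokens=512),
    Req("b", prefix_len=900, max_new_tokens=16),
    Req("c", prefix_len=120, max_new_tokens=2048),
]

# LPM: longest cached prefix first
lpm = sorted(queue, key=lambda r: -r.prefix_len)
# LOF: largest output budget first
lof = sorted(queue, key=lambda r: -r.max_new_tokens)

print([r.name for r in lpm])  # ['b', 'c', 'a']
print([r.name for r in lof])  # ['c', 'a', 'b']
```

Note that the same queue is ordered very differently by each policy, which is why the right choice depends on whether cache reuse or throughput matters more for the workload.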
Advanced Features
Chunked Prefill
Large prefill requests can be split into chunks:
```python
# Without chunking: prefill the entire 8K tokens at once
request = "[8000 tokens]" + " Generate response:"
# Blocks other requests for ~2 seconds

# With chunking: split into 4 chunks of up to 2K tokens
# Chunk 1: tokens[0:2048]    - 0.5s
# Chunk 2: tokens[2048:4096] - 0.5s (other requests can run)
# Chunk 3: tokens[4096:6144] - 0.5s (other requests can run)
# Chunk 4: tokens[6144:8000] - 0.5s (other requests can run)
```
Configuration:
```shell
--chunked-prefill-size 2048  # Max tokens per prefill chunk
--enable-mixed-chunk         # Mix prefill and decode in the same batch
```
Benefits:
- Reduces TTFT (Time to First Token) spikes
- Improves fairness between long and short requests
- Better interleaving of prefill and decode
Reference: python/sglang/srt/managers/scheduler.py:763-787
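The chunk boundaries from the example above can be computed in a few lines (a sketch; the function name is made up):

```python
# Split a prefill of num_tokens into [start, end) chunks of at most chunk_size.
def chunk_ranges(num_tokens: int, chunk_size: int = 2048):
    return [(s, min(s + chunk_size, num_tokens))
            for s in range(0, num_tokens, chunk_size)]

print(chunk_ranges(8000))  # [(0, 2048), (2048, 4096), (4096, 6144), (6144, 8000)]
```

Between any two chunks, the scheduler can run decode steps for other requests, which is what keeps their inter-token latency stable.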
Overlapped Scheduling
Overlap CPU processing with GPU execution:
```python
@DynamicGradMode()
def event_loop_overlap(self):
    """Overlap CPU processing with GPU computation."""
    self.result_queue: Deque = deque()
    while True:
        # Receive new requests (CPU)
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)

        # Get the next batch (CPU)
        batch = self.get_next_batch_to_run()

        # Launch the current batch (GPU)
        if batch:
            batch_result = self.run_batch(batch)
            self.result_queue.append((batch.copy(), batch_result))

        # Process the LAST batch's result (CPU) while the GPU runs the current batch
        if self.result_queue:
            tmp_batch, tmp_result = self.result_queue.popleft()
            self.process_batch_result(tmp_batch, tmp_result)
```
Reference: python/sglang/srt/managers/scheduler.py:1137-1188
Overlapped scheduling can improve throughput by 10-20% by hiding CPU overhead behind GPU computation.
Preemption
High-priority requests can preempt low-priority ones:
```python
if self.try_preemption and memory_pressure_high:
    # Find low-priority requests to evict
    preempt_candidates = [
        r for r in self.running_batch.reqs
        if r.priority < threshold
    ]
    # Save state and remove from the batch
    for req in preempt_candidates:
        self.save_request_state(req)
        self.running_batch.remove(req)
        self.waiting_queue.insert(0, req)  # Re-queue
```
Configuration:
```shell
--enable-priority-scheduling
--priority-scheduling-preemption-threshold 0.5
```
Memory Estimation
The scheduler estimates future memory needs:
```python
def estimate_memory_usage(self, req):
    """Estimate memory for the request's full lifetime."""
    # Tokens already materialized
    current_tokens = len(req.prefix_indices) + len(req.output_ids)

    # Estimated future tokens (with clipping)
    estimated_new = min(
        req.sampling_params.max_new_tokens,
        CLIP_MAX_NEW_TOKENS,  # Prevent over-reservation
    )

    # Total estimate
    total_estimated = current_tokens + estimated_new
    return total_estimated * self.kv_cache_bytes_per_token
```
The scheduler uses conservative estimation to avoid OOM, controlled by --schedule-conservativeness (default 1.0).
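Plugging in concrete numbers makes the reservation tangible (the model geometry below is an assumed Llama-7B-like shape with grouped-query attention, not read from SGLang):

```python
# KV bytes per token: 2 (K and V) * layers * kv_heads * head_dim * fp16 bytes
kv_bytes_per_token = 2 * 32 * 8 * 128 * 2  # 131072 bytes = 128 KiB

prefix_tokens = 1000   # already-processed input
output_tokens = 200    # generated so far
max_new_tokens = 4096  # user-requested budget
CLIP_MAX_NEW_TOKENS = 512

estimated_tokens = prefix_tokens + output_tokens + min(max_new_tokens,
                                                       CLIP_MAX_NEW_TOKENS)
estimated_bytes = estimated_tokens * kv_bytes_per_token
print(estimated_bytes / 1e6)  # ~224.4 MB reserved for this one request
```

Clipping max_new_tokens keeps a single request with a huge output budget from reserving gigabytes of KV cache it may never use.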
Token Ratio Tuning
Balance between prefill and decode:
```python
# Adaptive new-token ratio
self.new_token_ratio = self.init_new_token_ratio  # e.g., 0.4
# Allows up to 40% of capacity for new prefill tokens;
# the remaining 60% is reserved for ongoing decode
max_prefill_tokens = self.max_total_num_tokens * self.new_token_ratio
```
Configuration:
```shell
# Set via environment variables
export SGLANG_INIT_NEW_TOKEN_RATIO=0.4
export SGLANG_MIN_NEW_TOKEN_RATIO_FACTOR=0.5  # Min ratio = 0.4 * 0.5 = 0.2
```
Batch Size Limits
Control batch size:
```shell
--max-running-requests 256    # Maximum concurrent requests
--max-total-num-tokens 16384  # Total token capacity
--max-prefill-tokens 4096     # Max tokens in the prefill phase
```
Monitoring and Metrics
Key metrics to track:
```python
# Throughput
tokens_per_second = total_output_tokens / wall_clock_time
requests_per_second = total_requests / wall_clock_time

# Latency
ttft = time_to_first_token    # Prefill latency
tpot = time_per_output_token  # Decode latency
e2e_latency = request_completion_time - request_arrival_time

# Utilization
batch_occupancy = len(running_batch.reqs) / max_running_requests
memory_utilization = used_kv_cache / total_kv_cache
queue_depth = len(waiting_queue)
```
Monitor queue depth closely - sustained queue growth indicates capacity issues.
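The latency metrics above can be derived from three per-request timestamps (a sketch; the function name is illustrative):

```python
# Derive TTFT, TPOT, and end-to-end latency from request timestamps (seconds).
def latency_metrics(arrival, first_token, completion, num_output_tokens):
    ttft = first_token - arrival
    # TPOT averages over the gaps between consecutive output tokens
    tpot = (completion - first_token) / max(num_output_tokens - 1, 1)
    e2e = completion - arrival
    return ttft, tpot, e2e

print(latency_metrics(0.0, 0.5, 10.5, 101))  # (0.5, 0.1, 10.5)
```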
Best Practices
1. Right-size batch limits: balance latency and throughput
   - Smaller batches: lower latency, lower throughput
   - Larger batches: higher latency, higher throughput
2. Use an appropriate scheduling policy:
   - FCFS for fairness
   - LPM for cache-heavy workloads
   - Priority for multi-tenant systems
3. Enable chunked prefill for mixed workloads:
   - Prevents long prefills from blocking short requests
   - Set the chunk size to ~2048 tokens
4. Configure memory conservatively:
   - Leave 10-20% headroom for scheduling flexibility
   - Avoid OOM, which degrades performance severely
5. Monitor and tune:
   - Watch queue depth and batch utilization
   - Adjust token ratios based on workload
   - Profile to identify bottlenecks
Common Issues
Request Starvation
Symptom: Some requests wait very long in queue
Solution:
- Use FCFS or priority scheduling
- Enable preemption for high-priority requests
- Reduce max_new_tokens limits
Low GPU Utilization
Symptom: GPU not fully utilized
Solution:
- Increase max_running_requests
- Increase max_total_num_tokens
- Enable overlapped scheduling
- Check for CPU bottlenecks
High Memory Pressure
Symptom: Frequent eviction, OOM errors
Solution:
- Increase mem_fraction_static
- Reduce max_running_requests
- Enable KV cache quantization
- Use more aggressive eviction policy