Scheduler
The scheduler is the core component that manages request batching, memory allocation, and execution orchestration in SGLang.
Overview
Location: python/sglang/srt/managers/scheduler.py
Key Responsibilities:
- Request queueing and prioritization
- Dynamic batch formation
- Memory allocation via token-to-KV pool
- Prefix cache management (RadixAttention)
- Request lifecycle management
Request States
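As a rough illustration of a request lifecycle, the sketch below models states and the transitions allowed between them. The state names and the `advance` helper are hypothetical and chosen for this example; they are not taken from `scheduler.py`.

```python
from enum import Enum, auto


class RequestState(Enum):
    # Hypothetical state names, for illustration only.
    WAITING = auto()   # queued, not yet scheduled
    PREFILL = auto()   # prompt tokens are being processed
    DECODE = auto()    # output tokens are generated step by step
    FINISHED = auto()  # a finish condition was met
    ABORTED = auto()   # cancelled before completion


# Allowed transitions. PREFILL -> PREFILL covers a prompt processed in chunks.
ALLOWED = {
    RequestState.WAITING: {RequestState.PREFILL, RequestState.ABORTED},
    RequestState.PREFILL: {RequestState.PREFILL, RequestState.DECODE, RequestState.ABORTED},
    RequestState.DECODE: {RequestState.DECODE, RequestState.FINISHED, RequestState.ABORTED},
    RequestState.FINISHED: set(),
    RequestState.ABORTED: set(),
}


def advance(current: RequestState, new: RequestState) -> RequestState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Keeping the transition table explicit makes illegal moves, such as resuming a finished request, fail loudly instead of silently corrupting scheduler state.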
A request transitions through several states between arrival and completion.
Scheduling Loop
Main Loop
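A scheduling loop of this kind can be sketched as three repeated steps: admit waiting requests, run one forward step, and retire finished requests. The following is a minimal sketch under stated assumptions, not the loop in `scheduler.py`: `Req`, `fake_forward`, `run_scheduler`, and `max_running` are hypothetical names, and the model is replaced by a stub.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Req:
    rid: str
    max_new_tokens: int
    output: list = field(default_factory=list)


def fake_forward(batch):
    """Stand-in for the model: emit one dummy token per request."""
    return [len(req.output) for req in batch]


def run_scheduler(incoming, max_running=2):
    """Run until all requests finish; return (finished requests, step count)."""
    waiting = deque(incoming)
    running, finished = [], []
    steps = 0
    while waiting or running:
        # 1. Admit waiting requests while the running batch has capacity.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        # 2. Run one forward step for the whole batch.
        tokens = fake_forward(running)
        # 3. Record outputs and retire requests that reached their limit.
        still_running = []
        for req, tok in zip(running, tokens):
            req.output.append(tok)
            if len(req.output) >= req.max_new_tokens:
                finished.append(req)
            else:
                still_running.append(req)
        running = still_running
        steps += 1
    return finished, steps
```

Because admission happens at the top of every iteration, a slot freed by a short request is reused on the very next step rather than after the whole batch drains.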
Batch Processing
Dynamic Batching
Batch Formation
The scheduler dynamically forms batches based on:
- Available memory
- Request priorities
- Prefill chunking constraints
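The three criteria above can be combined in a simple greedy selection. This is a minimal sketch, assuming a token budget derived from free memory and a per-batch prefill cap; `PendingReq`, `form_batch`, `free_tokens`, and `max_prefill_tokens` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class PendingReq:
    rid: str
    num_tokens: int    # prompt tokens still to prefill
    priority: int = 0  # larger means more urgent (assumed convention)


def form_batch(waiting, free_tokens, max_prefill_tokens):
    """Greedily pick requests that fit the memory budget and the prefill cap."""
    budget = min(free_tokens, max_prefill_tokens)
    batch = []
    # Highest priority first; sorted() is stable, so ties keep arrival order.
    for req in sorted(waiting, key=lambda r: -r.priority):
        if req.num_tokens <= budget:
            batch.append(req)
            budget -= req.num_tokens
    return batch
```

A greedy pass is cheap enough to run on every iteration, at the cost of occasionally skipping a large request in favor of smaller ones that fit.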
Continuous Batching
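Requests can join or leave a batch at any iteration: new requests are admitted as capacity frees up, and finished requests leave without waiting for the rest of the batch.
Chunked Prefill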
Why Chunk?
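A large prefill can occupy a whole forward pass and block decode requests that are already running, increasing their latency. Chunking splits the prefill into smaller pieces that are processed over several steps.
Implementation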
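One way to picture chunked prefill is to split a prompt into fixed-size token ranges and let each step carry one chunk alongside the running decode requests. The sketch below assumes exactly that pairing; `chunk_prefill`, `interleave`, and `chunk_size` are hypothetical names, and the real scheduler may combine work differently.

```python
def chunk_prefill(prompt_len, chunk_size):
    """Yield (start, end) token ranges that cover the prompt in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, prompt_len, chunk_size):
        yield start, min(start + chunk_size, prompt_len)


def interleave(prompt_len, chunk_size, decode_rids):
    """Plan steps: each step runs one prefill chunk plus all decode requests."""
    return [
        {"prefill": (start, end), "decode": list(decode_rids)}
        for start, end in chunk_prefill(prompt_len, chunk_size)
    ]
```

With a 10-token prompt and a chunk size of 4, decode requests make progress on each of the three steps instead of waiting for one long prefill.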
Memory Management
Token-to-KV Pool
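A pool of this kind can be pictured as a free list over a fixed number of KV cache slots, one slot per token. This is a minimal sketch under that assumption; the class `SimpleKVPool` and its methods are hypothetical and do not mirror SGLang's pool classes.

```python
class SimpleKVPool:
    """Free-list allocator over a fixed number of KV cache slots (illustrative)."""

    def __init__(self, size):
        self.size = size
        self.free_slots = list(range(size))

    def available(self):
        """Number of slots that can still be allocated."""
        return len(self.free_slots)

    def alloc(self, n):
        """Return n slot indices, or None if the pool cannot satisfy the request."""
        if n > len(self.free_slots):
            return None
        taken, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return taken

    def free(self, slots):
        """Return slots to the pool when a request finishes or is evicted."""
        self.free_slots.extend(slots)
```

Returning `None` instead of raising lets the caller decide whether to wait, evict cached prefixes, or shrink the batch.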
The scheduler allocates KV cache slots for each request's tokens from a preallocated memory pool.
Eviction Policy
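For eviction, a common choice is least-recently-used order restricted to entries that no running request holds. The sketch below assumes that policy; it is not confirmed by this page, and `PrefixLRU` with its `touch`, `lock`, `unlock`, and `evict` methods is a hypothetical illustration.

```python
from collections import OrderedDict


class PrefixLRU:
    """LRU bookkeeping for cached prefixes with reference counts (illustrative)."""

    def __init__(self):
        self.entries = OrderedDict()  # key -> [num_tokens, ref_count]

    def touch(self, key, num_tokens):
        """Record a use of the prefix, creating the entry if needed."""
        if key not in self.entries:
            self.entries[key] = [num_tokens, 0]
        self.entries.move_to_end(key)  # most recently used goes last

    def lock(self, key):
        self.entries[key][1] += 1  # a running request now depends on it

    def unlock(self, key):
        self.entries[key][1] -= 1

    def evict(self, tokens_needed):
        """Drop unlocked entries, oldest first, until enough tokens are freed."""
        freed = 0
        for key in list(self.entries):
            if freed >= tokens_needed:
                break
            num_tokens, refs = self.entries[key]
            if refs == 0:
                del self.entries[key]
                freed += num_tokens
        return freed
```

The reference count keeps a prefix alive while any running request still reads its KV cache, even if it is the oldest entry.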
When memory is full, the scheduler can evict cached prefixes to free space for new requests.
RadixAttention (Prefix Caching)
Radix Tree Structure
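To show how prefix matching and cache insertion fit together, here is a minimal sketch that uses a plain token-level trie rather than a compressed radix tree, so each node holds one token instead of a token run. `TrieNode`, `PrefixCache`, `match_prefix`, and `insert` are hypothetical names for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.kv_slot = None  # KV cache slot that holds this token's entry


class PrefixCache:
    """Token-level trie mapping cached prefixes to KV slots (illustrative)."""

    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return the KV slots of the longest cached prefix of tokens."""
        node, slots = self.root, []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            slots.append(child.kv_slot)
            node = child
        return slots

    def insert(self, tokens, kv_slots):
        """Add tokens to the trie; already cached tokens keep their old slots."""
        node = self.root
        for tok, slot in zip(tokens, kv_slots):
            child = node.children.get(tok)
            if child is None:
                child = TrieNode()
                child.kv_slot = slot
                node.children[tok] = child
            node = child
```

A request whose prompt starts with a cached prefix only needs to prefill the tokens after the matched slots; a real radix tree stores whole token runs per edge to keep the structure compact.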
The scheduler maintains a radix tree that tracks token prefixes shared across requests, so their KV cache can be reused.
Prefix Matching
When a new request arrives, its tokens are matched against the radix tree to find the longest cached prefix.
Cache Insertion
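After computing new KV cache, the newly processed tokens are inserted into the radix tree so that later requests can reuse them.
Request Prioritization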
Priority Levels
Requests can have different priorities.
Scheduling with Priority
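Priority-aware scheduling can be sketched with a heap that pops the most urgent request first and falls back to arrival order on ties. This is an illustration under assumptions: `PriorityWaitingQueue` is a hypothetical class, and the convention that a larger number means higher priority is chosen for the example.

```python
import heapq
import itertools


class PriorityWaitingQueue:
    """Waiting queue ordered by priority, then by arrival (illustrative)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, rid, priority=0):
        # heapq is a min-heap, so negate priority to pop the largest first.
        # The arrival counter breaks ties in first-come, first-served order.
        heapq.heappush(self._heap, (-priority, next(self._arrival), rid))

    def pop(self):
        """Remove and return the id of the most urgent request."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The arrival counter also prevents Python from comparing request payloads directly when two priorities are equal.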
Sampling
Token Sampling
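Sampling from the output logits can be sketched with temperature scaling and an optional top-k cutoff. This is a minimal pure-Python sketch, not SGLang's sampler: `sample_token` is a hypothetical function, and treating a temperature of 0 as greedy decoding is an assumed convention.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    """Pick a token id from logits using temperature and optional top-k."""
    if temperature <= 0:
        # Assumed convention: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    # Candidate token ids, best first.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        candidates = candidates[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible in tests.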
After the model forward pass, the next token for each request is sampled from the output logits.
Penalties
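Penalties adjust the logits before sampling so that tokens already generated become less likely. The sketch below assumes the common frequency and presence penalty formulation, where a token's logit drops by its count times the frequency penalty plus a flat presence penalty; `apply_penalties` is a hypothetical helper, and the penalties SGLang supports may differ.

```python
from collections import Counter


def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with penalties applied to generated tokens."""
    counts = Counter(generated)
    out = list(logits)
    for tok, count in counts.items():
        # Frequency penalty scales with repeats; presence penalty is flat.
        out[tok] -= count * frequency_penalty + presence_penalty
    return out
```

Tokens that never appeared in the output keep their original logits, so the penalties only discourage repetition.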
Finish Conditions
Checking Completion
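A completion check typically looks at three things: an end-of-sequence token, a user-supplied stop string, and the output length limit. This is a minimal sketch under those assumptions; `check_finished`, its parameters, and the returned reason strings are hypothetical.

```python
def check_finished(output_ids, max_new_tokens, eos_token_id, stop_strs=(), text=""):
    """Return a finish reason ("stop" or "length"), or None to keep decoding."""
    # End-of-sequence token emitted by the model.
    if output_ids and output_ids[-1] == eos_token_id:
        return "stop"
    # Any non-empty stop string present in the decoded text.
    if any(s and s in text for s in stop_strs):
        return "stop"
    # Output length limit reached.
    if len(output_ids) >= max_new_tokens:
        return "length"
    return None
```

Checking the stop conditions before the length limit means a request that ends naturally on its last allowed token is reported as stopped rather than truncated.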
Performance Tuning
Key Parameters
Monitoring
Advanced Features
Speculative Decoding
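The draft-then-verify idea can be sketched with greedy acceptance: the draft model proposes `k` tokens, and the target model keeps the longest prefix it agrees with, then contributes one token of its own. This is a simplified variant for illustration; production systems often use probabilistic acceptance and verify all positions in one batched forward pass. `speculative_step`, `draft_next`, and `target_next` are hypothetical names.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted in one draft-and-verify round."""
    # Draft phase: propose k tokens autoregressively with the small model.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: compare each proposed token with the target's choice.
    # (Called position by position here; a real system batches this.)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first wrong token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # Every proposal matched: the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted
```

Each round yields at least one token from the target model, so the output matches what greedy decoding with the target alone would produce, while agreeing drafts let several tokens land per target pass.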
A small draft model speculates several future tokens, which the target model then verifies.
Multi-Model Scheduling
Requests can be scheduled across multiple model replicas.
Debugging
Enable Scheduler Logging
Trace Request
Resources
Next Steps
- Memory Management - KV cache and memory pools
- Architecture Overview - Overall system design
- Kernel Development - Optimize forward pass
