Overview
The SGLang architecture follows a multi-stage pipeline design.
Core Components
Scheduler
The Scheduler is the brain of SGLang’s serving system, located in python/sglang/srt/managers/scheduler.py. It manages:
- Request Queue Management: Maintains waiting queues and running batches
- Continuous Batching: Dynamically adds and removes requests from batches
- Memory Coordination: Works with memory pools to allocate/deallocate KV cache
- Policy Execution: Implements scheduling policies (FCFS, LPM, DFS-weight)
The Scheduler uses a running_batch to track currently executing requests and a waiting_queue for pending requests. It continuously decides which requests to admit into the batch based on available memory and scheduling policies.
python/sglang/srt/managers/scheduler.py:266-412
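The admission logic described above can be sketched as follows. This is a minimal illustration, not SGLang's actual classes: the `Request`, `Scheduler`, and `kv_budget` names here are hypothetical, and the real scheduler also handles prefix-cache hits, retraction, and multiple policies.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    num_tokens: int  # tokens whose KV cache must fit in the pool

@dataclass
class Scheduler:
    kv_budget: int                       # free KV-cache token slots (illustrative)
    waiting_queue: deque = field(default_factory=deque)
    running_batch: list = field(default_factory=list)

    def admit(self):
        """Move requests from the waiting queue into the running batch
        while the KV-cache budget allows (FCFS order: the head blocks)."""
        while self.waiting_queue and self.waiting_queue[0].num_tokens <= self.kv_budget:
            req = self.waiting_queue.popleft()
            self.kv_budget -= req.num_tokens
            self.running_batch.append(req)

    def finish(self, req):
        """Remove a finished request and return its KV-cache slots."""
        self.running_batch.remove(req)
        self.kv_budget += req.num_tokens
```

Because `finish` returns memory to the budget, a later call to `admit` can immediately pull more waiting requests into the batch, which is the essence of continuous batching.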
Model Executor
The Model Executor consists of two main parts:
TpModelWorker
Located in python/sglang/srt/managers/tp_worker.py, the TpModelWorker handles:
- Tensor Parallelism: Coordinates model sharding across GPUs
- Batch Preparation: Transforms ScheduleBatch to ModelWorkerBatch
- Forward Pass Orchestration: Manages generation and embedding inference
python/sglang/srt/managers/tp_worker.py:206-300
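The ScheduleBatch-to-ModelWorkerBatch transformation can be sketched as a flattening step. The dataclass shapes below are hypothetical simplifications; the real batch objects carry tensors, KV-cache indices, and sampling parameters.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScheduleBatch:           # scheduler-side view (illustrative shape)
    reqs: List[List[int]]      # per-request token ids

@dataclass
class ModelWorkerBatch:        # flat, GPU-friendly view (illustrative shape)
    input_ids: List[int]
    seq_lens: List[int]

def prepare_for_forward(batch: ScheduleBatch) -> ModelWorkerBatch:
    """Flatten per-request token lists into one contiguous id buffer plus
    per-request lengths, roughly what a worker needs for a batched forward."""
    input_ids, seq_lens = [], []
    for tokens in batch.reqs:
        input_ids.extend(tokens)
        seq_lens.append(len(tokens))
    return ModelWorkerBatch(input_ids, seq_lens)
```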
ModelRunner
The ModelRunner (python/sglang/srt/model_executor/model_runner.py) executes the actual model forward passes:
- GPU Memory Management: Allocates KV cache and model weights
- CUDA Graph Support: Optimizes repeated patterns
- Attention Backend Integration: Supports FlashInfer, FlashAttention, Triton, etc.
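Backend integration of this kind is commonly implemented as a name-keyed registry. The sketch below is a generic registry pattern with placeholder backend classes, not SGLang's actual backend API:

```python
# Hypothetical registry illustrating how a runner might select an
# attention backend by name; real backends wrap compiled GPU kernels.
ATTENTION_BACKENDS = {}

def register_backend(name):
    def deco(cls):
        ATTENTION_BACKENDS[name] = cls
        return cls
    return deco

@register_backend("triton")
class TritonAttention:
    def forward(self, q, k, v):
        return "triton-attn"

@register_backend("flashinfer")
class FlashInferAttention:
    def forward(self, q, k, v):
        return "flashinfer-attn"

def create_backend(name: str):
    """Instantiate the backend registered under name, or fail loudly."""
    try:
        return ATTENTION_BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown attention backend: {name}") from None
```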
Memory Manager
SGLang implements a two-level memory pool system in python/sglang/srt/mem_cache/memory_pool.py:
ReqToTokenPool
Maps requests to their token locations.
python/sglang/srt/mem_cache/memory_pool.py:126-147
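The request-to-token mapping can be sketched as a slot table: each active request owns a slot that records where its tokens live in the KV pool. The class below is illustrative only; the real pool is backed by a preallocated tensor, not Python lists.

```python
class ReqToTokenPool:
    """Sketch: a fixed number of request slots, each holding the KV-cache
    locations (indices into the token-to-KV pool) of that request's tokens."""
    def __init__(self, size: int, max_context_len: int):
        self.free_slots = list(range(size))
        self.max_context_len = max_context_len
        self.req_to_token = {}           # slot -> list of KV indices

    def alloc(self) -> int:
        """Claim a free request slot."""
        slot = self.free_slots.pop()
        self.req_to_token[slot] = []
        return slot

    def write(self, slot: int, kv_indices):
        """Append KV-cache locations for newly processed tokens."""
        assert len(self.req_to_token[slot]) + len(kv_indices) <= self.max_context_len
        self.req_to_token[slot].extend(kv_indices)

    def free(self, slot: int):
        """Release the slot when the request finishes."""
        del self.req_to_token[slot]
        self.free_slots.append(slot)
```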
TokenToKVPool
Manages the actual KV cache storage:
- MHATokenToKVPool: Multi-head attention KV cache
- Paged Memory: Organizes cache in fixed-size pages
- FP8 Quantization Support: Optional compression for memory efficiency
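The paged organization above can be sketched as an allocator that hands out whole fixed-size pages, so a request's cache slots need not be contiguous. This is a toy model under assumed names; the real pool stores per-layer K/V tensors on the GPU.

```python
class PagedKVPool:
    """Sketch of a paged KV allocator: cache slots are grouped into
    fixed-size pages, and requests receive whole pages."""
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))

    def alloc(self, num_tokens: int):
        """Return slot indices for num_tokens tokens, or None if the
        pool cannot satisfy the request."""
        pages_needed = -(-num_tokens // self.page_size)  # ceiling division
        if pages_needed > len(self.free_pages):
            return None
        pages = [self.free_pages.pop() for _ in range(pages_needed)]
        # Expand pages into per-token slot indices; the last page may be
        # only partially used.
        slots = [p * self.page_size + i for p in pages for i in range(self.page_size)]
        return slots[:num_tokens]

    def free(self, pages):
        """Return pages to the free list when a request completes."""
        self.free_pages.extend(pages)
```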
The memory pool design enables efficient memory reuse through the RadixCache, allowing multiple requests to share common prefix tokens.
RadixCache
The RadixCache (python/sglang/srt/mem_cache/radix_cache.py) is a prefix tree data structure that:
- Detects Shared Prefixes: Automatically identifies common token sequences
- Enables Prefix Reuse: Multiple requests share cached KV states
- Implements Eviction Policies: LRU, LFU, FIFO, MRU, FILO, priority-based
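Prefix detection in such a tree can be sketched with one token per edge (a plain trie rather than a compressed radix tree, which merges single-child chains). The class is illustrative; the real cache also stores KV indices, timestamps, and reference counts per node.

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}       # token id -> child node

class RadixCacheSketch:
    """Simplified prefix tree keyed by token ids."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse it."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of tokens,
        i.e. how many KV entries a new request could reuse."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```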
Data Flow
The data flow through SGLang follows this pattern:
Request Processing Flow
python/sglang/srt/managers/schedule_batch.py:20-36
Continuous Batching Loop
The scheduler runs an event loop that continuously processes requests.
python/sglang/srt/managers/scheduler.py:1110-1129
Memory Hierarchy
SGLang’s memory system has multiple levels:
- GPU KV Cache: Fast on-device storage for active requests
- RadixCache Tree: Logical organization of shared prefixes
- Evictable/Protected: Memory is either locked (in-use) or evictable
- Host Storage (optional): CPU/disk backup for inactive cache
python/sglang/srt/mem_cache/radix_cache.py:641-646
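The evictable/protected distinction can be sketched with a reference-counted LRU map: entries locked by running requests are protected, everything else is reclaimable. Names and structure here are illustrative, not SGLang's API.

```python
from collections import OrderedDict

class EvictableCache:
    """Sketch: entries locked (ref count > 0) by in-flight requests cannot
    be evicted; unlocked entries are reclaimed in LRU order."""
    def __init__(self):
        self.entries = OrderedDict()   # key -> lock count; dict order = recency

    def add(self, key):
        self.entries[key] = 0
        self.entries.move_to_end(key)

    def lock(self, key):
        self.entries[key] += 1         # protect while a request uses it

    def unlock(self, key):
        self.entries[key] -= 1
        self.entries.move_to_end(key)  # touch on release: recently used

    def evict(self, n):
        """Evict up to n unlocked entries, least recently used first."""
        victims = [k for k, locks in self.entries.items() if locks == 0][:n]
        for k in victims:
            del self.entries[k]
        return victims
```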
Scheduling Policies
SGLang supports multiple scheduling policies in python/sglang/srt/managers/schedule_policy.py:
Cache-Aware Policies
- LPM (Longest Prefix Match): Prioritizes requests with longer cached prefixes
- DFS-Weight: Uses depth-first weighting to maximize cache reuse
Cache-Agnostic Policies
- FCFS: First-come-first-served
- LOF: Longest output first
- Random: Random ordering
- Routing-Key: Groups by routing keys for better batching
Cache-aware policies provide better throughput when RadixCache is enabled by maximizing prefix reuse.
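The difference between the policies comes down to how the waiting queue is ordered before admission. A sketch, where `cached_len` and `arrival` are hypothetical stand-ins for a RadixCache prefix match and an arrival counter:

```python
import random

def sort_waiting_queue(policy, queue, cached_len, arrival):
    """Order the waiting queue according to a named policy.
    cached_len: request id -> cached-prefix length (from a prefix match).
    arrival:    request id -> arrival sequence number."""
    if policy == "fcfs":
        return sorted(queue, key=lambda r: arrival[r])
    if policy == "lpm":
        # Longest cached prefix first: admitted requests reuse the most KV cache.
        return sorted(queue, key=lambda r: -cached_len.get(r, 0))
    if policy == "random":
        q = list(queue)
        random.shuffle(q)
        return q
    raise ValueError(f"unknown policy: {policy}")
```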
Distributed Execution
SGLang supports multiple parallelism strategies:
- Tensor Parallelism (TP): Shards model weights across GPUs
- Pipeline Parallelism (PP): Splits model layers across devices
- Data Parallelism (DP): Replicates model for different request batches
- Expert Parallelism (EP): For MoE models, distributes experts
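These strategies compose: the world size factors into data-parallel replicas of pipeline stages of tensor-parallel shards. A hedged sketch of that factoring, with an assumed convention that consecutive ranks form a TP group:

```python
def device_mesh(world_size: int, tp: int, pp: int = 1):
    """Factor world_size into (dp, pp, tp) and return the data-parallel
    degree plus TP rank groups. Illustrative only: real launchers also
    build process groups and, for MoE, an expert-parallel dimension."""
    assert world_size % (tp * pp) == 0, "world_size must be divisible by tp*pp"
    dp = world_size // (tp * pp)
    ranks = list(range(world_size))
    # Consecutive ranks share a TP group (an assumed layout convention).
    tp_groups = [ranks[i:i + tp] for i in range(0, world_size, tp)]
    return dp, tp_groups
```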
Key Performance Optimizations
1. RadixAttention with Prefix Caching
Automatic detection and reuse of common prefixes across requests. See RadixAttention.
2. Continuous Batching
Dynamic batch composition allows requests to enter and exit without waiting. See Continuous Batching.
3. CUDA Graphs
Captures and replays kernel launches for decode phases, reducing CPU overhead.
4. Paged KV Cache
Non-contiguous memory allocation reduces fragmentation and enables flexible memory management.
5. Chunked Prefill
Large prefill requests can be split into chunks to better interleave with decode requests.
Configuration
Key configuration parameters:
Related Topics
- RadixAttention - The core attention mechanism
- Prefix Caching - How prefix caching works
- Continuous Batching - Dynamic request batching
