Architecture Overview
This guide provides an overview of SGLang’s architecture, components, and design principles.
High-Level Architecture
SGLang consists of three main layers: the frontend (high-level API and language constructs), the runtime (execution and resource management), and the kernel layer (low-level optimizations).
Core Components
1. Engine
The Engine is the main entry point for inference. It coordinates the tokenizer manager, scheduler, and detokenizer.
Location: python/sglang/srt/entrypoints/engine.py
Key Responsibilities:
- Initialize model and workers
- Manage request lifecycle
- Coordinate inter-process communication
2. Scheduler
The scheduler manages batching, memory allocation, and request execution.
Location: python/sglang/srt/managers/scheduler.py
Key Features:
- Dynamic batching: Combines requests for efficient GPU utilization
- Continuous batching: Processes requests as they arrive
- Prefix caching (RadixAttention): Reuses KV cache for common prefixes
- Chunked prefill: Breaks large prefills into smaller chunks
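Continuous batching, the core of the features above, can be illustrated with a small toy simulation (all names here are invented for illustration, not SGLang's actual classes):

```python
# Toy model of continuous batching: requests join and leave the running
# batch between decode steps, rather than waiting for a full batch to drain.
from collections import deque

def continuous_batching(requests, max_batch_size=2):
    """requests: list of (req_id, num_tokens_to_generate)."""
    waiting = deque(requests)
    running = {}          # req_id -> tokens still to generate
    finish_order = []
    while waiting or running:
        # Admit new requests whenever a slot frees up (continuous batching),
        # instead of waiting for the whole batch to finish (static batching).
        while waiting and len(running) < max_batch_size:
            req_id, n = waiting.popleft()
            running[req_id] = n
        # One decode step: every running request produces one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                del running[req_id]
                finish_order.append(req_id)
    return finish_order

print(continuous_batching([("a", 1), ("b", 3), ("c", 1)]))  # -> ['a', 'c', 'b']
```

Note that "c" finishes before "b": it is admitted into the slot "a" vacates, mid-way through "b"'s generation.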
3. Memory Management
Location: python/sglang/srt/mem_cache/
Components:
- Token-to-KV pool: Maps tokens to KV cache locations
- Memory pool: Pre-allocated GPU memory for KV cache
- Radix tree: Efficient prefix matching and reuse
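The token-to-KV pool can be pictured as a fixed-size pool with a free list, as in this minimal sketch (invented names; the real pool manages GPU tensors, not Python dicts):

```python
# Toy token-to-KV-slot allocator: a pre-sized pool plus a free list,
# mirroring the idea of a pre-allocated KV-cache memory pool.

class ToyKVPool:
    def __init__(self, num_slots):
        self.free = list(range(num_slots))   # pre-allocated slot indices
        self.token_to_slot = {}              # (req_id, pos) -> slot

    def alloc(self, req_id, pos):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        slot = self.free.pop()
        self.token_to_slot[(req_id, pos)] = slot
        return slot

    def free_request(self, req_id):
        # Release every slot owned by a finished request back to the pool.
        for key in [k for k in self.token_to_slot if k[0] == req_id]:
            self.free.append(self.token_to_slot.pop(key))

pool = ToyKVPool(num_slots=4)
pool.alloc("r1", 0)
pool.alloc("r1", 1)
pool.free_request("r1")
print(len(pool.free))  # -> 4
```

Pre-allocating the pool up front avoids fragmentation and out-of-memory surprises at runtime; admission control then becomes a question of whether enough free slots exist.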
4. Model Runner
Executes the actual model forward pass.
Location: python/sglang/srt/model_executor/model_runner.py
Key Responsibilities:
- Load model weights
- Execute forward pass (prefill and decode)
- Apply sampling
- Manage CUDA graphs
Forward Modes:
- Prefill: Process input tokens (compute KV cache)
- Decode: Generate one token at a time (use cached KV)
- Extend: Hybrid mode for mid-sequence insertions
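The contrast between prefill and decode can be sketched as follows (a toy in which a list of strings stands in for per-token KV tensors; all names are illustrative):

```python
# Toy contrast between prefill (process all prompt tokens in one pass,
# filling the KV cache) and decode (one token per step, reusing cached KV).

def prefill(prompt_tokens):
    # One forward pass over the whole prompt; one KV entry per input token.
    return [f"kv({t})" for t in prompt_tokens]

def decode_step(kv_cache, new_token):
    # Attention would read all of kv_cache here; we only append the new entry.
    kv_cache.append(f"kv({new_token})")
    return new_token + 1                   # dummy "next token"

cache = prefill([10, 11, 12])   # prefill: whole prompt at once
tok = 100
for _ in range(2):              # decode: one token per forward pass
    tok = decode_step(cache, tok)

print(len(cache))  # -> 5  (3 prompt tokens + 2 generated)
```

Prefill is compute-bound (many tokens per pass), while decode is memory-bound (one token per pass, but the full KV cache is read each step), which is why the two modes are optimized separately.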
5. Attention Backend
Optimized attention implementations.
Location: python/sglang/srt/layers/attention/
Backends:
- FlashInfer: Default, highly optimized
- FlashAttention: Alternative backend
- Triton: Custom Triton kernels
Supported Attention Variants:
- Grouped-query attention (GQA)
- Multi-query attention (MQA)
- Sliding window attention
- Sparse attention patterns
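In grouped-query attention, several query heads share one KV head, which shrinks the KV cache; the head mapping is just integer division. A sketch with illustrative head counts:

```python
# GQA head mapping: each query head attends using the KV head of its group.
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 32 query heads sharing 8 KV heads -> groups of 4
print([kv_head_for(h, 32, 8) for h in range(8)])  # -> [0, 0, 0, 0, 1, 1, 1, 1]
```

MQA is the extreme case num_kv_heads = 1 (every query head shares a single KV head), and standard multi-head attention is num_kv_heads = num_q_heads.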
Advanced Features
RadixAttention (Prefix Caching)
Automatically detects and reuses common prompt prefixes across requests.
Chunked Prefill
Breaks long prompts into chunks to maintain low latency. Without chunking, a single long prefill can stall decoding for every other running request.
Multi-Model Serving
Data Parallelism (DP): Multiple replicas of the model serve requests independently, with incoming traffic balanced across them.
Expert Parallelism (EP)
For Mixture-of-Experts (MoE) models, experts are sharded across GPUs and tokens are routed to the devices holding their assigned experts.
Disaggregated Serving
Prefill-Decode (PD) Disaggregation:
- Independent scaling of prefill and decode
- Better resource utilization
- Lower latency for decode-heavy workloads
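The routing idea behind PD disaggregation can be sketched as follows (a toy with invented names; in practice the KV cache produced by prefill is transferred to the decode worker, e.g. over RDMA):

```python
# Toy PD-disaggregation router: prefill and decode are handled by separate
# worker pools that can be sized and scaled independently.

class WorkerPool:
    def __init__(self, name, size):
        self.name, self.size = name, size
        self.assigned = 0

    def submit(self):
        # Trivial round-robin assignment across the pool's workers.
        self.assigned += 1
        return f"{self.name}-{self.assigned % self.size}"

prefill_pool = WorkerPool("prefill", 2)   # compute-heavy stage
decode_pool = WorkerPool("decode", 4)     # memory-bound stage, scaled wider

def handle_request(prompt_tokens):
    p = prefill_pool.submit()   # 1. run prefill on the prefill pool
    # ...KV cache handed off to a decode worker here...
    d = decode_pool.submit()    # 2. run decode on the decode pool
    return p, d

print(handle_request([1, 2, 3]))  # -> ('prefill-1', 'decode-1')
```

Because the two pools scale independently, a decode-heavy workload can add decode workers without paying for idle prefill capacity.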
Communication & Synchronization
Inter-Process Communication (IPC)
SGLang uses ZMQ for communication between processes. Message types include:
- GenerateReqInput: New request
- TokenizedResult: Tokenized input
- BatchDecodeOutput: Decoded tokens
- AbortReq: Cancel request
Distributed Communication
For multi-GPU setups, SGLang uses:
- NCCL: GPU-to-GPU communication
- PyTorch distributed: Process groups
- RDMA: Low-latency networking (optional)
Request Lifecycle
1. Request Arrival
2. Validation & Tokenization
3. Scheduling
4. Batching & Execution
5. Detokenization & Response
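The five steps above can be traced with a toy, single-process pipeline (all function names invented; the real flow crosses process boundaries via ZMQ):

```python
# Toy end-to-end trace of the request lifecycle. Synchronous and
# single-process for clarity.

def tokenize(text):
    return [ord(c) for c in text]              # 2. validation & tokenization

def schedule(queue):
    return queue[:2], queue[2:]                # 3. pick a batch of up to 2

def execute(batch):
    # 4. one "forward pass": append token 33 ('!') to each request
    return [(req_id, ids + [33]) for req_id, ids in batch]

def detokenize(ids):
    return "".join(chr(i) for i in ids)        # 5. back to text

queue = [("r1", tokenize("hi")), ("r2", tokenize("yo"))]   # 1. arrival
batch, queue = schedule(queue)
outputs = {req_id: detokenize(ids) for req_id, ids in execute(batch)}
print(outputs)  # -> {'r1': 'hi!', 'r2': 'yo!'}
```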
Performance Optimizations
CUDA Graphs
Capture and replay CUDA operations for reduced overhead. Without CUDA graphs, every decode step pays per-kernel CPU launch overhead.
Continuous Batching
Add and remove requests from batches dynamically between decode steps.
Kernel Fusion
Combine multiple operations into single kernels to reduce memory traffic and kernel-launch overhead.
Directory Structure
Design Principles
1. Separation of Concerns
- Frontend: High-level API and language constructs
- Runtime: Efficient execution and resource management
- Kernels: Low-level optimizations
2. Modularity
- Pluggable attention backends
- Swappable memory allocators
- Flexible scheduling policies
3. Performance First
- Zero-copy wherever possible
- Minimize CPU-GPU synchronization
- Aggressive kernel fusion
- CUDA graphs for low latency
4. Scalability
- Horizontal scaling via data parallelism
- Vertical scaling via tensor/pipeline parallelism
- Disaggregated architectures for large deployments
Key Algorithms
Radix Tree Matching
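A minimal sketch of the idea: insert cached token sequences into a tree, then find the longest cached prefix of a new request. This toy keeps one node per token (a plain trie); a real radix tree compresses chains of single-child nodes and tracks reference counts for eviction.

```python
# Toy trie-based prefix matcher standing in for radix-tree matching.

class Node:
    def __init__(self):
        self.children = {}   # token -> child Node

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())

def longest_prefix(root, tokens):
    node, length = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        length += 1
    return length   # number of tokens whose KV cache can be reused

root = Node()
insert(root, [1, 2, 3, 4])                 # an earlier request's cached prompt
print(longest_prefix(root, [1, 2, 3, 9]))  # -> 3 tokens reusable
```

A new request whose first 3 tokens match a cached sequence skips recomputing KV for those tokens and only prefills the tail.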
Token Sampling
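As a sketch of the usual sampling pipeline (temperature scaling, softmax, then top-p truncation), here in pure Python with illustrative numbers; production samplers run this fused on the GPU:

```python
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    # 1. Temperature scaling + numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p (nucleus): keep the smallest set of tokens with mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 3. Renormalize over the kept tokens and draw one.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

# Very low temperature approaches greedy decoding (argmax):
print(sample([1.0, 5.0, 2.0], temperature=0.1))  # -> 1
```

Lowering temperature sharpens the distribution toward the argmax; lowering top_p discards the low-probability tail before sampling.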
Resources
Next Steps
- Scheduler - Deep dive into scheduling
- Memory Management - Memory system details
- Kernel Development - Writing custom kernels
