SGLang is designed as a high-performance serving framework for large language models (LLMs) and multimodal models. The system architecture consists of several key components working together to achieve efficient request handling, memory management, and model execution.

Overview

The SGLang architecture follows a multi-stage pipeline design: the Scheduler admits and batches incoming requests, the Model Executor runs the forward passes, and the Memory Manager tracks the KV cache behind them.

Core Components

Scheduler

The Scheduler is the brain of SGLang’s serving system, located in python/sglang/srt/managers/scheduler.py. It manages:
  • Request Queue Management: Maintains waiting queues and running batches
  • Continuous Batching: Dynamically adds and removes requests from batches
  • Memory Coordination: Works with memory pools to allocate/deallocate KV cache
  • Policy Execution: Implements scheduling policies (FCFS, LPM, DFS-weight)
The Scheduler uses a running_batch to track currently executing requests and a waiting_queue for pending requests. It continuously decides which requests to admit into the batch based on available memory and scheduling policies.
Key initialization components:
class Scheduler:
    def __init__(self, server_args, port_args, gpu_id, tp_rank, ...):
        # Init model configs
        self.init_model_config()
        
        # Launch model worker
        self.init_model_worker()
        
        # Init cache and memory pool
        self.init_cache_with_memory_pool()
        
        # Init schedule policy
        self.init_schedule_policy()
Reference: python/sglang/srt/managers/scheduler.py:266-412
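To make the admission decision concrete, here is a minimal, framework-free sketch (names and the token-budget abstraction are hypothetical, not SGLang's actual code) of how a scheduler can move requests from the waiting queue into the running batch only while memory allows:

```python
# Hypothetical sketch of the per-iteration admission decision: admit
# waiting requests into the running batch only while the token budget
# (a stand-in for free KV-cache memory) allows it.
from collections import deque

def admit_requests(waiting_queue, running_batch, free_tokens):
    """Move requests from the waiting queue into the running batch
    while their prompt lengths fit in the remaining token budget."""
    while waiting_queue:
        req_len = waiting_queue[0]   # tokens the next request needs
        if req_len > free_tokens:
            break                    # not enough memory; stop admitting
        waiting_queue.popleft()
        running_batch.append(req_len)
        free_tokens -= req_len
    return free_tokens

queue = deque([100, 300, 250])
batch = []
remaining = admit_requests(queue, batch, free_tokens=450)
# 100 and 300 are admitted; 250 no longer fits and stays queued
```

The real scheduler layers a policy (FCFS, LPM, ...) on top of this check to decide the *order* in which waiting requests are considered.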

Model Executor

The Model Executor consists of two main parts:

TpModelWorker

Located in python/sglang/srt/managers/tp_worker.py, the TpModelWorker handles:
  • Tensor Parallelism: Coordinates model sharding across GPUs
  • Batch Preparation: Transforms ScheduleBatch to ModelWorkerBatch
  • Forward Pass Orchestration: Manages generation and embedding inference
class TpModelWorker:
    def forward_batch_generation(self, model_worker_batch):
        forward_batch = ForwardBatch.init_new(
            model_worker_batch, self.model_runner
        )
        return self.model_runner.forward(forward_batch)
Reference: python/sglang/srt/managers/tp_worker.py:206-300

ModelRunner

The ModelRunner (python/sglang/srt/model_executor/model_runner.py) executes the actual model forward passes:
  • GPU Memory Management: Allocates KV cache and model weights
  • CUDA Graph Support: Optimizes repeated patterns
  • Attention Backend Integration: Supports FlashInfer, FlashAttention, Triton, etc.

Memory Manager

SGLang implements a two-level memory pool system in python/sglang/srt/mem_cache/memory_pool.py:

ReqToTokenPool

Maps requests to their token locations:
class ReqToTokenPool:
    def __init__(self, size, max_context_len, device, enable_memory_saver):
        # Store mapping: [req_idx, token_position] -> kv_cache_location
        self.req_to_token = torch.zeros(
            (size, max_context_len), dtype=torch.int32, device=device
        )
        self.free_slots = list(range(size))
Reference: python/sglang/srt/mem_cache/memory_pool.py:126-147
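The free-slot list above drives allocation. A simplified, framework-free sketch (class and method names hypothetical) of how such a pool hands out and reclaims request slots:

```python
# Simplified sketch of a slot pool: slots are handed out from a free
# list in O(1) and returned for reuse when a request finishes.
class SlotPool:
    def __init__(self, size: int):
        self.free_slots = list(range(size))  # indices not currently in use

    def alloc(self) -> int:
        if not self.free_slots:
            raise RuntimeError("pool exhausted")
        return self.free_slots.pop()         # O(1) allocation

    def free(self, idx: int) -> None:
        self.free_slots.append(idx)          # return the slot for reuse

pool = SlotPool(size=4)
a = pool.alloc()   # slot for request A
b = pool.alloc()   # slot for request B
pool.free(a)       # request A finished; its slot is reusable
```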

TokenToKVPool

Manages the actual KV cache storage:
  • MHATokenToKVPool: Multi-head attention KV cache
  • Paged Memory: Organizes cache in fixed-size pages
  • FP8 Quantization Support: Optional compression for memory efficiency
The memory pool design enables efficient memory reuse through the RadixCache, allowing multiple requests to share common prefix tokens.
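To illustrate the paged organization (this is an illustrative sketch, not SGLang's actual code), each request keeps a small page table mapping logical token positions to physical pages, so the cache need not be contiguous:

```python
# Illustrative sketch of paged KV-cache addressing: a request's tokens
# map through a per-request page table to fixed-size physical pages,
# which can live anywhere in the pool.
PAGE_SIZE = 16

def pages_needed(num_tokens: int) -> int:
    # Ceiling division: tokens rounded up to whole pages
    return -(-num_tokens // PAGE_SIZE)

def token_location(page_table: list[int], pos: int) -> tuple[int, int]:
    """Translate a logical token position to (physical page, offset)."""
    return page_table[pos // PAGE_SIZE], pos % PAGE_SIZE

# A 40-token request occupies 3 pages, scattered anywhere in the pool
table = [7, 2, 9]
page, offset = token_location(table, 17)  # token 17 -> page 2, offset 1
```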

RadixCache

The RadixCache (python/sglang/srt/mem_cache/radix_cache.py) is a prefix tree data structure that:
  • Detects Shared Prefixes: Automatically identifies common token sequences
  • Enables Prefix Reuse: Multiple requests share cached KV states
  • Implements Eviction Policies: LRU, LFU, FIFO, MRU, FILO, priority-based
See RadixAttention and Prefix Caching for detailed explanations.
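The core idea of prefix detection can be sketched with a plain token-by-token trie (SGLang's actual radix tree compresses runs of tokens into single nodes; the names here are hypothetical):

```python
# Minimal trie sketch showing how shared prefixes are detected so that
# cached KV states for the matched tokens can be reused.
class TrieNode:
    def __init__(self):
        self.children = {}

class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

tree = PrefixTree()
tree.insert([1, 2, 3, 4])                  # first request's prompt
shared = tree.match_prefix([1, 2, 3, 9])   # second request shares 3 tokens
```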

Data Flow

The data flow through SGLang follows this pattern:

Request Processing Flow

# 1. ScheduleBatch -> ModelWorkerBatch -> ForwardBatch
#    (scheduler.py)    (tp_worker.py)     (model_runner.py)

# ScheduleBatch contains high-level scheduling data (CPU)
class ScheduleBatch:
    reqs: List[Req]  # Request objects
    batch_is_full: bool
    forward_mode: ForwardMode  # EXTEND or DECODE

# ModelWorkerBatch is a subset for model forward (CPU->GPU transfer)
class ModelWorkerBatch:
    # Minimal data needed for forward pass
    ...

# ForwardBatch contains low-level GPU tensors
class ForwardBatch:
    input_ids: torch.Tensor
    req_pool_indices: torch.Tensor
    seq_lens: torch.Tensor
    # ... attention-specific tensors
Reference: python/sglang/srt/managers/schedule_batch.py:20-36
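The conversions between these three stages can be sketched with plain dataclasses (field names simplified and the conversion helpers hypothetical): at each handoff, scheduling metadata is pared away until only what the forward pass consumes remains.

```python
# Hedged sketch of the three-stage batch handoff: scheduling state is
# narrowed at each conversion until only forward-pass inputs remain.
from dataclasses import dataclass

@dataclass
class ScheduleBatchSketch:        # CPU: full scheduling state
    req_ids: list
    forward_mode: str             # "EXTEND" or "DECODE"

@dataclass
class ModelWorkerBatchSketch:     # CPU -> GPU boundary: minimal subset
    input_ids: list
    forward_mode: str

@dataclass
class ForwardBatchSketch:         # GPU: what the model actually consumes
    input_ids: list

def to_worker_batch(sb, input_ids):
    return ModelWorkerBatchSketch(input_ids=input_ids,
                                  forward_mode=sb.forward_mode)

def to_forward_batch(wb):
    return ForwardBatchSketch(input_ids=wb.input_ids)

sb = ScheduleBatchSketch(req_ids=["r0", "r1"], forward_mode="DECODE")
fb = to_forward_batch(to_worker_batch(sb, input_ids=[11, 42]))
```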

Continuous Batching Loop

The scheduler runs an event loop that continuously processes requests:
def event_loop_normal(self):
    while True:
        # 1. Receive new requests
        recv_reqs = self.recv_requests()
        self.process_input_requests(recv_reqs)
        
        # 2. Get next batch to run
        batch = self.get_next_batch_to_run()
        
        # 3. Execute the batch
        if batch:
            result = self.run_batch(batch)
            self.process_batch_result(batch, result)
Reference: python/sglang/srt/managers/scheduler.py:1110-1129
SGLang also supports an overlapped scheduling mode (event_loop_overlap) that overlaps CPU processing with GPU computation for improved throughput.
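The overlap idea can be illustrated with a toy producer/consumer (this is not SGLang's implementation, just the shape of the optimization): the CPU keeps preparing the next batch while a worker thread stands in for the GPU running the current one.

```python
# Toy illustration of overlap scheduling: CPU-side preparation of batch
# N+1 proceeds while a worker thread "runs" batch N, hiding CPU cost
# behind compute.
import queue
import threading
import time

work_q = queue.Queue()
results = []

def gpu_worker():
    while True:
        batch = work_q.get()
        if batch is None:          # sentinel: shut down
            break
        time.sleep(0.01)           # stand-in for the GPU forward pass
        results.append(batch * 2)

t = threading.Thread(target=gpu_worker)
t.start()
for batch in range(3):
    work_q.put(batch)              # CPU enqueues batches without waiting
work_q.put(None)
t.join()
```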

Memory Hierarchy

SGLang’s memory system has multiple levels:
  1. GPU KV Cache: Fast on-device storage for active requests
  2. RadixCache Tree: Logical organization of shared prefixes
  3. Evictable/Protected State: Cached memory is either protected (locked by active requests) or evictable
  4. Host Storage (optional): CPU/disk backup for inactive cache
# From radix_cache.py
class RadixCache:
    def evictable_size(self):
        return self.evictable_size_
    
    def protected_size(self):
        # Protected = locked by active requests
        return self.protected_size_
Reference: python/sglang/srt/mem_cache/radix_cache.py:641-646

Scheduling Policies

SGLang supports multiple scheduling policies in python/sglang/srt/managers/schedule_policy.py:

Cache-Aware Policies

  • LPM (Longest Prefix Match): Prioritizes requests with longer cached prefixes
  • DFS-Weight: Uses depth-first weighting to maximize cache reuse

Cache-Agnostic Policies

  • FCFS: First-come-first-served
  • LOF: Longest output first
  • Random: Random ordering
  • Routing-Key: Groups by routing keys for better batching
Cache-aware policies provide better throughput when RadixCache is enabled by maximizing prefix reuse.
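As a minimal sketch of the LPM idea (the helper and cache mapping here are hypothetical), requests are simply ordered by how many of their prompt tokens are already cached:

```python
# Hypothetical sketch of the Longest Prefix Match (LPM) policy: requests
# whose prompts have more tokens already cached are scheduled first.
def lpm_order(requests, cached_prefix_len):
    """Sort requests by cached prefix length, longest first."""
    return sorted(requests, key=cached_prefix_len, reverse=True)

cache = {"req_a": 0, "req_b": 128, "req_c": 32}
order = lpm_order(["req_a", "req_b", "req_c"],
                  cached_prefix_len=cache.get)
# req_b (128 cached tokens) runs before req_c, then req_a
```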

Distributed Execution

SGLang supports multiple parallelism strategies:
  • Tensor Parallelism (TP): Shards model weights across GPUs
  • Pipeline Parallelism (PP): Splits model layers across devices
  • Data Parallelism (DP): Replicates model for different request batches
  • Expert Parallelism (EP): For MoE models, distributes experts
The system coordinates these through NCCL communication groups initialized during startup.
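A back-of-the-envelope sketch of what tensor-parallel sharding means for a single weight matrix (sizes illustrative, helper hypothetical): each rank owns an equal contiguous slice of one dimension.

```python
# Sketch of tensor-parallel sharding: each of tp_size ranks owns an
# equal contiguous slice of a weight dimension.
def shard_bounds(dim: int, tp_size: int, tp_rank: int) -> tuple[int, int]:
    """Half-open [start, end) slice of `dim` owned by `tp_rank`."""
    assert dim % tp_size == 0, "dim must divide evenly across ranks"
    shard = dim // tp_size
    return tp_rank * shard, (tp_rank + 1) * shard

# A hidden dimension of 4096 split over 4 GPUs: rank 2 owns [2048, 3072)
start, end = shard_bounds(4096, tp_size=4, tp_rank=2)
```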

Key Performance Optimizations

1. RadixAttention with Prefix Caching

Automatic detection and reuse of common prefixes across requests. See RadixAttention.

2. Continuous Batching

Dynamic batch composition allows requests to enter/exit without waiting. See Continuous Batching.

3. CUDA Graphs

Captures and replays kernel launches for decode phases, reducing CPU overhead.

4. Paged KV Cache

Non-contiguous memory allocation reduces fragmentation and enables flexible memory management.

5. Chunked Prefill

Large prefill requests can be split into chunks to better interleave with decode requests.
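The chunking itself is straightforward; a simple sketch (helper name hypothetical) of how a long prompt decomposes under a fixed chunk size:

```python
# Simple sketch of chunked prefill: a long prompt is split into
# fixed-size chunks so decode steps of other requests can run between
# them.
def chunk_prefill(num_tokens: int, chunk_size: int) -> list[int]:
    """Return the sizes of each prefill chunk."""
    chunks = []
    while num_tokens > 0:
        step = min(chunk_size, num_tokens)
        chunks.append(step)
        num_tokens -= step
    return chunks

# A 5000-token prompt with a chunk size of 2048
sizes = chunk_prefill(5000, 2048)
# three chunks: 2048 + 2048 + 904
```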

Configuration

Key configuration parameters:
# Memory management
--mem-fraction-static 0.9  # Fraction of GPU memory for weights + KV cache
--max-running-requests 256  # Batch size limit
--max-total-tokens 16384   # Token capacity

# Scheduling
--schedule-policy lpm      # Use longest prefix match
--schedule-conservativeness 1.0  # Memory headroom

# Caching
--disable-radix-cache      # Disable prefix caching
--chunked-prefill-size 2048  # Enable chunked prefill