Overview
With a compact codebase of ~5,000 lines of Python, Mini-SGLang serves as both a capable inference engine and a transparent reference for researchers and developers. The system is built around a modular architecture that separates concerns across components.
Key Components
The system consists of four main components that communicate via ZeroMQ (ZMQ) for control messages and NCCL (via torch.distributed) for heavy tensor data exchange between GPUs:
API Server
The entry point for users. It provides an OpenAI-compatible API (e.g., /v1/chat/completions) to receive prompts and return generated text. The API Server is implemented as a FastAPI application in minisgl.server.api_server.
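A request to this endpoint carries an OpenAI-style JSON body. The sketch below builds one with Python's standard library; the model name and sampling fields are illustrative assumptions, not a statement of exactly which parameters Mini-SGLang supports.

```python
import json

# Build an OpenAI-style chat completion request body.
# The model name and sampling fields are illustrative assumptions.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Hello, Mini-SGLang!"},
    ],
    "max_tokens": 64,
    "stream": True,  # stream tokens back as they are generated
}

body = json.dumps(payload)
```

Any HTTP client can then POST this body to the server's /v1/chat/completions route.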
Tokenizer Worker
Converts input text into numbers (tokens) that the model can understand. The tokenizer worker handles both tokenization requests (text → tokens) and detokenization requests (tokens → text) via the tokenize_worker function.
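The two directions can be illustrated with a toy word-level vocabulary. The real worker wraps a HuggingFace tokenizer; this sketch only shows the shape of the interface and is not the actual implementation.

```python
# Toy tokenizer: maps whole words to ids. The real tokenize_worker
# wraps a HuggingFace tokenizer; this is only an illustrative sketch.
VOCAB = {"hello": 0, "world": 1, "<unk>": 2}
INV_VOCAB = {v: k for k, v in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    """Text -> token ids (the tokenizer worker's direction)."""
    return [VOCAB.get(w, VOCAB["<unk>"]) for w in text.lower().split()]

def detokenize(tokens: list[int]) -> str:
    """Token ids -> text (the detokenizer worker's direction)."""
    return " ".join(INV_VOCAB[t] for t in tokens)
```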
Detokenizer Worker
Converts the numbers (tokens) generated by the model back into human-readable text. This component streams results back to the API Server for real-time response generation.
Scheduler Worker
The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP rank). It manages the computation and resource allocation for that specific GPU. Each scheduler manages:
- Engine: The TP worker on a single process that manages the model, context, KV cache, attention backend, and CUDA graph replay
- Cache Manager: Handles KV cache allocation and eviction (Radix or Naive mode)
- Table Manager: Manages the page table and token pool for requests
- Prefill Manager: Schedules prefill batches with chunked prefill support
- Decode Manager: Schedules decode batches
Architecture Diagram
Request Lifecycle
Here’s how a request flows through the system:
1. User sends a request to the API Server
2. API Server forwards it to the Tokenizer
3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0)
4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs)
5. All Schedulers schedule the request and trigger their local Engine to compute the next token
6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer
7. Detokenizer converts the token to text and sends it back to the API Server
8. API Server streams the result back to the User
For single-GPU deployments, step 4 remains simple as there’s only one scheduler. Multi-GPU setups use the broadcast mechanism to synchronize requests across all TP ranks.
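The single-GPU lifecycle can be sketched as a chain of hand-offs. Every function below is a hypothetical stand-in for the real component, with trivial bodies just to make the flow executable:

```python
# A toy single-GPU walk-through of the request lifecycle.
# Every function is a hypothetical stand-in for the real component.

def tokenizer(text):           # step 3: text -> tokens
    return [ord(c) for c in text]

def scheduler_engine(tokens):  # steps 5-6: schedule + compute next token
    return max(tokens) + 1     # dummy "next token"

def detokenizer(token):        # step 7: token -> text
    return chr(token)

def handle_request(prompt):    # steps 1-2 and 8: API server in and out
    tokens = tokenizer(prompt)
    next_token = scheduler_engine(tokens)
    return detokenizer(next_token)

print(handle_request("abc"))   # "d": the token after the largest input id
```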
Code Organization
The source code is located in python/minisgl. Here’s a breakdown of the modules:
Core Modules
- minisgl.core: Provides the core dataclasses Req and Batch, which represent request state; Context, which holds the global state of the inference context; and SamplingParams for user-provided sampling parameters.
- minisgl.engine: Implements the Engine class, a TP worker on a single process. It manages the model, context, KV cache, attention backend, and CUDA graph replay.
- minisgl.scheduler: Implements the Scheduler class, which runs on each TP worker process and manages the corresponding Engine. The rank 0 scheduler receives messages from the tokenizer, communicates with schedulers on other TP workers, and sends messages to the detokenizer.
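To give a feel for how such dataclasses fit together, here is a hedged sketch. The field names (rid, input_ids, output_ids, etc.) are assumptions for illustration; the real definitions in minisgl.core may differ.

```python
from dataclasses import dataclass, field

# Hedged sketch of the core dataclasses; field names are
# illustrative assumptions, not the real minisgl.core definitions.

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Req:
    rid: int                      # request id
    input_ids: list               # prompt tokens
    output_ids: list = field(default_factory=list)
    sampling: SamplingParams = field(default_factory=SamplingParams)

@dataclass
class Batch:
    reqs: list                    # requests scheduled together

    @property
    def size(self) -> int:
        return len(self.reqs)

batch = Batch(reqs=[Req(rid=0, input_ids=[1, 2, 3])])
```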
Model and Computation
- minisgl.models: Implements LLM models, including Llama and Qwen3, and defines utilities for loading weights from HuggingFace and sharding them.
- minisgl.layers: Implements the basic building blocks for building LLMs with TP support, including linear, layernorm, embedding, RoPE, etc. They share common base classes defined in minisgl.layers.base.
- minisgl.attention: Provides the interface for attention backends and implements FlashAttention and FlashInfer backends. They are called by AttentionLayer and use metadata stored in Context.
Memory Management
minisgl.kvcache: Provides the interface for the KV cache pool and KV cache manager, and implements MHAKVCache, NaiveCacheManager, and RadixCacheManager.
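The benefit of the radix mode is reusing the KV cache for shared prompt prefixes. A minimal sketch of that idea, using a flat dict rather than an actual radix tree and hypothetical names throughout:

```python
# Minimal illustration of KV-cache prefix reuse. A real radix cache
# stores prefixes in a tree; a flat dict is enough to show the idea.

cache = {}   # token prefix (tuple) -> placeholder for cached KV data

def insert(tokens):
    """Cache every prefix of a finished request's tokens."""
    for i in range(1, len(tokens) + 1):
        cache[tuple(tokens[:i])] = f"kv[{i}]"

def match_prefix(tokens):
    """Return the length of the longest cached prefix of `tokens`."""
    for i in range(len(tokens), 0, -1):
        if tuple(tokens[:i]) in cache:
            return i
    return 0

insert([10, 11, 12, 13])
reused = match_prefix([10, 11, 12, 99])  # shares a 3-token prefix
```

A new request sharing those first three tokens can skip recomputing their KV entries and only prefill the remainder.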
Distributed Computing
minisgl.distributed: Provides the interface to all-reduce and all-gather in tensor parallelism, and the dataclass DistributedInfo, which holds the TP information for a TP worker.
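What an all-gather recovers can be shown with plain lists standing in for tensors, with all "ranks" simulated in one process; the real path runs over NCCL between GPUs.

```python
# Column-parallel sharding simulated in-process: each "rank" owns a
# slice of the output dimension, and all-gather concatenates slices.

TP_SIZE = 2
full_output = list(range(8))            # stand-in for a layer's output

def shard(rank):
    """The slice of the output dimension owned by one TP rank."""
    per_rank = len(full_output) // TP_SIZE
    return full_output[rank * per_rank : (rank + 1) * per_rank]

def all_gather(shards):
    """Concatenate per-rank shards back into the full output."""
    return [x for s in shards for x in s]

gathered = all_gather([shard(r) for r in range(TP_SIZE)])
```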
Utilities and Infrastructure
- minisgl.message: Defines the messages exchanged (over ZMQ) between the API server, tokenizer, detokenizer, and scheduler. All message types support automatic serialization and deserialization.
- minisgl.server: Defines CLI arguments and launch_server, which starts all the subprocesses of Mini-SGLang.
- minisgl.tokenizer: Implements the tokenization and detokenization workers.
- minisgl.kernel: Implements custom CUDA kernels, supported by tvm-ffi for Python bindings and a JIT interface.
- minisgl.llm: Provides the LLM class as a convenient Python interface to the Mini-SGLang system.
- minisgl.utils: A collection of utilities, including logger setup and wrappers around ZMQ.
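One way message classes can support automatic serialization and deserialization is sketched below, using pickle purely as an illustration; the message type, its fields, and the wire format that minisgl.message actually uses are assumptions here.

```python
import pickle
from dataclasses import dataclass

# Illustrative message type; the real classes in minisgl.message
# and their fields may differ.

@dataclass
class TokenizedRequest:
    rid: int
    input_ids: list

def to_wire(msg):
    """Serialize a message for a ZMQ socket (pickle as illustration)."""
    return pickle.dumps(msg)

def from_wire(raw):
    """Deserialize a message received from a ZMQ socket."""
    return pickle.loads(raw)

msg = from_wire(to_wire(TokenizedRequest(rid=7, input_ids=[1, 2, 3])))
```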
Overlap Scheduling
Mini-SGLang employs overlap scheduling to hide CPU overhead behind GPU computation. The scheduler has two main loops:
Overlap Loop (Default)
Normal Loop
For debugging purposes, overlap scheduling can be disabled by setting MINISGL_DISABLE_OVERLAP_SCHEDULING=1. This runs scheduling and execution sequentially.
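The gain from overlapping can be seen with simple timing arithmetic. The 2 ms and 10 ms costs below are made-up numbers for illustration, not measurements of Mini-SGLang:

```python
# Made-up per-step costs, purely to illustrate the scheduling math.
CPU_MS = 2    # scheduling / bookkeeping on the CPU
GPU_MS = 10   # one forward step on the GPU
STEPS = 100

# Normal loop: schedule, then run, strictly in sequence.
sequential_ms = STEPS * (CPU_MS + GPU_MS)

# Overlap loop: while the GPU runs step i, the CPU schedules step
# i + 1, so CPU work is hidden except for the very first step.
overlapped_ms = CPU_MS + STEPS * GPU_MS
```

As long as the CPU work per step fits inside the GPU step time, overlapping makes the CPU overhead effectively disappear from the steady-state throughput.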
Communication Patterns
Mini-SGLang uses different communication mechanisms for different purposes:
- ZeroMQ (ZMQ): For control messages between the API server, tokenizer, detokenizer, and scheduler workers. Lightweight and efficient for small message passing.
- NCCL via torch.distributed: For heavy tensor data exchange between GPUs during tensor parallelism. Optimized for high-bandwidth GPU-to-GPU communication.
- PyNCCL (optional): An alternative to torch.distributed for tensor-parallel communication, providing custom NCCL bindings for better performance in certain scenarios.
Next Steps
- Radix Cache: Learn how KV cache reuse reduces redundant computation
- Chunked Prefill: Understand how to serve long-context requests efficiently
- Tensor Parallelism: Scale inference across multiple GPUs
- Overlap Scheduling: Hide CPU overhead with GPU computation