Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together to provide high-performance serving.

Overview

With a compact codebase of ~5,000 lines of Python, Mini-SGLang serves as both a capable inference engine and a transparent reference for researchers and developers. The system is built around a modular architecture that separates concerns across different components.

Key Components

The system consists of four main components that communicate via ZeroMQ (ZMQ) for control messages and NCCL (via torch.distributed) for heavy tensor data exchange between GPUs:

API Server

The entry point for users. It provides an OpenAI-compatible API (e.g., /v1/chat/completions) to receive prompts and return generated text. The API Server is implemented as a FastAPI application in minisgl.server.api_server.
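As a hedged illustration, a client payload for that endpoint can be assembled like this (the model name and server address are placeholders, not values from the Mini-SGLang docs):

```python
import json

# Hypothetical sketch: model name and server address are placeholders.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("my-model", "Hello, Mini-SGLang!")
# Send with any HTTP client, e.g. requests.post(API_URL, json=payload)
print(json.dumps(payload))
```

Setting `stream=True` asks the server to stream tokens back as they are generated, which is the mode the detokenizer's streaming path serves.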

Tokenizer Worker

Converts input text into numbers (tokens) that the model can understand. The tokenizer worker handles both tokenization requests (text → tokens) and detokenization requests (tokens → text) via the tokenize_worker function.
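A toy sketch of that round trip, with a whitespace vocabulary standing in for the real HuggingFace tokenizer:

```python
# Toy tokenizer: a whitespace vocabulary stands in for the real HF tokenizer.
VOCAB = {"hello": 0, "mini": 1, "sglang": 2, "<unk>": 3}
INV_VOCAB = {tid: tok for tok, tid in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    """Text -> token ids (tokenization request)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def detokenize(token_ids: list[int]) -> str:
    """Token ids -> text (detokenization request)."""
    return " ".join(INV_VOCAB[tid] for tid in token_ids)

print(detokenize(tokenize("hello mini sglang")))  # hello mini sglang
```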

Detokenizer Worker

Converts the numbers (tokens) generated by the model back into human-readable text. This component streams results back to the API Server for real-time response generation.

Scheduler Worker

The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU. Each scheduler manages:
  • Engine: The TP worker on a single process that manages the model, context, KV cache, attention backend, and CUDA graph replay
  • Cache Manager: Handles KV cache allocation and eviction (Radix or Naive mode)
  • Table Manager: Manages the page table and token pool for requests
  • Prefill Manager: Schedules prefill batches with chunked prefill support
  • Decode Manager: Schedules decode batches

Architecture Diagram

Mini-SGLang Architecture

Request Lifecycle

Here’s how a request flows through the system:
  1. User sends a request to the API Server
  2. API Server forwards it to the Tokenizer
  3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0)
  4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs)
  5. All Schedulers schedule the request and trigger their local Engine to compute the next token
  6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer
  7. Detokenizer converts the token to text and sends it back to the API Server
  8. API Server streams the result back to the User
For single-GPU deployments, step 4 is trivial since there is only one scheduler. Multi-GPU setups use the broadcast mechanism to synchronize requests across all TP ranks.
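The steps above can be sketched as a single-GPU pipeline. All functions here are illustrative stand-ins, not real Mini-SGLang APIs, and the "model" is a dummy arithmetic rule:

```python
# Illustrative single-GPU lifecycle sketch; none of these are real Mini-SGLang APIs.
def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]           # step 3: text -> tokens

def schedule_and_decode(tokens: list[int], max_new_tokens: int) -> list[int]:
    out: list[int] = []
    for _ in range(max_new_tokens):         # steps 5-6: one token per engine step
        out.append((sum(tokens + out) % 26) + 97)  # dummy "model"
    return out

def detokenize(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)  # step 7: tokens -> text

def handle_request(prompt: str) -> str:     # steps 1-2 and 8: API server in/out
    return detokenize(schedule_and_decode(tokenize(prompt), max_new_tokens=4))

print(handle_request("hi"))  # bvjl
```

In the real system each arrow between these functions is a ZMQ message between separate processes, and the decode loop runs on the GPU.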

Code Organization

The source code is located in python/minisgl. Here’s a breakdown of the modules:

Core Modules

  • minisgl.core: Provides core dataclasses Req and Batch representing the state of requests, Context which holds the global state of the inference context, and SamplingParams for user-provided sampling parameters.
  • minisgl.engine: Implements the Engine class, which is a TP worker on a single process. It manages the model, context, KV cache, attention backend, and CUDA graph replay.
  • minisgl.scheduler: Implements the Scheduler class, which runs on each TP worker process and manages the corresponding Engine. The rank 0 scheduler receives messages from the tokenizer, communicates with schedulers on other TP workers, and sends messages to the detokenizer.
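A hedged sketch of what such dataclasses might look like; the field names are illustrative, not the real minisgl.core definitions:

```python
from dataclasses import dataclass, field

# Illustrative shapes only; the real minisgl.core definitions may differ.
@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Req:
    rid: int
    input_ids: list[int]
    sampling: SamplingParams = field(default_factory=SamplingParams)
    output_ids: list[int] = field(default_factory=list)

@dataclass
class Batch:
    reqs: list[Req]

    @property
    def num_tokens(self) -> int:
        return sum(len(r.input_ids) + len(r.output_ids) for r in self.reqs)

batch = Batch(reqs=[Req(rid=0, input_ids=[1, 2, 3])])
print(batch.num_tokens)  # 3
```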

Model and Computation

  • minisgl.models: Implements model architectures including Llama and Qwen3, plus utilities for loading weights from HuggingFace and sharding them.
  • minisgl.layers: Implements the basic building blocks for LLMs with TP support, including linear, layernorm, embedding, and RoPE layers. They share common base classes defined in minisgl.layers.base.
  • minisgl.attention: Provides the interface for attention backends and implements FlashAttention and FlashInfer backends. They are called by AttentionLayer and use metadata stored in Context.
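A pure-Python sketch of the column-parallel weight sharding such TP layers rely on (illustrative only; real implementations slice tensors, not nested lists):

```python
# Illustrative column-parallel sharding: each TP rank keeps a contiguous
# slice of the output dimension of a weight matrix.
def shard_columns(weight: list[list[float]], tp_size: int, tp_rank: int) -> list[list[float]]:
    """Return this rank's contiguous slice of the output dimension."""
    out_dim = len(weight)
    assert out_dim % tp_size == 0, "output dim must divide evenly across ranks"
    per_rank = out_dim // tp_size
    start = tp_rank * per_rank
    return weight[start : start + per_rank]

w = [[float(i)] * 4 for i in range(8)]   # 8 output rows, 4 input cols
print(len(shard_columns(w, tp_size=2, tp_rank=1)))  # 4
```

After a column-parallel layer, each rank holds a partial output; an all-gather (or a subsequent row-parallel layer plus all-reduce) recombines the pieces.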

Memory Management

  • minisgl.kvcache: Provides the interfaces for the KV cache pool and KV cache manager, and implements MHAKVCache, NaiveCacheManager, and RadixCacheManager.
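A minimal sketch of the longest-prefix matching a radix-style cache manager relies on (illustrative; not the real RadixCacheManager, which uses a radix tree rather than a linear scan):

```python
# Illustrative longest-prefix match over cached token sequences,
# the core idea behind radix-style KV cache reuse.
def longest_cached_prefix(tokens: list[int], cache: list[list[int]]) -> int:
    """Return the length of the longest cached prefix of `tokens`."""
    best = 0
    for cached in cache:
        n = 0
        for a, b in zip(tokens, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

cache = [[1, 2, 3, 4], [1, 2, 9]]
print(longest_cached_prefix([1, 2, 3, 7], cache))  # 3
```

The matched prefix's KV entries can be reused directly, so only the remaining suffix needs to be prefilled.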

Distributed Computing

  • minisgl.distributed: Provides the interface for all-reduce and all-gather in tensor parallelism, plus the DistributedInfo dataclass, which holds the TP information for a TP worker.

Utilities and Infrastructure

  • minisgl.message: Defines the messages exchanged (over ZMQ) between the API server, tokenizer, detokenizer, and scheduler. All message types support automatic serialization and deserialization.
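A hedged sketch of such automatic (de)serialization using a dataclass and pickle; the real minisgl.message wire format and type names may differ:

```python
import pickle
from dataclasses import dataclass

# Illustrative only; the real minisgl.message wire format may differ.
@dataclass
class TokenizedMsg:
    rid: int
    token_ids: list[int]

def serialize(msg) -> bytes:
    """What a ZMQ sender would put on the wire."""
    return pickle.dumps(msg)

def deserialize(data: bytes):
    return pickle.loads(data)

msg = TokenizedMsg(rid=7, token_ids=[1, 2, 3])
assert deserialize(serialize(msg)) == msg
```

Dataclasses pair well with this pattern because the generated `__eq__` makes round-trip checks trivial.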
  • minisgl.server: Defines CLI arguments and launch_server which starts all the subprocesses of Mini-SGLang.
  • minisgl.tokenizer: Implements tokenization and detokenization workers.
  • minisgl.kernel: Implements custom CUDA kernels, supported by tvm-ffi for Python binding and JIT interface.
  • minisgl.llm: Provides class LLM as a Python interface to interact with the Mini-SGLang system easily.
  • minisgl.utils: Collection of utilities, including logger setup and wrappers around ZMQ.

Overlap Scheduling

Mini-SGLang employs overlap scheduling to hide CPU overhead. The scheduler has two main loops:

Overlap Loop (Default)

def overlap_loop(self, last_data: ForwardData | None) -> ForwardData | None:
    # Receive messages and process
    for msg in self.receive_msg(blocking=False):
        self._process_one_msg(msg)
    
    # Schedule next batch
    forward_input = self._schedule_next_batch()
    ongoing_data = None
    if forward_input is not None:
        # Run batch in engine's stream
        with self.engine_stream_ctx:
            self.engine.stream.wait_stream(self.stream)
            ongoing_data = (forward_input, self._forward(forward_input))
    
    # Process last batch's results while GPU works on current batch
    self._process_last_data(last_data)
    return ongoing_data
This approach overlaps the execution of the current batch with processing of the last batch’s results, effectively hiding CPU latency and improving GPU utilization.

Normal Loop

For debugging purposes, overlap scheduling can be disabled by setting MINISGL_DISABLE_OVERLAP_SCHEDULING=1. This runs scheduling and execution sequentially.
The overlap scheduling technique is based on the NanoFlow paper and significantly improves throughput by maximizing GPU utilization.

Communication Patterns

Mini-SGLang uses different communication mechanisms for different purposes:
  • ZeroMQ (ZMQ): For control messages between API server, tokenizer, detokenizer, and scheduler workers. Lightweight and efficient for small message passing.
  • NCCL via torch.distributed: For heavy tensor data exchange between GPUs during tensor parallelism. Optimized for high-bandwidth GPU-to-GPU communication.
  • PyNCCL (optional): An alternative to torch.distributed for tensor parallel communication, providing custom NCCL bindings for better performance in certain scenarios.

Next Steps

Radix Cache

Learn how KV cache reuse reduces redundant computation

Chunked Prefill

Understand how to serve long-context requests efficiently

Tensor Parallelism

Scale inference across multiple GPUs

Overlap Scheduling

Hide CPU overhead with GPU computation
