Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together to provide high-performance serving.

Overview

With a compact codebase of ~5,000 lines of Python, Mini-SGLang serves as both a capable inference engine and a transparent reference for researchers and developers. The system is built around a modular architecture that separates concerns across different components.

Key Components

The system consists of four main components that communicate via ZeroMQ (ZMQ) for control messages and NCCL (via torch.distributed) for heavy tensor data exchange between GPUs:

API Server

The entry point for users. It provides an OpenAI-compatible API (e.g., /v1/chat/completions) to receive prompts and return generated text. The API Server is implemented as a FastAPI application in minisgl.server.api_server.
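As a hedged illustration, a client payload for that endpoint can be assembled like this (the model name and server address are placeholders, not values from the Mini-SGLang docs):

```python
import json

# Hypothetical sketch: model name and server address are placeholders.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("my-model", "Hello, Mini-SGLang!")
# Send with any HTTP client, e.g. requests.post(API_URL, json=payload)
print(json.dumps(payload))
```

Setting `stream=True` asks the server to stream tokens back as they are generated, which is the mode the detokenizer's streaming path serves.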

Tokenizer Worker

Converts input text into numbers (tokens) that the model can understand. The tokenizer worker handles both tokenization requests (text → tokens) and detokenization requests (tokens → text) via the tokenize_worker function.
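A toy sketch of that round trip, with a whitespace vocabulary standing in for the real HuggingFace tokenizer:

```python
# Toy tokenizer: a whitespace vocabulary stands in for the real HF tokenizer.
VOCAB = {"hello": 0, "mini": 1, "sglang": 2, "<unk>": 3}
INV_VOCAB = {tid: tok for tok, tid in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    """Text -> token ids (tokenization request)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def detokenize(token_ids: list[int]) -> str:
    """Token ids -> text (detokenization request)."""
    return " ".join(INV_VOCAB[tid] for tid in token_ids)

print(detokenize(tokenize("hello mini sglang")))  # hello mini sglang
```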

Detokenizer Worker

Converts the numbers (tokens) generated by the model back into human-readable text. This component streams results back to the API Server for real-time response generation.

Scheduler Worker

The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU. Each scheduler manages:
  • Engine: The TP worker on a single process that manages the model, context, KV cache, attention backend, and CUDA graph replay
  • Cache Manager: Handles KV cache allocation and eviction (Radix or Naive mode)
  • Table Manager: Manages the page table and token pool for requests
  • Prefill Manager: Schedules prefill batches with chunked prefill support
  • Decode Manager: Schedules decode batches

Architecture Diagram

Mini-SGLang Architecture

Request Lifecycle

Here’s how a request flows through the system:
  1. User sends a request to the API Server
  2. API Server forwards it to the Tokenizer
  3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0)
  4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs)
  5. All Schedulers schedule the request and trigger their local Engine to compute the next token
  6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer
  7. Detokenizer converts the token to text and sends it back to the API Server
  8. API Server streams the result back to the User
For single-GPU deployments, step 4 is trivial since there is only one scheduler. Multi-GPU setups use the broadcast mechanism to synchronize requests across all TP ranks.
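The steps above can be sketched as a single-GPU pipeline. All functions here are illustrative stand-ins, not real Mini-SGLang APIs, and the "model" is a dummy arithmetic rule:

```python
# Illustrative single-GPU lifecycle sketch; none of these are real Mini-SGLang APIs.
def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]           # step 3: text -> tokens

def schedule_and_decode(tokens: list[int], max_new_tokens: int) -> list[int]:
    out: list[int] = []
    for _ in range(max_new_tokens):         # steps 5-6: one token per engine step
        out.append((sum(tokens + out) % 26) + 97)  # dummy "model"
    return out

def detokenize(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)  # step 7: tokens -> text

def handle_request(prompt: str) -> str:     # steps 1-2 and 8: API server in/out
    return detokenize(schedule_and_decode(tokenize(prompt), max_new_tokens=4))

print(handle_request("hi"))  # bvjl
```

In the real system each arrow between these functions is a ZMQ message between separate processes, and the decode loop runs on the GPU.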

Code Organization

The source code is located in python/minisgl. Here’s a breakdown of the modules:

Core Modules

  • minisgl.core: Provides core dataclasses Req and Batch representing the state of requests, Context which holds the global state of the inference context, and SamplingParams for user-provided sampling parameters.
  • minisgl.engine: Implements the Engine class, which is a TP worker on a single process. It manages the model, context, KV cache, attention backend, and CUDA graph replay.
  • minisgl.scheduler: Implements the Scheduler class, which runs on each TP worker process and manages the corresponding Engine. The rank 0 scheduler receives messages from the tokenizer, communicates with schedulers on other TP workers, and sends messages to the detokenizer.
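A hedged sketch of what such dataclasses might look like; the field names are illustrative, not the real minisgl.core definitions:

```python
from dataclasses import dataclass, field

# Illustrative shapes only; the real minisgl.core definitions may differ.
@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Req:
    rid: int
    input_ids: list[int]
    sampling: SamplingParams = field(default_factory=SamplingParams)
    output_ids: list[int] = field(default_factory=list)

@dataclass
class Batch:
    reqs: list[Req]

    @property
    def num_tokens(self) -> int:
        return sum(len(r.input_ids) + len(r.output_ids) for r in self.reqs)

batch = Batch(reqs=[Req(rid=0, input_ids=[1, 2, 3])])
print(batch.num_tokens)  # 3
```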

Model and Computation

  • minisgl.models: Implements model architectures including Llama and Qwen3, plus utilities for loading weights from HuggingFace and sharding them.
  • minisgl.layers: Implements the basic building blocks for LLMs with TP support, including linear, layernorm, embedding, and RoPE layers. They share common base classes defined in minisgl.layers.base.
  • minisgl.attention: Provides the interface for attention backends and implements FlashAttention and FlashInfer backends. They are called by AttentionLayer and use metadata stored in Context.
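A pure-Python sketch of the column-parallel weight sharding such TP layers rely on (illustrative only; real implementations slice tensors, not nested lists):

```python
# Illustrative column-parallel sharding: each TP rank keeps a contiguous
# slice of the output dimension of a weight matrix.
def shard_columns(weight: list[list[float]], tp_size: int, tp_rank: int) -> list[list[float]]:
    """Return this rank's contiguous slice of the output dimension."""
    out_dim = len(weight)
    assert out_dim % tp_size == 0, "output dim must divide evenly across ranks"
    per_rank = out_dim // tp_size
    start = tp_rank * per_rank
    return weight[start : start + per_rank]

w = [[float(i)] * 4 for i in range(8)]   # 8 output rows, 4 input cols
print(len(shard_columns(w, tp_size=2, tp_rank=1)))  # 4
```

After a column-parallel layer, each rank holds a partial output; an all-gather (or a subsequent row-parallel layer plus all-reduce) recombines the pieces.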

Memory Management

  • minisgl.kvcache: Provides the interfaces for the KV cache pool and KV cache manager, and implements MHAKVCache, NaiveCacheManager, and RadixCacheManager.
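A minimal sketch of the longest-prefix matching a radix-style cache manager relies on (illustrative; not the real RadixCacheManager, which uses a radix tree rather than a linear scan):

```python
# Illustrative longest-prefix match over cached token sequences,
# the core idea behind radix-style KV cache reuse.
def longest_cached_prefix(tokens: list[int], cache: list[list[int]]) -> int:
    """Return the length of the longest cached prefix of `tokens`."""
    best = 0
    for cached in cache:
        n = 0
        for a, b in zip(tokens, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

cache = [[1, 2, 3, 4], [1, 2, 9]]
print(longest_cached_prefix([1, 2, 3, 7], cache))  # 3
```

The matched prefix's KV entries can be reused directly, so only the remaining suffix needs to be prefilled.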

Distributed Computing

  • minisgl.distributed: Provides the interface for all-reduce and all-gather in tensor parallelism, plus the DistributedInfo dataclass, which holds the TP information for a TP worker.

Utilities and Infrastructure

  • minisgl.message: Defines the messages exchanged (over ZMQ) between the API server, tokenizer, detokenizer, and scheduler. All message types support automatic serialization and deserialization.
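A hedged sketch of such automatic (de)serialization using a dataclass and pickle; the real minisgl.message wire format and type names may differ:

```python
import pickle
from dataclasses import dataclass

# Illustrative only; the real minisgl.message wire format may differ.
@dataclass
class TokenizedMsg:
    rid: int
    token_ids: list[int]

def serialize(msg) -> bytes:
    """What a ZMQ sender would put on the wire."""
    return pickle.dumps(msg)

def deserialize(data: bytes):
    return pickle.loads(data)

msg = TokenizedMsg(rid=7, token_ids=[1, 2, 3])
assert deserialize(serialize(msg)) == msg
```

Dataclasses pair well with this pattern because the generated `__eq__` makes round-trip checks trivial.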
  • minisgl.server: Defines CLI arguments and launch_server which starts all the subprocesses of Mini-SGLang.
  • minisgl.tokenizer: Implements tokenization and detokenization workers.
  • minisgl.kernel: Implements custom CUDA kernels, supported by tvm-ffi for Python binding and JIT interface.
  • minisgl.llm: Provides class LLM as a Python interface to interact with the Mini-SGLang system easily.
  • minisgl.utils: Collection of utilities, including logger setup and wrappers around ZMQ.

Overlap Scheduling

Mini-SGLang employs overlap scheduling to hide CPU overhead. The scheduler has two main loops:

Overlap Loop (Default)

def overlap_loop(self, last_data: ForwardData | None) -> ForwardData | None:
    # Receive messages and process
    for msg in self.receive_msg(blocking=False):
        self._process_one_msg(msg)
    
    # Schedule next batch
    forward_input = self._schedule_next_batch()
    ongoing_data = None
    if forward_input is not None:
        # Run batch in engine's stream
        with self.engine_stream_ctx:
            self.engine.stream.wait_stream(self.stream)
            ongoing_data = (forward_input, self._forward(forward_input))
    
    # Process last batch's results while GPU works on current batch
    self._process_last_data(last_data)
    return ongoing_data
This approach overlaps the execution of the current batch with processing of the last batch’s results, effectively hiding CPU latency and improving GPU utilization.

Normal Loop

For debugging purposes, overlap scheduling can be disabled by setting MINISGL_DISABLE_OVERLAP_SCHEDULING=1. This runs scheduling and execution sequentially.
The overlap scheduling technique is based on the NanoFlow paper and significantly improves throughput by maximizing GPU utilization.

Communication Patterns

Mini-SGLang uses different communication mechanisms for different purposes:
  • ZeroMQ (ZMQ): For control messages between API server, tokenizer, detokenizer, and scheduler workers. Lightweight and efficient for small message passing.
  • NCCL via torch.distributed: For heavy tensor data exchange between GPUs during tensor parallelism. Optimized for high-bandwidth GPU-to-GPU communication.
  • PyNCCL (optional): An alternative to torch.distributed for tensor parallel communication, providing custom NCCL bindings for better performance in certain scenarios.

Next Steps

Radix Cache

Learn how KV cache reuse reduces redundant computation

Chunked Prefill

Understand how to serve long-context requests efficiently

Tensor Parallelism

Scale inference across multiple GPUs

Overlap Scheduling

Hide CPU overhead with GPU computation
