# Architecture Overview
The system consists of several key components that communicate via ZeroMQ (ZMQ) for control messages and NCCL for GPU tensor data.
## Core Components

### API Server

Implementation: `python/minisgl/server/api_server.py`
The API Server is the entry point for user requests:
- Provides OpenAI-compatible HTTP endpoints:
  - `/v1/chat/completions` - Chat completion API
  - `/v1/completions` - Text completion API
- Built with FastAPI for async request handling
- Receives text prompts from users
- Returns streaming or non-streaming responses
- Handles request routing to the tokenizer
- Manages response streaming back to clients
- Receives HTTP requests from users
- Sends tokenization requests to Tokenizer via ZMQ
- Receives generated text from Detokenizer via ZMQ
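The streaming path can be sketched with stdlib `asyncio`. This is a simplified stand-in (the `relay_stream` helper is hypothetical; the SSE chunk format follows the OpenAI API convention), not the server's actual code:

```python
import asyncio
import json

async def relay_stream(chunks: asyncio.Queue) -> list:
    """Drain decoded-text chunks from a queue (a stand-in for the ZMQ pull
    socket fed by the Detokenizer) and format them as SSE events, the wire
    format of streaming /v1/chat/completions responses."""
    events = []
    while True:
        text = await chunks.get()
        if text is None:  # sentinel: generation finished (EOS)
            events.append("data: [DONE]\n\n")
            break
        payload = {"choices": [{"delta": {"content": text}}]}
        events.append(f"data: {json.dumps(payload)}\n\n")
    return events

async def demo():
    q = asyncio.Queue()
    for piece in ["Hel", "lo", None]:
        q.put_nowait(piece)
    return await relay_stream(q)

events = asyncio.run(demo())
```

In the real server, each formatted event would be yielded to FastAPI's streaming response rather than collected in a list.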
### Tokenizer Worker

Implementation: `python/minisgl/tokenizer/server.py`
The Tokenizer Worker converts text to tokens:
- Runs as an independent process
- Loads the tokenizer model (from HuggingFace)
- Converts input text strings into token IDs
- Handles special tokens and chat templates
- Forwards tokenized requests to Scheduler
- Receives tokenization requests from API Server via ZMQ
- Sends tokenized data to Scheduler (Rank 0) via ZMQ
- Uses the `TokenizeMsg` message type from `python/minisgl/message/tokenizer.py`
### Detokenizer Worker

Implementation: `python/minisgl/tokenizer/detokenize.py` and `python/minisgl/tokenizer/server.py`
The Detokenizer Worker converts tokens back to text:
- Runs within the tokenizer worker process
- Receives token IDs from Scheduler
- Converts tokens into human-readable text
- Handles incremental decoding for streaming
- Sends decoded text back to API Server
- Receives token IDs from Scheduler (Rank 0) via ZMQ
- Sends decoded text to API Server via ZMQ
- Uses the `DetokenizeMsg` message type from `python/minisgl/message/tokenizer.py`
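Incremental decoding can be illustrated with a toy vocabulary: re-decode the accumulated token IDs each step and emit only the unseen suffix, so the streamed chunks always concatenate to the full text. The real worker uses the HuggingFace tokenizer and must also handle partial multi-byte characters:

```python
# Sketch of incremental detokenization for streaming. The toy VOCAB is
# purely illustrative; the real worker decodes with the loaded tokenizer.
VOCAB = {0: "Hel", 1: "lo", 2: ",", 3: " wor", 4: "ld"}

class IncrementalDecoder:
    def __init__(self):
        self.token_ids = []
        self.sent_len = 0  # characters already streamed to the client

    def step(self, token_id: int) -> str:
        """Add one generated token and return the new text to stream."""
        self.token_ids.append(token_id)
        full_text = "".join(VOCAB[t] for t in self.token_ids)
        delta = full_text[self.sent_len:]
        self.sent_len = len(full_text)
        return delta

dec = IncrementalDecoder()
chunks = [dec.step(t) for t in [0, 1, 2, 3, 4]]
# chunks == ["Hel", "lo", ",", " wor", "ld"]
```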
### Scheduler Worker

Implementation: `python/minisgl/scheduler/scheduler.py`
The Scheduler is the core orchestrator of inference:
- One scheduler per GPU in multi-GPU (Tensor Parallel) setups
- Each scheduler is called a TP Rank
- Manages request queuing and batching
- Allocates KV cache resources
- Controls the inference engine on its GPU
- Implements continuous batching for efficiency
Rank 0 additionally:
- Receives tokenized requests from Tokenizer
- Broadcasts requests to all other scheduler ranks
- Collects generated tokens from Engine
- Sends tokens to Detokenizer
- Handles abort and control messages

Non-zero ranks:
- Receive broadcast requests from Rank 0
- Run inference in parallel with Rank 0
- Participate in tensor-parallel computation
- Synchronize via NCCL for tensor operations
Key submodules:
- Request table management (`python/minisgl/scheduler/table.py`)
- Batch preparation for prefill and decode (`python/minisgl/scheduler/prefill.py`, `python/minisgl/scheduler/decode.py`)
- KV cache allocation (`python/minisgl/scheduler/cache.py`)
- Message I/O handling (`python/minisgl/scheduler/io.py`)
- Rank 0 receives messages via ZMQ from Tokenizer
- Rank 0 sends messages via ZMQ to Detokenizer
- All ranks synchronize via NCCL (torch.distributed) for tensor data
- Inter-scheduler communication for distributed inference
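The continuous-batching behavior mentioned above can be sketched in plain Python. This is a toy model (all names hypothetical); the real scheduler also tracks KV cache pages and separates prefill from decode:

```python
from collections import deque

def continuous_batching(requests, max_batch_size=4):
    """Toy continuous-batching loop: at every decode step, admit waiting
    requests into the running batch up to capacity, decode one token for
    each running request, and retire finished ones immediately so their
    slots can be reused on the very next step."""
    waiting = deque(requests)   # (name, tokens_to_generate) pairs
    running = {}                # name -> tokens remaining
    trace = []                  # batch composition at each step
    while waiting or running:
        while waiting and len(running) < max_batch_size:
            name, n = waiting.popleft()
            running[name] = n
        trace.append(sorted(running))
        for name in list(running):   # one decode step per running request
            running[name] -= 1
            if running[name] == 0:   # finished: free the slot
                del running[name]
    return trace

trace = continuous_batching([("a", 2), ("b", 1), ("c", 3)], max_batch_size=2)
# trace == [["a", "b"], ["a", "c"], ["c"], ["c"]]
```

Note how `c` enters the batch as soon as `b` finishes, without waiting for `a`; that slot reuse is the point of continuous batching.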
### Engine

Implementation: `python/minisgl/engine/engine.py`
The Engine is the TP worker on a single GPU:
- One engine per GPU process
- Loads and manages the LLM model
- Manages the inference context (`Context` from `python/minisgl/core.py`)
- Controls the KV cache pool
- Selects and uses attention backend (FlashAttention, FlashInfer, TensorRT-LLM)
- Implements CUDA graph capture for decode optimization
- Performs actual model forward passes
- Executes sampling to generate next tokens
Key submodules:
- Model loading and weight sharding
- Attention backend management (`python/minisgl/attention/`)
- KV cache management (`python/minisgl/kvcache/`)
- CUDA graph optimization (`python/minisgl/engine/graph.py`)
- Token sampling (`python/minisgl/engine/sample.py`)
- Controlled by local Scheduler via function calls (same process)
- Participates in NCCL collectives for tensor-parallel operations
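The sampling stage can be sketched as follows. This is a hedged, CPU-only illustration of temperature and top-k sampling over raw logits; the real `sample.py` operates on GPU tensors and its exact options are not shown here:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from raw logits: temperature scaling,
    optional top-k filtering, then sampling from the softmax
    distribution. temperature=0 means greedy decoding."""
    if temperature == 0:  # greedy: just take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k is not None:  # keep only the k largest logits
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()  # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1

token = sample_next_token([0.1, 3.0, 0.5, 2.0], temperature=0)  # greedy -> index 1
```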
## Communication Protocols

### ZeroMQ (ZMQ)

Purpose: Control message passing between processes
Implementation: `python/minisgl/utils/mp.py`
Mini-SGLang uses ZMQ for lightweight inter-process communication:
- Push/Pull Queues: Point-to-point request/response
  - `ZmqPushQueue` - Send messages
  - `ZmqPullQueue` - Receive messages
  - `ZmqAsyncPushQueue`, `ZmqAsyncPullQueue` - Async variants
- Pub/Sub Queues: Broadcast to multiple subscribers
  - `ZmqPubQueue` - Publish messages
  - `ZmqSubQueue` - Subscribe to messages
Message types:
- Defined in `python/minisgl/message/`
- Support automatic serialization/deserialization
- Type-safe message passing
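The message layer can be sketched with a stdlib stand-in. A common pattern is for the queue classes to wrap ZMQ sockets and pickle typed message objects; the exact wire format here is an assumption, `zmq_send`/`zmq_recv` are hypothetical stand-ins, and only the `TokenizeMsg` name comes from the source:

```python
import pickle
from dataclasses import dataclass

# Hypothetical message shape; the real TokenizeMsg/DetokenizeMsg types
# are defined in python/minisgl/message/tokenizer.py.
@dataclass
class TokenizeMsg:
    request_id: str
    text: str

def zmq_send(payload) -> bytes:
    """Stand-in for a push queue's send: serialize the typed message so
    the receiving process can reconstruct an equal object."""
    return pickle.dumps(payload)

def zmq_recv(frame: bytes):
    """Stand-in for a pull queue's recv: deserialize back to the object."""
    return pickle.loads(frame)

msg = TokenizeMsg(request_id="req-1", text="Hello")
roundtrip = zmq_recv(zmq_send(msg))
# roundtrip == msg: the receiver gets a typed object, not raw bytes
```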
### NCCL (NVIDIA Collective Communications Library)

Purpose: High-performance GPU-to-GPU tensor communication
Implementation: `python/minisgl/distributed/impl.py`, `python/minisgl/kernel/pynccl.py`
NCCL is used for tensor parallelism:
- All-Reduce: Aggregate results across GPUs (e.g., after row-parallel linear layers)
- All-Gather: Collect tensors from all GPUs
- Broadcast: Send tensor from one GPU to all others
- Synchronizes model weights across TP ranks
- Combines partial results from parallel computations
- Used by tensor-parallel layers in `python/minisgl/layers/`
## Request Lifecycle

Here's how a request flows through the system:

### 1. User Request (User → API Server)
- User sends HTTP POST to `/v1/chat/completions`
- API Server validates the request and extracts the prompt
### 2. Tokenization (API Server → Tokenizer)
- API Server sends `TokenizeMsg` via ZMQ
- Tokenizer converts text to token IDs
### 3. Scheduling (Tokenizer → Scheduler Rank 0)
- Tokenizer sends the tokenized request via ZMQ
- Scheduler creates a `Req` object (`python/minisgl/core.py`)
- Adds the request to the request table
### 4. Broadcasting (Scheduler Rank 0 → All Schedulers; multi-GPU only)
- Rank 0 broadcasts the request to other ranks via NCCL
- All schedulers synchronize request state
### 5. Batch Preparation (All Schedulers)
- Each scheduler prepares a `Batch` object
- Allocates KV cache pages
- Creates attention metadata
- Groups prefill and decode requests
### 6. Inference (Scheduler → Engine)
- Scheduler calls `Engine.forward()`
- Engine performs the model forward pass
- Tensor-parallel layers use NCCL for synchronization
- Attention backend computes attention
- KV cache is updated
- Token sampling generates the next token
### 7. Detokenization (Scheduler Rank 0 → Detokenizer)
- Rank 0 sends the generated token ID via ZMQ
- Detokenizer converts it to text
- Handles streaming decoding
### 8. Response (Detokenizer → API Server → User)
- Detokenizer sends the text chunk via ZMQ
- API Server streams it to the user via HTTP
- The process repeats until EOS or the max token limit
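The steps above can be simulated in a single process, with stdlib queues standing in for the ZMQ sockets, threads standing in for the worker processes, a whitespace "tokenizer", and one scheduler rank. All of these are simplifications for illustration only:

```python
import queue
import threading

# One queue per arrow in the lifecycle diagram above.
to_tokenizer, to_scheduler, to_detokenizer, to_api = (queue.Queue() for _ in range(4))

def tokenizer():
    req_id, text = to_tokenizer.get()
    to_scheduler.put((req_id, text.split()))   # step 2: text -> "token IDs"

def scheduler():
    req_id, tokens = to_scheduler.get()
    for tok in tokens:                         # steps 5-6: one token per step
        to_detokenizer.put((req_id, tok))
    to_detokenizer.put((req_id, None))         # EOS sentinel

def detokenizer():
    while True:
        req_id, tok = to_detokenizer.get()
        to_api.put((req_id, tok))              # step 7: stream text back
        if tok is None:
            break

workers = [threading.Thread(target=f) for f in (tokenizer, scheduler, detokenizer)]
for w in workers:
    w.start()

to_tokenizer.put(("req-1", "hello world"))     # step 1: user request
streamed = []
while True:                                    # step 8: API server streams chunks
    _, chunk = to_api.get()
    if chunk is None:
        break
    streamed.append(chunk)
for w in workers:
    w.join()
# streamed == ["hello", "world"]
```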
## Multi-GPU Tensor Parallelism

In tensor-parallel setups:

- Model Sharding:
  - Model weights are split across GPUs
  - Each GPU holds a portion of each layer
  - Sharding is handled by `python/minisgl/models/weight.py`
- Parallel Computation:
  - All GPUs process the same batch simultaneously
  - Column-parallel layers split output features
  - Row-parallel layers split input features
  - Implemented in `python/minisgl/layers/linear.py`
- Synchronization:
  - NCCL all-reduce after row-parallel layers
  - Fast GPU-to-GPU communication
  - Minimal overhead for tensor operations
- Coordinator Pattern:
  - Rank 0 scheduler coordinates I/O
  - All ranks execute inference in lockstep
  - Results are collected at Rank 0
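The column-parallel then row-parallel pattern can be verified with plain-Python matrices, modeling the NCCL all-reduce as an elementwise sum across two hypothetical ranks. The nonlinearity between the two layers is omitted; it applies elementwise to the local column-parallel halves, which is why no communication is needed at that point:

```python
# Plain-Python sketch of tensor-parallel linear layers across 2 "ranks".
# Matrices are lists of rows; all-reduce is modeled as an elementwise sum.

def matmul(x, w):  # x: (n, k), w: (k, m)
    return [[sum(xi[t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
            for xi in x]

x = [[1.0, 2.0]]                    # one input row, hidden size 2
W1 = [[1.0, 2.0, 3.0, 4.0],         # (2, 4) column-parallel weight
      [5.0, 6.0, 7.0, 8.0]]
W2 = [[1.0], [1.0], [1.0], [1.0]]   # (4, 1) row-parallel weight

# Column-parallel: each rank holds half of W1's OUTPUT columns.
h0 = matmul(x, [row[:2] for row in W1])   # rank 0: columns 0-1
h1 = matmul(x, [row[2:] for row in W1])   # rank 1: columns 2-3

# Row-parallel: each rank holds the matching half of W2's INPUT rows,
# so each rank produces only a PARTIAL result from its local activations.
p0 = matmul(h0, W2[:2])
p1 = matmul(h1, W2[2:])

# All-reduce (sum) combines the partial results into the full output.
out_parallel = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(p0, p1)]
out_serial = matmul(matmul(x, W1), W2)    # single-GPU reference
# out_parallel == out_serial: sharded compute + all-reduce is exact
```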
## Process Launching

Implementation: `python/minisgl/server/launch.py`

The `launch_server` function orchestrates all processes:
- Spawns API Server process
- Spawns Tokenizer/Detokenizer process
- Spawns Scheduler processes (one per GPU)
- Each Scheduler creates its Engine
- Sets up ZMQ and NCCL communication
- Monitors process health
Entry point: `python/minisgl/__main__.py`

Users can start the entire system via this module entry point (`python -m minisgl`).