Mini-SGLang provides comprehensive command-line options for configuring model serving, performance optimization, memory management, and networking. This page documents all available CLI arguments for python -m minisgl.

Model Configuration

--model-path
string
required
Path to model weights. Can be a local folder or a Hugging Face repo ID.
Alias: --model
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B"
python -m minisgl --model-path "/local/path/to/model"
--dtype
string
default:"auto"
Data type for model weights and activations.
Choices: auto, float16, bfloat16, float32
auto uses FP16 for FP32/FP16 models and BF16 for BF16 models.
Example:
python -m minisgl --model "meta-llama/Llama-3-8B" --dtype bfloat16
--model-source
string
default:"huggingface"
Source to download the model from.
Choices: huggingface, modelscope
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --model-source modelscope

Performance Configuration

--tensor-parallel-size
int
default:"1"
Tensor parallelism size for distributed serving across multiple GPUs.
Alias: --tp-size
Example:
# Run on 4 GPUs
python -m minisgl --model "meta-llama/Llama-3-70B" --tp-size 4
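Conceptually, tensor parallelism shards each weight matrix across GPUs, each rank computing a partial result that is then gathered. The toy NumPy sketch below illustrates the column-parallel case only; it is not how minisgl implements it.

```python
import numpy as np

# Toy column-parallel linear layer: each "rank" holds a slice of W's columns.
tp_size = 4
x = np.random.rand(2, 8)    # batch of activations
W = np.random.rand(8, 16)   # full weight matrix

# Shard the weight along the output dimension, one shard per rank.
shards = np.split(W, tp_size, axis=1)

# Each rank computes its partial output; an all-gather concatenates them.
partials = [x @ shard for shard in shards]
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)  # matches the unsharded computation
```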
--max-running-requests
int
default:"256"
Maximum number of concurrently running requests.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-running-requests 128
--max-prefill-length
int
default:"8192"
Maximum chunk size in tokens for chunked prefill. Controls the maximum number of tokens processed in a single prefill iteration.
Alias: --max-extend-length
Setting this to a very small value (e.g., 128) is not recommended, as it may significantly degrade performance.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 4096
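To see what this setting does, the sketch below (an illustration, not minisgl code) splits a long prompt into prefill iterations of at most --max-prefill-length tokens each.

```python
def prefill_chunks(num_prompt_tokens: int, max_prefill_length: int):
    """Split a prompt into prefill iterations of at most max_prefill_length tokens."""
    chunks, start = [], 0
    while start < num_prompt_tokens:
        end = min(start + max_prefill_length, num_prompt_tokens)
        chunks.append((start, end))
        start = end
    return chunks

# A 10,000-token prompt with --max-prefill-length 4096 takes 3 prefill steps.
print(prefill_chunks(10_000, 4096))
# -> [(0, 4096), (4096, 8192), (8192, 10000)]
```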
--attention-backend
string
default:"auto"
Attention backend to use. If two backends are specified (comma-separated), the first is used for prefill and the second for decode.
Alias: --attn
Choices: auto, fa (FlashAttention), fi (FlashInfer), trtllm (TensorRT-LLM)
See Attention Backends for detailed information.
Example:
# Use FlashAttention for prefill and FlashInfer for decode
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa,fi

# Use same backend for both phases
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa
--cuda-graph-max-bs
int
default:"auto"
Maximum batch size for CUDA graph capture. Setting this to 0 disables CUDA graph optimization. When not specified, the value is auto-tuned based on GPU memory.
Alias: --graph
See CUDA Graph for detailed information.
Example:
# Set maximum batch size to 32
python -m minisgl --model "Qwen/Qwen3-0.6B" --cuda-graph-max-bs 32

# Disable CUDA graph
python -m minisgl --model "Qwen/Qwen3-0.6B" --cuda-graph-max-bs 0
--moe-backend
string
default:"auto"
MoE (Mixture of Experts) backend to use for MoE models.
Choices: auto, plus other supported MoE backends
Example:
python -m minisgl --model "Qwen/Qwen3-MoE" --moe-backend auto

Memory Configuration

--memory-ratio
float
default:"0.9"
Fraction of GPU memory to use for the KV cache. Must be between 0 and 1.
Example:
# Use 85% of GPU memory for KV cache
python -m minisgl --model "Qwen/Qwen3-0.6B" --memory-ratio 0.85
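For intuition on how this ratio translates into cache capacity, here is a back-of-the-envelope estimate. All the model and hardware numbers below are assumptions (an 80 GB GPU and Llama-3-8B-like shapes); minisgl derives the real values from the model config at startup.

```python
# Back-of-the-envelope KV-cache capacity estimate (illustrative only).
gpu_mem_bytes = 80 * 1024**3   # 80 GB GPU (assumed)
memory_ratio  = 0.85           # --memory-ratio 0.85
layers        = 32             # assumed model config (Llama-3-8B-like)
kv_heads      = 8              # grouped-query KV heads (assumed)
head_dim      = 128            # assumed
dtype_bytes   = 2              # fp16/bf16

# Each cached token stores a K and a V vector per layer per KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
tokens = int(gpu_mem_bytes * memory_ratio // bytes_per_token)
print(f"~{tokens:,} cacheable tokens")   # roughly half a million tokens here
```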
--max-seq-len-override
int
default:"null"
Override the maximum sequence length from the model config.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-seq-len-override 32768
--num-pages
int
default:"null"
Set the maximum number of pages for the KV cache. Overrides the automatic calculation based on the memory ratio.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --num-pages 10000
--page-size
int
default:"1"
Page size for the KV cache management system. Some attention backends may override this value (e.g., TRT-LLM supports only 16, 32, or 64).
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --page-size 16
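In a paged KV cache, a sequence occupies whole pages, so total capacity is num-pages times page-size tokens, and the last page of each sequence may be partly unused. A small sketch of that arithmetic (illustrative, not minisgl internals):

```python
import math

def pages_needed(seq_len_tokens: int, page_size: int) -> int:
    """A sequence pins whole pages, so round its length up to a page boundary."""
    return math.ceil(seq_len_tokens / page_size)

# With --page-size 16, a 1000-token sequence pins 63 pages (1008 token slots),
# and --num-pages 10000 gives a total capacity of 160,000 token slots.
print(pages_needed(1000, 16))   # -> 63
print(10_000 * 16)              # -> 160000
```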
--cache-type
string
default:"radix"
KV cache management strategy.
Choices: radix, naive
The radix cache allows KV-cache reuse for shared prefixes across requests. See Cache Management for detailed information.
Example:
# Use naive cache (no prefix sharing)
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type naive
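The benefit of the radix cache comes from requests that share leading tokens, such as a common system prompt. The toy example below (illustrative token IDs, not minisgl code) shows the quantity a radix cache can reuse: the length of the shared prefix.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = [101, 7, 7, 42, 9]          # shared system-prompt tokens (toy IDs)
req_a  = system + [11, 12]
req_b  = system + [13, 14, 15]

# With --cache-type radix, req_b can reuse the KV entries already computed
# for req_a's 5 shared tokens; with naive, it recomputes all of them.
print(shared_prefix_len(req_a, req_b))  # -> 5
```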

Network Configuration

--host
string
default:"127.0.0.1"
Host address for the server to bind to.
Example:
# Listen on all network interfaces
python -m minisgl --model "Qwen/Qwen3-0.6B" --host 0.0.0.0
--port
int
default:"1919"
Port number for the server to listen on.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --port 8000
--num-tokenizer
int
default:"0"
Number of tokenizer processes to launch. 0 means the tokenizer is shared with the detokenizer.
Alias: --tokenizer-count
Example:
# Launch 2 dedicated tokenizer processes
python -m minisgl --model "Qwen/Qwen3-0.6B" --num-tokenizer 2

Advanced Options

--disable-pynccl
boolean
default:"false"
Disable PyNCCL for tensor parallelism. PyNCCL is enabled by default.
Example:
python -m minisgl --model "meta-llama/Llama-3-70B" --tp-size 4 --disable-pynccl
--dummy-weight
boolean
default:"false"
Use dummy weights for testing instead of loading the actual model weights.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --dummy-weight
--shell-mode
boolean
default:"false"
Run the server in interactive shell mode for demonstration and testing.
When enabled, this automatically sets --cuda-graph-max-bs 1 and --max-running-requests 1.
Example:
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell-mode

Common Usage Examples

Basic serving

python -m minisgl --model "Qwen/Qwen3-0.6B"
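Once the server is up, you can send it requests. The sketch below assumes an OpenAI-compatible /v1/chat/completions route on the default host and port; that route is an assumption (common for SGLang-family servers), so check minisgl's API documentation for the actual endpoints before relying on it.

```python
import json
from urllib import request

# Assumed payload shape: OpenAI-style chat completions (verify against minisgl).
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32,
}
req = request.Request(
    "http://127.0.0.1:1919/v1/chat/completions",   # default --host and --port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```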

High-throughput serving

python -m minisgl \
  --model "meta-llama/Llama-3-8B" \
  --max-running-requests 512 \
  --memory-ratio 0.95 \
  --attn fa,fi

Multi-GPU serving

python -m minisgl \
  --model "meta-llama/Llama-3-70B" \
  --tp-size 8 \
  --dtype bfloat16 \
  --max-running-requests 256

Long context serving

python -m minisgl \
  --model "Qwen/Qwen3-0.6B" \
  --max-seq-len-override 32768 \
  --max-prefill-length 4096 \
  --memory-ratio 0.9

Testing with shell mode

python -m minisgl --model "Qwen/Qwen3-0.6B" --shell-mode
