```
python -m minisgl.
```
## Model Configuration

- Path to model weights. Can be a local folder or a Hugging Face repo ID. Alias: `--model`
- Data type for model weights and activations. Choices: `auto`, `float16`, `bfloat16`, `float32`. `auto` uses FP16 for FP32/FP16 models and BF16 for BF16 models.
- Source to download the model from. Choices: `huggingface`, `modelscope`

## Performance Configuration
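As an illustration of the `auto` dtype rule above, here is a minimal sketch; the function name and signature are hypothetical and not part of minisgl's API:

```python
def resolve_dtype(requested: str, checkpoint_dtype: str) -> str:
    """Sketch of the documented 'auto' rule: FP16 for FP32/FP16
    checkpoints, BF16 for BF16 checkpoints."""
    if requested != "auto":
        return requested  # an explicit dtype is used as-is
    if checkpoint_dtype == "bfloat16":
        return "bfloat16"
    return "float16"  # covers float16 and float32 checkpoints
```

For example, a FP32 checkpoint served with `auto` runs in FP16, halving weight memory relative to the checkpoint.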
- Tensor parallelism size for distributed serving across multiple GPUs. Alias: `--tp-size`
- Maximum number of concurrent running requests.
- Maximum chunk size in tokens for chunked prefill; controls the maximum number of tokens processed in a single prefill iteration. Alias: `--max-extend-length`. Setting this to a very small value (e.g., 128) is not recommended, as it may significantly degrade performance.
- Attention backend to use. If two backends are specified (comma-separated), the first is used for prefill and the second for decode. Alias: `--attn`. Choices: `auto`, `fa` (FlashAttention), `fi` (FlashInfer), `trtllm` (TensorRT-LLM). See Attention Backends for detailed information.
- Maximum batch size for CUDA graph capture. Setting it to 0 disables CUDA graph optimization; when not specified, the value is auto-tuned based on GPU memory. Alias: `--graph`. See CUDA Graph for detailed information.
- MoE (Mixture of Experts) backend to use for MoE models. Choices: `auto` and other supported MoE backends.

## Memory Configuration
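Two of the behaviors above can be sketched in a few lines. The helper names are hypothetical and only mirror what the descriptions say; in particular, a single backend value applying to both phases is an assumption, since the docs only specify the two-value case:

```python
def parse_attn_backends(spec: str) -> tuple[str, str]:
    """Comma-separated spec: first backend for prefill, second for decode."""
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], parts[0]  # assumed: a single value serves both phases

def prefill_chunks(prompt_len: int, max_extend_length: int) -> list[tuple[int, int]]:
    """Token ranges processed per prefill iteration under chunked prefill."""
    chunks, start = [], 0
    while start < prompt_len:
        end = min(start + max_extend_length, prompt_len)
        chunks.append((start, end))
        start = end
    return chunks
```

For instance, `--attn fa,fi` would use FlashAttention for prefill and FlashInfer for decode, and a 10,000-token prompt with a 4,096-token chunk size takes three prefill iterations.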
- Fraction of GPU memory to use for the KV cache. The value must be between 0 and 1.
- Override the maximum sequence length from the model config.
- Maximum number of pages for the KV cache. Overrides the automatic calculation based on the memory ratio.
- Page size for the KV cache management system. Some attention backends may override this value (e.g., TRT-LLM only supports 16, 32, or 64).
- KV cache management strategy. Choices: `radix`, `naive`. The radix cache allows reuse of KV cache for shared prefixes across requests. See Cache Management for detailed information.

## Network Configuration
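The relationship between the memory ratio, page size, and page count might look roughly like the sketch below. The per-token KV-size formula and all names here are illustrative assumptions, not minisgl's actual calculation:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors stored per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def auto_max_pages(free_gpu_bytes: int, mem_ratio: float,
                   page_size: int, per_token_bytes: int) -> int:
    # One page holds page_size tokens; count whole pages that fit
    # inside the fraction of GPU memory reserved for the KV cache.
    budget = int(free_gpu_bytes * mem_ratio)
    return budget // (per_token_bytes * page_size)
```

Under these assumptions, raising the memory ratio or shrinking the page size increases the number of pages, which is what the manual max-pages override bypasses.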
- Host address for the server to bind to.
- Port number for the server to listen on.
- Number of tokenizer processes to launch. 0 means the tokenizer is shared with the detokenizer. Alias: `--tokenizer-count`

## Advanced Options
- Disable PyNCCL for tensor parallelism. By default, PyNCCL is enabled.
- Use dummy weights for testing purposes instead of loading actual model weights.
- Run the server in interactive shell mode for demonstration and testing. When enabled, this automatically sets `--cuda-graph-max-bs 1` and `--max-running-requests 1`.
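The documented interactive-mode override amounts to forcing a single-request configuration on top of whatever else was passed. A minimal sketch, with hypothetical snake_case argument names standing in for the CLI flags:

```python
def apply_interactive_overrides(args: dict) -> dict:
    """Force the single-request settings the docs describe for shell mode."""
    out = dict(args)
    out["cuda_graph_max_bs"] = 1      # mirrors --cuda-graph-max-bs 1
    out["max_running_requests"] = 1   # mirrors --max-running-requests 1
    return out
```

Other settings (e.g., tensor parallelism) are left untouched; only the batching-related knobs are pinned to 1 so a single interactive session drives the server.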