```
python -m minisgl.
```
## Model Configuration

- Path to model weights. Can be a local folder or a Hugging Face repo ID. Alias: `--model`
- Data type for model weights and activations. Choices: `auto`, `float16`, `bfloat16`, `float32`. `auto` uses FP16 for FP32/FP16 models and BF16 for BF16 models.
- Source to download the model from. Choices: `huggingface`, `modelscope`

## Performance Configuration
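As an illustration of the `auto` dtype rule above, here is a minimal sketch; the function name and signature are hypothetical and not part of minisgl's API:

```python
def resolve_dtype(requested: str, checkpoint_dtype: str) -> str:
    """Sketch of the documented 'auto' rule: FP16 for FP32/FP16
    checkpoints, BF16 for BF16 checkpoints."""
    if requested != "auto":
        return requested  # an explicit dtype is used as-is
    if checkpoint_dtype == "bfloat16":
        return "bfloat16"
    return "float16"  # covers float16 and float32 checkpoints
```

For example, a FP32 checkpoint served with `auto` runs in FP16, halving weight memory relative to the checkpoint.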
- Tensor parallelism size for distributed serving across multiple GPUs. Alias: `--tp-size`
- Maximum number of concurrent running requests.
- Maximum chunk size in tokens for chunked prefill; controls the maximum number of tokens processed in a single prefill iteration. Alias: `--max-extend-length`. Setting this to a very small value (e.g., 128) is not recommended, as it may significantly degrade performance.
- Attention backend to use. If two backends are specified (comma-separated), the first is used for prefill and the second for decode. Alias: `--attn`. Choices: `auto`, `fa` (FlashAttention), `fi` (FlashInfer), `trtllm` (TensorRT-LLM). See Attention Backends for detailed information.
- Maximum batch size for CUDA graph capture. Setting it to 0 disables CUDA graph optimization; when not specified, the value is auto-tuned based on GPU memory. Alias: `--graph`. See CUDA Graph for detailed information.
- MoE (Mixture of Experts) backend to use for MoE models. Choices: `auto` and other supported MoE backends.

## Memory Configuration
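Two of the behaviors above can be sketched in a few lines. The helper names are hypothetical and only mirror what the descriptions say; in particular, a single backend value applying to both phases is an assumption, since the docs only specify the two-value case:

```python
def parse_attn_backends(spec: str) -> tuple[str, str]:
    """Comma-separated spec: first backend for prefill, second for decode."""
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], parts[0]  # assumed: a single value serves both phases

def prefill_chunks(prompt_len: int, max_extend_length: int) -> list[tuple[int, int]]:
    """Token ranges processed per prefill iteration under chunked prefill."""
    chunks, start = [], 0
    while start < prompt_len:
        end = min(start + max_extend_length, prompt_len)
        chunks.append((start, end))
        start = end
    return chunks
```

For instance, `--attn fa,fi` would use FlashAttention for prefill and FlashInfer for decode, and a 10,000-token prompt with a 4,096-token chunk size takes three prefill iterations.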
- Fraction of GPU memory to use for the KV cache. The value must be between 0 and 1.
- Override the maximum sequence length from the model config.
- Maximum number of pages for the KV cache. Overrides the automatic calculation based on the memory ratio.
- Page size for the KV cache management system. Some attention backends may override this value (e.g., TRT-LLM only supports 16, 32, or 64).
- KV cache management strategy. Choices: `radix`, `naive`. The radix cache allows reuse of KV cache for shared prefixes across requests. See Cache Management for detailed information.

## Network Configuration
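The relationship between the memory ratio, page size, and page count might look roughly like the sketch below. The per-token KV-size formula and all names here are illustrative assumptions, not minisgl's actual calculation:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors stored per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def auto_max_pages(free_gpu_bytes: int, mem_ratio: float,
                   page_size: int, per_token_bytes: int) -> int:
    # One page holds page_size tokens; count whole pages that fit
    # inside the fraction of GPU memory reserved for the KV cache.
    budget = int(free_gpu_bytes * mem_ratio)
    return budget // (per_token_bytes * page_size)
```

Under these assumptions, raising the memory ratio or shrinking the page size increases the number of pages, which is what the manual max-pages override bypasses.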
- Host address for the server to bind to.
- Port number for the server to listen on.
- Number of tokenizer processes to launch. 0 means the tokenizer is shared with the detokenizer. Alias: `--tokenizer-count`

## Advanced Options
- Disable PyNCCL for tensor parallelism. By default, PyNCCL is enabled.
- Use dummy weights for testing purposes instead of loading actual model weights.
- Run the server in interactive shell mode for demonstration and testing. When enabled, this automatically sets `--cuda-graph-max-bs 1` and `--max-running-requests 1`.
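The documented interactive-mode override amounts to forcing a single-request configuration on top of whatever else was passed. A minimal sketch, with hypothetical snake_case argument names standing in for the CLI flags:

```python
def apply_interactive_overrides(args: dict) -> dict:
    """Force the single-request settings the docs describe for shell mode."""
    out = dict(args)
    out["cuda_graph_max_bs"] = 1      # mirrors --cuda-graph-max-bs 1
    out["max_running_requests"] = 1   # mirrors --max-running-requests 1
    return out
```

Other settings (e.g., tensor parallelism) are left untouched; only the batching-related knobs are pinned to 1 so a single interactive session drives the server.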