ServerArgs
The ServerArgs class contains all configuration options for launching an SGLang server or engine. These arguments control model loading, memory management, parallelism, kernel backends, and optimization settings.
Usage
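Server arguments are typically passed as CLI flags to the launch script; the Engine API accepts the same options as keyword arguments. A minimal sketch (flag names follow recent SGLang releases; run `python -m sglang.launch_server --help` to confirm them for your installed version):

```shell
# Launch an OpenAI-compatible server with a model from Hugging Face Hub.
# The model path shown is illustrative; substitute your own.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```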
Model and Tokenizer
Path to the model on Hugging Face Hub or local filesystem.
Path to the tokenizer. Defaults to model_path if not specified.
Tokenizer mode. Options: "auto", "slow", "fast".
Skip tokenizer initialization. Useful when passing pre-tokenized input_ids.
Model weight loading format. Options: "auto", "pt", "safetensors", "npcache", "dummy", "gguf", "bitsandbytes".
Trust remote code when loading models from Hugging Face.
Maximum context length. Auto-detected from model config if not specified.
Model revision (branch, tag, or commit) to use from Hugging Face.
HTTP Server
Server host address.
Server port number.
API key for authentication.
Model name to report in API responses. Defaults to model_path.
Data Type and Quantization
Data type for model weights and computation. Options: "auto", "float16", "bfloat16", "float32".
Quantization method. Options: "awq", "fp8", "gptq", "marlin", "bitsandbytes", "gguf", and more.
Data type for KV cache. Options: "auto", "fp8_e4m3", "fp8_e5m2", "bfloat16", "float16". Using FP8 for the KV cache can significantly reduce memory usage.
Memory and Scheduling
Fraction of GPU memory to use for model weights and KV cache. Auto-calculated based on GPU memory capacity if not specified.
Maximum total tokens in the KV cache pool. This is the maximum number of tokens that can be cached across all requests.
Maximum number of requests to process simultaneously.
Maximum number of requests to queue when busy.
Chunk size for chunked prefill. Auto-calculated based on GPU memory capacity if not specified:
- Small GPUs (<20GB): 2048
- Medium GPUs (20-60GB): 4096
- Large GPUs (>60GB): 8192+
Maximum tokens for prefill phase.
Scheduling policy. Options: "fcfs" (first-come-first-served), "lpm" (longest-prefix-match).
Enable priority-based request scheduling.
Parallelism
Tensor parallelism size (number of GPUs for model parallelism).
Data parallelism size (number of independent model replicas).
Pipeline parallelism size (number of pipeline stages).
Number of nodes in a multi-node setup.
Current node rank (0 to nnodes-1).
Kernel Backends
Attention kernel backend. Options: "flashinfer", "triton", "torch_native", "fa3" (FlashAttention-3). Auto-selected based on hardware if not specified.
Sampling backend. Options: "flashinfer", "pytorch".
Structured generation backend. Options: "xgrammar", "outlines", "llguidance", "none".
CUDA Graph Optimization
Disable CUDA graph optimization.
Maximum batch size for CUDA graph capture. Auto-calculated based on GPU memory:
- Small GPUs: 8-24
- Medium GPUs: 32-160
- Large GPUs: 256-512
Disable padding in CUDA graph batch sizes.
Speculative Decoding
Speculative decoding algorithm. Options: "EAGLE", "STANDALONE", "NGRAM".
Path to the draft model for speculative decoding.
Number of speculative decoding steps.
Number of draft tokens to generate per step.
LoRA
Enable LoRA adapter support.
Maximum LoRA rank to support.
Paths to LoRA adapters to pre-load.
Maximum number of LoRA adapters to keep loaded.
LoRA kernel backend. Options: "triton", "csgmv", "torch_native".
Expert Parallelism (MoE)
Expert parallelism size for Mixture-of-Experts models.
MoE kernel backend. Options: "auto", "triton", "flashinfer_cutlass", "deep_gemm".
All-to-all communication backend for MoE. Options: "none", "deepep", "mooncake".
Logging and Monitoring
Logging level. Options: "debug", "info", "warning", "error".
Log all requests and responses.
Show time cost for each request.
Enable Prometheus metrics.
Enable OpenTelemetry tracing.
OpenTelemetry collector endpoint.
Advanced Options
Disable radix cache (prefix caching) optimization.
Random seed for reproducibility. Auto-generated if not specified.
Token interval for streaming responses.
Directory for downloading models from Hugging Face.
Enable PyTorch compilation for model optimization.
Device to use. Options: "cuda", "cpu", "npu". Auto-detected if not specified.
Configuration Examples
Basic Configuration
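A minimal launch: just a model path and port, with everything else auto-detected. The model path is illustrative, and flag names may differ slightly across SGLang versions:

```shell
# Smallest viable configuration; memory fraction, context length,
# and kernel backends are all auto-selected.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```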
Production Configuration
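A production-style sketch (all values illustrative): tensor parallelism across 4 GPUs, an API key, bounded concurrency, and Prometheus metrics. Verify flag names against `--help` for your release:

```shell
# Bind to all interfaces, require an API key, and cap concurrent requests
# so latency stays predictable under load.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --host 0.0.0.0 \
  --port 30000 \
  --api-key "$SGLANG_API_KEY" \
  --max-running-requests 128 \
  --mem-fraction-static 0.85 \
  --enable-metrics
```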
Quantized Model
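A sketch combining weight quantization with an FP8 KV cache, which together reduce both weight and cache memory (values are assumptions; pick the quantization method that matches your checkpoint):

```shell
# FP8 weights plus an fp8_e5m2 KV cache to lower memory usage,
# trading a small amount of numerical precision.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2
```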
Multi-LoRA Configuration
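A sketch that pre-loads two LoRA adapters on a shared base model; adapter names and paths are placeholders, and the `name=path` syntax for `--lora-paths` should be confirmed against your installed version:

```shell
# Serve one base model with multiple adapters resident at once;
# requests select an adapter by its registered name.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b \
  --max-loras-per-batch 4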
Data Parallelism
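A sketch running two independent replicas of the same model behind one endpoint, which raises throughput for many small requests (the replica count is illustrative):

```shell
# Two full copies of the model; incoming requests are balanced
# across the replicas rather than sharding the model itself.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 2
```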
Speculative Decoding
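A sketch enabling EAGLE speculative decoding with a separate draft model; the draft-model path and step/token counts are assumptions to tune for your hardware:

```shell
# The draft model proposes tokens that the target model verifies in
# parallel, reducing per-token latency when acceptance rates are high.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path /path/to/eagle-draft-model \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4
```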
Multi-Node Configuration
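A sketch of tensor parallelism spanning two nodes with 8 GPUs each (tp_size = 16); the address, model, and flag names are assumptions to verify against your version's `--help`:

```shell
# On node 0 (rank 0). NODE0_IP:50000 is a placeholder rendezvous address
# that every node must be able to reach.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp-size 16 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr NODE0_IP:50000

# On node 1, run the same command with --node-rank 1 and the
# identical --dist-init-addr.
```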
See Also
- Engine - Main inference engine
- Runtime - HTTP server wrapper
- SamplingParams - Sampling configuration
