Overview
This page documents all available server arguments for launching SGLang. These can be passed as command-line flags or programmatically when creating an Engine.
Model and Tokenizer
- Path to model weights. Can be a local folder or a Hugging Face repo ID. Examples: `meta-llama/Llama-3.1-8B-Instruct`, `/local/path/to/model`.
- Path to tokenizer. Defaults to the model path if not specified.
- Tokenizer mode. Options: `auto` (use the fast tokenizer if available), `slow` (always use the slow tokenizer).
- Skip tokenizer initialization. When enabled, you must pass `input_ids` directly to generate.
- Model weight format to load. Options: `auto`, `pt`, `safetensors`, `npcache`, `dummy`, `gguf`, `bitsandbytes`, `layered`, `remote`.
- Allow custom models from the Hugging Face Hub with custom modeling files.
- Maximum context length. Defaults to the value from the model's `config.json`.
- Use a causal LM as an embedding model.
- Enable multimodal functionality for vision/audio models.
- Model version (branch name, tag, or commit ID) to use from Hugging Face.
HTTP Server
- Host address for the HTTP server.
- Port for the HTTP server.
- Use a gRPC server instead of HTTP.
- Skip the warmup phase on server startup.
- API key for authentication. Clients must include it in the Authorization header.
Data Types and Quantization
- Data type for model weights and activations. Options: `auto`, `half`, `float16`, `bfloat16`, `float`, `float32`.
- Quantization method. Options: `awq`, `fp8`, `gptq`, `marlin`, `gptq_marlin`, `awq_marlin`, `bitsandbytes`, `gguf`.
- KV cache data type. Options: `auto`, `fp8_e5m2`, `fp8_e4m3`, `bf16`, `bfloat16`, `fp4_e2m1`.
- Use FP32 precision for the language model head (logits).
Memory and Scheduling
- Fraction of GPU memory to allocate for static usage (model weights + KV cache). Calculated automatically from GPU memory and configuration if not set.
- Maximum number of requests processed concurrently.
- Maximum number of requests in the queue.
- Maximum total tokens in the KV cache pool. Limits memory usage.
- Maximum tokens to process in a single prefill batch. Set automatically based on GPU memory. Larger values improve throughput but require more memory.
- Enable dynamic chunking for variable-length prefill batches.
- Maximum tokens allowed in a single prefill request.
- Request scheduling policy. Options: `fcfs` (first-come-first-serve), `lpm` (longest-prefix-match).
- Enable priority-based scheduling. Requests can specify a priority value.
Parallelism
- Tensor parallelism size. Splits the model across this many GPUs.
- Data parallelism size. Runs this many independent replicas.
- Pipeline parallelism size. Distributes model layers across this many stages.
- Load balancing method for data parallelism. Options: `auto`, `round_robin`, `shortest_queue`, `follow_bootstrap_room`.
- Expert parallelism size for MoE models.
Multi-Node
- Number of nodes for distributed serving.
- Rank of this node (0 to nnodes-1).
- Distributed initialization address. Format: `host:port`. Example: `192.168.1.100:5000`.
CUDA Graph Optimization
- Maximum batch size for CUDA graph capture. Set automatically based on GPU memory. Higher values enable larger batches but use more memory.
- Disable CUDA graph optimization.
- Disable padding in CUDA graph batch sizes.
Cache Configuration
- Disable the radix attention cache (prefix caching).
- Include cache hit rate statistics in API responses.
- Eviction policy for the radix cache. Options: `lru` (least recently used), `lfu` (least frequently used).
Attention Backend
- Attention mechanism backend. Options: `flashinfer`, `triton`, `torch_native`, `fa3`, `fa4`, `flex_attention`. Selected automatically based on hardware if not specified.
- Separate attention backend for the prefill phase.
- Separate attention backend for the decode phase.
- Sampling backend. Options: `flashinfer`, `pytorch`.
LoRA Adapters
- Enable LoRA adapter support.
- Maximum LoRA rank to support.
- Paths to LoRA adapters to load at startup.
- Maximum number of different LoRA adapters in a single batch.
- LoRA computation backend. Options: `triton`, `csgmv`, `torch_native`.
Speculative Decoding
- Speculative decoding algorithm. Options: `EAGLE`, `STANDALONE`, `NGRAM`.
- Path to the draft model for speculative decoding.
- Number of speculative steps.
- Number of draft tokens per step.
Disaggregation
- Prefill-decode disaggregation mode. Options: `null` (no disaggregation), `prefill` (prefill server), `decode` (decode server).
- Transfer backend for PD disaggregation. Options: `mooncake`, `nixl`, `fake`.
- InfiniBand device(s) for disaggregation. Format: `mlx5_0` or `mlx5_0,mlx5_1`.
Logging and Monitoring
- Logging level. Options: `debug`, `info`, `warning`, `error`.
- Log all incoming requests and responses.
- Enable Prometheus metrics at the `/metrics` endpoint.
- Show time cost breakdown in responses.
Advanced Options
- Random seed for reproducibility.
- Token interval for streaming responses.
- Watchdog timeout in seconds. Kills the worker if there is no heartbeat.
- Directory for downloading models from Hugging Face.
- Starting GPU ID for multi-GPU setups.
- Enable PyTorch compilation for improved performance.
- Enable the peer-to-peer GPU connectivity check.
- Enable deterministic inference for reproducible outputs.
Model-Specific Options
- Name to serve the model as. Defaults to the model path.
- Custom chat template (Jinja2 format).
- Custom completion template.
- Tool call parser for function calling. Options: `hermes`, `qwen`, `glm`.
- Reasoning parser for o1-style models.
Example Configurations
Small Model (8B)
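The original command for this configuration was not preserved; the following is a hedged sketch of a typical single-GPU launch. The model name and port are illustrative, and flag names can differ between SGLang versions.

```shell
# Launch a small model on a single GPU with default settings
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```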
Large Model with TP (70B)
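A hedged sketch of serving a 70B model with tensor parallelism; the GPU count, model name, and port are illustrative assumptions.

```shell
# Shard the model across 4 GPUs via tensor parallelism
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --port 30000
```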
Quantized Model
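A hedged sketch of launching with on-the-fly quantization; the choice of `fp8` and the model name are illustrative (any method listed under Data Types and Quantization applies).

```shell
# Serve with FP8 quantization to reduce the memory footprint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --port 30000
```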
Multi-Node Setup
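A hedged sketch of a two-node launch. Both nodes run the same command except for the rank; the IP address, port, and GPU counts are illustrative assumptions.

```shell
# Node 0 (rank 0), reachable at 192.168.1.100
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 --nnodes 2 --node-rank 0 \
  --dist-init-addr 192.168.1.100:5000

# Node 1 (rank 1), pointing at the same init address
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 --nnodes 2 --node-rank 1 \
  --dist-init-addr 192.168.1.100:5000
```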
Data Parallelism
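A hedged sketch of a data-parallel launch; the replica count and model name are illustrative assumptions.

```shell
# Run 4 independent replicas behind one endpoint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --port 30000
```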
Disaggregated Setup
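A hedged sketch of prefill-decode disaggregation, assuming two separate processes (one per mode) as described in the Disaggregation section; ports, model name, and transfer-backend defaults are illustrative, and the exact bootstrap flags vary by SGLang version.

```shell
# Prefill server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000

# Decode server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001
```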
Production Server
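A hedged sketch of a production-oriented launch combining the HTTP server and monitoring options above; the key placeholder and port are illustrative assumptions.

```shell
# Bind publicly, require an API key, and expose Prometheus metrics
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000 \
  --api-key YOUR_SECRET_KEY \
  --enable-metrics \
  --log-level info
```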
Python API
All arguments can also be passed programmatically when creating an Engine:
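A minimal sketch, assuming SGLang is installed and a GPU is available; keyword names mirror the CLI flags with dashes replaced by underscores, and the model name and sampling parameters here are illustrative.

```python
import sglang as sgl

# Construct an offline engine; kwargs correspond to the server arguments above
engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    mem_fraction_static=0.85,
)

# Generate text, then release GPU resources
print(engine.generate("The capital of France is", {"max_new_tokens": 8}))
engine.shutdown()
```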
See Also
- Launch Server - Server launch guide
- Native API - Python API usage
- Sampling Parameters - Generation control
