Overview

This page documents all available server arguments for launching SGLang. These can be passed via command-line flags or programmatically when creating an Engine.

Model and Tokenizer

model-path
string
required
Path to model weights. Can be a local folder or HuggingFace repo ID. Examples: meta-llama/Llama-3.1-8B-Instruct, /local/path/to/model
tokenizer-path
string
default:"null"
Path to tokenizer. Defaults to model-path if not specified.
tokenizer-mode
string
default:"auto"
Tokenizer mode. Options: auto (use fast tokenizer if available), slow (always use slow tokenizer).
skip-tokenizer-init
bool
default:"false"
Skip tokenizer initialization. When enabled, you must pass input_ids directly to generate.
load-format
string
default:"auto"
Model weight format to load. Options: auto, pt, safetensors, npcache, dummy, gguf, bitsandbytes, layered, remote
trust-remote-code
bool
default:"false"
Allow custom models from HuggingFace Hub with custom modeling files.
context-length
int
default:"null"
Maximum context length. Defaults to the value from the model’s config.json.
is-embedding
bool
default:"false"
Use a causal LM as an embedding model.
enable-multimodal
bool
default:"false"
Enable multimodal functionality for vision/audio models.
revision
string
default:"null"
Model version (branch name, tag, or commit ID) to use from HuggingFace.

HTTP Server

host
string
default:"127.0.0.1"
Host address for the HTTP server.
port
int
default:"30000"
Port for the HTTP server.
grpc-mode
bool
default:"false"
Use gRPC server instead of HTTP.
skip-server-warmup
bool
default:"false"
Skip warmup phase on server startup.
api-key
string
default:"null"
API key for authentication. Clients must include this in the Authorization header.
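
When api-key is set, clients pass it as a bearer token. A minimal sketch (the /v1/chat/completions path assumes SGLang's OpenAI-compatible endpoint; the key and model name are placeholders):

```shell
# Query a server started with --api-key your-secret-key
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```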

Data Types and Quantization

dtype
string
default:"auto"
Data type for model weights and activations. Options: auto, half, float16, bfloat16, float, float32
quantization
string
default:"null"
Quantization method. Options: awq, fp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf
kv-cache-dtype
string
default:"auto"
KV cache data type. Options: auto, fp8_e5m2, fp8_e4m3, bf16, bfloat16, fp4_e2m1
enable-fp32-lm-head
bool
default:"false"
Use FP32 precision for language model head (logits).

Memory and Scheduling

mem-fraction-static
float
default:"auto"
Fraction of GPU memory to allocate for static usage (model weights + KV cache). Automatically calculated based on GPU memory and configuration if not set.
max-running-requests
int
default:"null"
Maximum number of requests being processed concurrently.
max-queued-requests
int
default:"null"
Maximum number of requests in the queue.
max-total-tokens
int
default:"null"
Maximum total tokens in KV cache pool. Limits memory usage.
chunked-prefill-size
int
default:"auto"
Maximum tokens to process in a single prefill batch. Automatically set based on GPU memory. Larger values improve throughput but require more memory.
enable-dynamic-chunking
bool
default:"false"
Enable dynamic chunking for variable-length prefill batches.
max-prefill-tokens
int
default:"16384"
Maximum tokens allowed in a single prefill request.
schedule-policy
string
default:"fcfs"
Request scheduling policy. Options: fcfs (first-come-first-serve), lpm (longest-prefix-match)
enable-priority-scheduling
bool
default:"false"
Enable priority-based scheduling. Requests can specify a priority value.
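
The scheduling knobs above combine naturally. A sketch for a chat workload with many shared prefixes (the numeric values are illustrative, not tuned):

```shell
# lpm favors requests that reuse cached prefixes;
# the request caps bound memory pressure under bursty load.
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --schedule-policy lpm \
  --max-running-requests 256 \
  --max-queued-requests 1024 \
  --chunked-prefill-size 8192
```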

Parallelism

tp-size
int
default:"1"
Tensor parallelism size. Split model across this many GPUs.
dp-size
int
default:"1"
Data parallelism size. Run this many independent replicas.
pp-size
int
default:"1"
Pipeline parallelism size. Distribute model layers across this many stages.
load-balance-method
string
default:"auto"
Load balancing method for data parallelism. Options: auto, round_robin, shortest_queue, follow_bootstrap_room
ep-size
int
default:"1"
Expert parallelism size for MoE models.

Multi-Node

nnodes
int
default:"1"
Number of nodes for distributed serving.
node-rank
int
default:"0"
Rank of this node (0 to nnodes-1).
dist-init-addr
string
default:"null"
Distributed initialization address. Format: host:port. Example: 192.168.1.100:5000

CUDA Graph Optimization

cuda-graph-max-bs
int
default:"auto"
Maximum batch size for CUDA graph capture. Automatically set based on GPU memory. Higher values enable larger batches but use more memory.
disable-cuda-graph
bool
default:"false"
Disable CUDA graph optimization.
disable-cuda-graph-padding
bool
default:"false"
Disable padding in CUDA graph batch sizes.

Cache Configuration

disable-radix-cache
bool
default:"false"
Disable radix attention cache (prefix caching).
enable-cache-report
bool
default:"false"
Include cache hit rate statistics in API responses.
radix-eviction-policy
string
default:"lru"
Eviction policy for radix cache. Options: lru (least recently used), lfu (least frequently used)

Attention Backend

attention-backend
string
default:"null"
Attention mechanism backend. Options: flashinfer, triton, torch_native, fa3, fa4, flex_attention. Automatically selected based on hardware if not specified.
prefill-attention-backend
string
default:"null"
Separate attention backend for prefill phase.
decode-attention-backend
string
default:"null"
Separate attention backend for decode phase.
sampling-backend
string
default:"null"
Sampling backend. Options: flashinfer, pytorch

LoRA Adapters

enable-lora
bool
default:"false"
Enable LoRA adapter support.
max-lora-rank
int
default:"null"
Maximum LoRA rank to support.
lora-paths
array[string]
default:"null"
Paths to LoRA adapters to load at startup.
max-loras-per-batch
int
default:"8"
Maximum number of different LoRA adapters in a single batch.
lora-backend
string
default:"csgmv"
LoRA computation backend. Options: triton, csgmv, torch_native
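
Putting the LoRA flags together, a sketch that preloads two adapters (the adapter paths are placeholders, and the exact --lora-paths value syntax may differ by SGLang version):

```shell
# Serve a base model with two LoRA adapters preloaded
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-paths /path/to/adapter-a /path/to/adapter-b \
  --max-loras-per-batch 4 \
  --lora-backend csgmv
```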

Speculative Decoding

speculative-algorithm
string
default:"null"
Speculative decoding algorithm. Options: EAGLE, STANDALONE, NGRAM
speculative-draft-model-path
string
default:"null"
Path to draft model for speculative decoding.
speculative-num-steps
int
default:"null"
Number of speculative steps.
speculative-num-draft-tokens
int
default:"null"
Number of draft tokens per step.
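
A sketch of an EAGLE configuration using the flags above (the draft model path is a placeholder; step and token counts are illustrative and should be tuned per model):

```shell
# Speculative decoding with an EAGLE draft model
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path /path/to/eagle-draft-model \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 8
```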

Disaggregation

disaggregation-mode
string
default:"null"
Prefill-decode disaggregation mode. Options: null (no disaggregation), prefill (prefill server), decode (decode server)
disaggregation-transfer-backend
string
default:"mooncake"
Transfer backend for PD disaggregation. Options: mooncake, nixl, fake
disaggregation-ib-device
string
default:"null"
InfiniBand device(s) for disaggregation. Format: mlx5_0 or mlx5_0,mlx5_1

Logging and Monitoring

log-level
string
default:"info"
Logging level. Options: debug, info, warning, error
log-requests
bool
default:"false"
Log all incoming requests and responses.
enable-metrics
bool
default:"false"
Enable Prometheus metrics at /metrics endpoint.
show-time-cost
bool
default:"false"
Show time cost breakdown in responses.

Advanced Options

random-seed
int
default:"null"
Random seed for reproducibility.
stream-interval
int
default:"1"
Token interval for streaming responses.
watchdog-timeout
float
default:"300"
Watchdog timeout in seconds. The worker is killed if it produces no heartbeat within this window.
download-dir
string
default:"null"
Directory for downloading models from HuggingFace.
base-gpu-id
int
default:"0"
Starting GPU ID for multi-GPU setups.
enable-torch-compile
bool
default:"false"
Enable PyTorch compilation for improved performance.
enable-p2p-check
bool
default:"false"
Enable peer-to-peer GPU connectivity check.
enable-deterministic-inference
bool
default:"false"
Enable deterministic inference for reproducible outputs.

Model-Specific Options

served-model-name
string
default:"null"
Name to serve the model as. Defaults to model path.
chat-template
string
default:"null"
Custom chat template (Jinja2 format).
completion-template
string
default:"null"
Custom completion template.
tool-call-parser
string
default:"null"
Tool call parser for function calling. Options: hermes, qwen, glm
reasoning-parser
string
default:"null"
Reasoning parser for o1-style models.
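
These options combine when exposing a model under an alias with function calling. A sketch (the model repo and alias are illustrative; the parser must match the model family):

```shell
# Serve under an alias with tool-call parsing enabled
sglang serve \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --served-model-name my-model \
  --tool-call-parser qwen
```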

Example Configurations

Small Model (8B)

sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.8

Large Model with TP (70B)

sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 8192 \
  --cuda-graph-max-bs 256

Quantized Model

sglang serve \
  --model-path TheBloke/Llama-2-13B-AWQ \
  --quantization awq \
  --dtype half \
  --kv-cache-dtype fp8_e4m3

Multi-Node Setup

# Node 0
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr 192.168.1.100:5000

# Node 1
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 \
  --nnodes 2 \
  --node-rank 1 \
  --dist-init-addr 192.168.1.100:5000

Data Parallelism

sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --load-balance-method round_robin

Disaggregated Setup

# Prefill server
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake

# Decode server
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake

Production Server

sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --host 0.0.0.0 \
  --port 30000 \
  --api-key your-secret-key \
  --enable-metrics \
  --log-level info \
  --log-requests \
  --mem-fraction-static 0.85 \
  --cuda-graph-max-bs 256

Python API

All arguments can be used when creating an Engine:
from sglang import Engine

engine = Engine(
    model_path="meta-llama/Llama-3.1-70B-Instruct",
    tp_size=4,
    mem_fraction_static=0.85,
    trust_remote_code=True,
    dtype="bfloat16",
    log_level="info"
)

See Also