Overview
This page documents all available server arguments for launching SGLang. These can be passed as command-line flags or programmatically when creating an Engine.
Model and Tokenizer
- Path to model weights. Can be a local folder or a Hugging Face repo ID. Examples: `meta-llama/Llama-3.1-8B-Instruct`, `/local/path/to/model`.
- Path to tokenizer. Defaults to the model path if not specified.
- Tokenizer mode. Options: `auto` (use the fast tokenizer if available), `slow` (always use the slow tokenizer).
- Skip tokenizer initialization. When enabled, you must pass `input_ids` directly to generate.
- Model weight format to load. Options: `auto`, `pt`, `safetensors`, `npcache`, `dummy`, `gguf`, `bitsandbytes`, `layered`, `remote`.
- Allow custom models from the Hugging Face Hub with custom modeling files.
- Maximum context length. Defaults to the value from the model's `config.json`.
- Use a causal LM as an embedding model.
- Enable multimodal functionality for vision/audio models.
- Model version (branch name, tag, or commit ID) to use from Hugging Face.
HTTP Server
- Host address for the HTTP server.
- Port for the HTTP server.
- Use a gRPC server instead of HTTP.
- Skip the warmup phase on server startup.
- API key for authentication. Clients must include it in the Authorization header.
Data Types and Quantization
- Data type for model weights and activations. Options: `auto`, `half`, `float16`, `bfloat16`, `float`, `float32`.
- Quantization method. Options: `awq`, `fp8`, `gptq`, `marlin`, `gptq_marlin`, `awq_marlin`, `bitsandbytes`, `gguf`.
- KV cache data type. Options: `auto`, `fp8_e5m2`, `fp8_e4m3`, `bf16`, `bfloat16`, `fp4_e2m1`.
- Use FP32 precision for the language model head (logits).
Memory and Scheduling
- Fraction of GPU memory to allocate for static usage (model weights + KV cache). Calculated automatically from GPU memory and configuration if not set.
- Maximum number of requests processed concurrently.
- Maximum number of requests in the queue.
- Maximum total tokens in the KV cache pool. Limits memory usage.
- Maximum tokens to process in a single prefill batch. Set automatically based on GPU memory. Larger values improve throughput but require more memory.
- Enable dynamic chunking for variable-length prefill batches.
- Maximum tokens allowed in a single prefill request.
- Request scheduling policy. Options: `fcfs` (first-come-first-serve), `lpm` (longest-prefix-match).
- Enable priority-based scheduling. Requests can specify a priority value.
Parallelism
- Tensor parallelism size. Splits the model across this many GPUs.
- Data parallelism size. Runs this many independent replicas.
- Pipeline parallelism size. Distributes model layers across this many stages.
- Load balancing method for data parallelism. Options: `auto`, `round_robin`, `shortest_queue`, `follow_bootstrap_room`.
- Expert parallelism size for MoE models.
Multi-Node
- Number of nodes for distributed serving.
- Rank of this node (0 to nnodes-1).
- Distributed initialization address. Format: `host:port`. Example: `192.168.1.100:5000`.
CUDA Graph Optimization
- Maximum batch size for CUDA graph capture. Set automatically based on GPU memory. Higher values enable larger batches but use more memory.
- Disable CUDA graph optimization.
- Disable padding in CUDA graph batch sizes.
Cache Configuration
- Disable the radix attention cache (prefix caching).
- Include cache hit rate statistics in API responses.
- Eviction policy for the radix cache. Options: `lru` (least recently used), `lfu` (least frequently used).
Attention Backend
- Attention mechanism backend. Options: `flashinfer`, `triton`, `torch_native`, `fa3`, `fa4`, `flex_attention`. Selected automatically based on hardware if not specified.
- Separate attention backend for the prefill phase.
- Separate attention backend for the decode phase.
- Sampling backend. Options: `flashinfer`, `pytorch`.
LoRA Adapters
- Enable LoRA adapter support.
- Maximum LoRA rank to support.
- Paths to LoRA adapters to load at startup.
- Maximum number of different LoRA adapters in a single batch.
- LoRA computation backend. Options: `triton`, `csgmv`, `torch_native`.
Speculative Decoding
- Speculative decoding algorithm. Options: `EAGLE`, `STANDALONE`, `NGRAM`.
- Path to the draft model for speculative decoding.
- Number of speculative steps.
- Number of draft tokens per step.
Disaggregation
- Prefill-decode disaggregation mode. Options: `null` (no disaggregation), `prefill` (prefill server), `decode` (decode server).
- Transfer backend for PD disaggregation. Options: `mooncake`, `nixl`, `fake`.
- InfiniBand device(s) for disaggregation. Format: `mlx5_0` or `mlx5_0,mlx5_1`.
Logging and Monitoring
- Logging level. Options: `debug`, `info`, `warning`, `error`.
- Log all incoming requests and responses.
- Enable Prometheus metrics at the `/metrics` endpoint.
- Show time cost breakdown in responses.
Advanced Options
- Random seed for reproducibility.
- Token interval for streaming responses.
- Watchdog timeout in seconds. Kills the worker if there is no heartbeat.
- Directory for downloading models from Hugging Face.
- Starting GPU ID for multi-GPU setups.
- Enable PyTorch compilation for improved performance.
- Enable the peer-to-peer GPU connectivity check.
- Enable deterministic inference for reproducible outputs.
Model-Specific Options
- Name to serve the model as. Defaults to the model path.
- Custom chat template (Jinja2 format).
- Custom completion template.
- Tool call parser for function calling. Options: `hermes`, `qwen`, `glm`.
- Reasoning parser for o1-style models.
Example Configurations
Small Model (8B)
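The original command for this configuration was not preserved; the following is a hedged sketch of a typical single-GPU launch. The model name and port are illustrative, and flag names can differ between SGLang versions.

```shell
# Launch a small model on a single GPU with default settings
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```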
Large Model with TP (70B)
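A hedged sketch of serving a 70B model with tensor parallelism; the GPU count, model name, and port are illustrative assumptions.

```shell
# Shard the model across 4 GPUs via tensor parallelism
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4 \
  --port 30000
```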
Quantized Model
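A hedged sketch of launching with on-the-fly quantization; the choice of `fp8` and the model name are illustrative (any method listed under Data Types and Quantization applies).

```shell
# Serve with FP8 quantization to reduce the memory footprint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --port 30000
```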
Multi-Node Setup
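A hedged sketch of a two-node launch. Both nodes run the same command except for the rank; the IP address, port, and GPU counts are illustrative assumptions.

```shell
# Node 0 (rank 0), reachable at 192.168.1.100
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 --nnodes 2 --node-rank 0 \
  --dist-init-addr 192.168.1.100:5000

# Node 1 (rank 1), pointing at the same init address
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 --nnodes 2 --node-rank 1 \
  --dist-init-addr 192.168.1.100:5000
```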
Data Parallelism
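A hedged sketch of a data-parallel launch; the replica count and model name are illustrative assumptions.

```shell
# Run 4 independent replicas behind one endpoint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --port 30000
```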
Disaggregated Setup
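A hedged sketch of prefill-decode disaggregation, assuming two separate processes (one per mode) as described in the Disaggregation section; ports, model name, and transfer-backend defaults are illustrative, and the exact bootstrap flags vary by SGLang version.

```shell
# Prefill server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000

# Decode server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001
```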
Production Server
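A hedged sketch of a production-oriented launch combining the HTTP server and monitoring options above; the key placeholder and port are illustrative assumptions.

```shell
# Bind publicly, require an API key, and expose Prometheus metrics
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000 \
  --api-key YOUR_SECRET_KEY \
  --enable-metrics \
  --log-level info
```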
Python API
All arguments can also be passed programmatically when creating an Engine:
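A minimal sketch, assuming SGLang is installed and a GPU is available; keyword names mirror the CLI flags with dashes replaced by underscores, and the model name and sampling parameters here are illustrative.

```python
import sglang as sgl

# Construct an offline engine; kwargs correspond to the server arguments above
engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=1,
    mem_fraction_static=0.85,
)

# Generate text, then release GPU resources
print(engine.generate("The capital of France is", {"max_new_tokens": 8}))
engine.shutdown()
```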
See Also
- Launch Server - Server launch guide
- Native API - Python API usage
- Sampling Parameters - Generation control
