Overview

The sglang serve command launches an HTTP server for language models or diffusion models. The server type is determined automatically from the model path, or can be set explicitly with the --model-type flag.

Basic Usage

sglang serve --model-path <model-name-or-path> [options]

Required Arguments

--model-path
string
required
Path or name of the model to serve. Can be:
  • HuggingFace model ID (e.g., meta-llama/Llama-2-7b-hf)
  • Local path to model directory
  • ModelScope model ID (when using SGLANG_USE_MODELSCOPE=1)

Server Type Selection

--model-type
string
default:"auto"
Override automatic model type detection. Options:
  • auto: Automatically detect model type (default)
  • llm: Force standard language model server
  • diffusion: Force diffusion model server

Model Auto-Detection

SGLang automatically decides whether to launch a language model server or a diffusion model server based on:
  1. For local directories: checks for a model_index.json file containing a _diffusers_version field
  2. For remote models: attempts to download model_index.json from HuggingFace/ModelScope
  3. Falls back to the language model server if detection fails
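The local-directory branch of this logic can be sketched as follows. This is a minimal illustration of the rules above, not SGLang's actual implementation; the function name detect_model_type is hypothetical:

```python
import json
from pathlib import Path

def detect_model_type(model_path: str) -> str:
    """Guess the server type for a local model directory.

    A model_index.json containing a _diffusers_version field marks a
    diffusion model; anything else falls back to "llm".
    """
    index_file = Path(model_path) / "model_index.json"
    if index_file.is_file():
        try:
            index = json.loads(index_file.read_text())
        except (OSError, json.JSONDecodeError):
            return "llm"  # unreadable index: fall back to the LLM server
        if isinstance(index, dict) and "_diffusers_version" in index:
            return "diffusion"
    return "llm"
```

For remote models the same check would run against a model_index.json downloaded from HuggingFace or ModelScope.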

Language Model Server Options

Model and Tokenizer

--tokenizer-path
string
Path to the tokenizer. Defaults to --model-path if not specified.
--tokenizer-mode
string
default:"auto"
Tokenizer mode. Options: auto, slow.
--trust-remote-code
boolean
default:"false"
Trust remote code from HuggingFace.
--load-format
string
default:"auto"
Model loading format. Options: auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private.
--revision
string
Model revision (branch/tag name or commit ID).

HTTP Server

--host
string
default:"127.0.0.1"
Server host address.
--port
integer
default:"30000"
Server port.
--api-key
string
API key for authentication.

Quantization

--quantization
string
Quantization method. Options: awq, fp8, mxfp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, auto-round, compressed-tensors, modelslim, quark_int4fp8_moe.
--dtype
string
default:"auto"
Data type for model weights. Options: auto, float16, bfloat16, float32.
--kv-cache-dtype
string
default:"auto"
Data type for KV cache. Options: auto, fp8_e4m3, fp8_e5m2, bfloat16.

Memory and Scheduling

--mem-fraction-static
float
Fraction of GPU memory to use for model weights and KV cache.
--max-total-tokens
integer
Maximum total number of tokens in the batch.
--chunked-prefill-size
integer
Chunk size for chunked prefill. Default varies by GPU memory (2048-16384).
--max-prefill-tokens
integer
default:"16384"
Maximum number of tokens in a prefill batch.
--schedule-policy
string
default:"fcfs"
Scheduling policy. Options: fcfs (first-come, first-served).

Parallelism

--tp-size
integer
default:"1"
Tensor parallelism size.
--dp-size
integer
default:"1"
Data parallelism size.
--pp-size
integer
default:"1"
Pipeline parallelism size.

Attention Backend

--attention-backend
string
Attention backend. Options: triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend, intel_xpu.

LoRA

--enable-lora
boolean
Enable LoRA adapters.
--max-lora-rank
integer
Maximum LoRA rank.
--lora-paths
string
Paths to LoRA adapters.

Speculative Decoding

--speculative-algorithm
string
Speculative decoding algorithm. Options: EAGLE, MEDUSA, STANDALONE, NGRAM.
--speculative-draft-model-path
string
Path to the draft model for speculative decoding.
--speculative-num-steps
integer
Number of speculative decoding steps.

Logging

--log-level
string
default:"info"
Logging level. Options: debug, info, warning, error.
--log-requests
boolean
default:"false"
Log all requests.
--enable-metrics
boolean
default:"false"
Enable Prometheus metrics.

Diffusion Model Server Options

When serving diffusion models, additional options are available:

Parallelism

--num-gpus
integer
default:"1"
Number of GPUs to use.
--sp-degree
integer
Sequence parallelism degree.
--ulysses-degree
integer
Ulysses sequence parallelism degree.
--ring-degree
integer
Ring sequence parallelism degree.

Attention

--attention-backend
string
Attention backend for diffusion models.
--cache-dit-config
string
Cache-DIT configuration for diffusers.

Offloading

--dit-cpu-offload
boolean
Offload DiT model to CPU.
--vae-cpu-offload
boolean
Offload VAE to CPU.
--text-encoder-cpu-offload
boolean
Offload text encoder to CPU.

Backend

--backend
string
default:"auto"
Model backend. Options: auto, sglang, diffusers.

Examples

Serve a Language Model

# Basic usage
sglang serve --model-path meta-llama/Llama-2-7b-hf

# With custom port and tensor parallelism
sglang serve --model-path meta-llama/Llama-2-70b-hf \
  --port 8080 \
  --tp-size 4

# With quantization
sglang serve --model-path meta-llama/Llama-2-7b-hf \
  --quantization awq

# With LoRA adapters
sglang serve --model-path meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --max-lora-rank 64

Serve a Diffusion Model

# Basic diffusion model serving
sglang serve --model-path stabilityai/stable-diffusion-xl-base-1.0

# With custom parallelism
sglang serve --model-path stabilityai/stable-diffusion-xl-base-1.0 \
  --num-gpus 4 \
  --sp-degree 2

# Force diffusion model type
sglang serve --model-path custom/model \
  --model-type diffusion

Advanced Configuration

# High-performance setup with chunked prefill and CUDA graphs
sglang serve --model-path meta-llama/Llama-2-70b-hf \
  --tp-size 8 \
  --chunked-prefill-size 8192 \
  --max-total-tokens 32768 \
  --enable-metrics

# With speculative decoding
sglang serve --model-path meta-llama/Llama-2-70b-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path meta-llama/Llama-2-7b-hf \
  --speculative-num-steps 5

Output

When the server starts successfully, you’ll see output similar to:
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
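Once the server is up, it can be queried over HTTP. The sketch below only constructs an OpenAI-style chat request payload; the /v1/chat/completions route and field names are assumed to match SGLang's OpenAI-compatible API, so verify them against your SGLang version:

```python
import json

# Hypothetical chat request for the server's OpenAI-compatible endpoint.
# The model name here is an example, not a requirement.
payload = {
    "model": "meta-llama/Llama-2-7b-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload)

# With a running server on the default host/port, send it with curl:
#   curl http://127.0.0.1:30000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$BODY"
```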

Help

To see all available options:
sglang serve --help
Note: because the available options depend on the model type, pass a model path to see the type-specific options:
sglang serve --model-path meta-llama/Llama-2-7b-hf --help