Overview
The sglang serve command launches a server for serving language models or diffusion models. The server type is automatically determined based on the model path, or can be explicitly specified using the --model-type flag.
Basic Usage
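A minimal invocation looks like the following (a sketch; the positional model argument form is an assumption, check sglang serve --help for your version):

```shell
# Launch a server; the server type (language model vs. diffusion)
# is detected automatically from the model path.
sglang serve meta-llama/Llama-2-7b-hf
```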
Required Arguments
Path or name of the model to serve. Can be:
- HuggingFace model ID (e.g., meta-llama/Llama-2-7b-hf)
- Local path to a model directory
- ModelScope model ID (when using SGLANG_USE_MODELSCOPE=1)
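The three model sources can be sketched as follows (the local path is hypothetical, and the ModelScope ID is illustrative; SGLANG_USE_MODELSCOPE comes from the text above):

```shell
# HuggingFace model ID
sglang serve meta-llama/Llama-2-7b-hf

# Local path to a model directory (hypothetical path)
sglang serve /data/models/llama-2-7b

# ModelScope model ID, with ModelScope lookup enabled via the environment
SGLANG_USE_MODELSCOPE=1 sglang serve qwen/Qwen2-7B-Instruct
```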
Server Type Selection
Override automatic model type detection. Options:
- auto: Automatically detect model type (default)
- llm: Force standard language model server
- diffusion: Force diffusion model server
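For example, to bypass detection and force a particular server (model path is hypothetical; the option values come from the list above):

```shell
# Force the diffusion server even if detection would pick the LLM server
sglang serve /data/models/my-model --model-type diffusion

# Explicitly request the standard language model server
sglang serve /data/models/my-model --model-type llm
```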
Model Auto-Detection
SGLang automatically detects whether to launch a standard language model server or a diffusion model server based on:
- For local directories: checks for model_index.json with a _diffusers_version field
- For remote models: attempts to download model_index.json from HuggingFace/ModelScope
- Falls back to the language model server on detection failure
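The local-directory rule can be sketched as a small shell function (an illustration of the documented check, not SGLang's actual implementation):

```shell
# Return "diffusion" if the directory looks like a diffusers pipeline
# (a model_index.json containing a _diffusers_version field), else "llm".
detect_model_type() {
  local dir="$1"
  if [ -f "$dir/model_index.json" ] && grep -q '"_diffusers_version"' "$dir/model_index.json"; then
    echo diffusion
  else
    echo llm
  fi
}
```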
Language Model Server Options
Model and Tokenizer
Path to the tokenizer. Defaults to --model-path if not specified.
Tokenizer mode. Options: auto, slow.
Trust remote code from HuggingFace.
Model loading format. Options: auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private.
Model revision (branch/tag name or commit ID).
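Combining these options (paths are hypothetical; flag names follow SGLang's launch-server conventions and may differ by version):

```shell
# Serve a custom checkpoint with an explicit tokenizer and load format
sglang serve /data/models/my-llm \
  --tokenizer-path /data/models/my-llm-tokenizer \
  --tokenizer-mode auto \
  --load-format safetensors \
  --trust-remote-code \
  --revision main
```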
HTTP Server
Server host address.
Server port.
API key for authentication.
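For example (the API key is a placeholder; flag names follow SGLang's launch-server conventions):

```shell
# Bind to all interfaces on port 30000 and require an API key
sglang serve meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0 --port 30000 --api-key sk-local-example

# Clients then send the key as a Bearer token
curl http://localhost:30000/v1/models -H "Authorization: Bearer sk-local-example"
```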
Quantization
Quantization method. Options: awq, fp8, mxfp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, auto-round, compressed-tensors, modelslim, quark_int4fp8_moe.
Data type for model weights. Options: auto, float16, bfloat16, float32.
Data type for the KV cache. Options: auto, fp8_e4m3, fp8_e5m2, bfloat16.
Memory and Scheduling
Fraction of GPU memory to use for model weights and KV cache.
Maximum total number of tokens in the batch.
Chunk size for chunked prefill. Default varies by GPU memory (2048-16384).
Maximum number of tokens in a prefill batch.
Scheduling policy. Options: fcfs (first-come-first-served).
Parallelism
Tensor parallelism size.
Data parallelism size.
Pipeline parallelism size.
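For example, splitting an 8-GPU node (flag spellings --tp-size and --dp-size follow SGLang's launch-server conventions and may differ by version):

```shell
# Tensor-parallel across 4 GPUs, with 2 data-parallel replicas (4 x 2 = 8 GPUs)
sglang serve meta-llama/Llama-2-70b-hf --tp-size 4 --dp-size 2
```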
Attention Backend
Attention backend. Options: triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend, intel_xpu.
LoRA
Enable LoRA adapters.
Maximum LoRA rank.
Paths to LoRA adapters.
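A sketch (the adapter name and path are hypothetical; flag spellings follow SGLang's LoRA options and may differ by version):

```shell
# Serve a base model with a named LoRA adapter
sglang serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-paths my-adapter=/data/loras/my-adapter \
  --max-lora-rank 64
```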
Speculative Decoding
Speculative decoding algorithm. Options: EAGLE, MEDUSA, STANDALONE, NGRAM.
Path to the draft model for speculative decoding.
Number of speculative decoding steps.
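For example, with EAGLE (the draft model ID is illustrative; flag spellings follow SGLang's speculative-decoding options):

```shell
# EAGLE speculative decoding with a dedicated draft model
sglang serve meta-llama/Llama-2-7b-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path yuhuili/EAGLE-llama2-chat-7B \
  --speculative-num-steps 5
```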
Logging
Logging level. Options: debug, info, warning, error.
Log all requests.
Enable Prometheus metrics.
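For example (the /metrics path assumes the usual Prometheus convention; flag spellings follow SGLang's launch-server options):

```shell
# Verbose logging with request logging and Prometheus metrics enabled
sglang serve meta-llama/Llama-2-7b-hf \
  --log-level debug --log-requests --enable-metrics

# Metrics are then scrapeable over HTTP (default port assumed to be 30000)
curl http://localhost:30000/metrics
```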
Diffusion Model Server Options
When serving diffusion models, additional options are available:
Parallelism
Number of GPUs to use.
Sequence parallelism degree.
Ulysses sequence parallelism degree.
Ring sequence parallelism degree.
Attention
Attention backend for diffusion models.
Cache-DIT configuration for diffusers.
Offloading
Offload DiT model to CPU.
Offload VAE to CPU.
Offload text encoder to CPU.
Backend
Model backend. Options: auto, sglang, diffusers.
Examples
Serve a Language Model
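A standard HuggingFace language model, served on all interfaces (a sketch; see the options sections above for flag details):

```shell
sglang serve meta-llama/Llama-2-7b-hf --host 0.0.0.0 --port 30000
```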
Serve a Diffusion Model
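A diffusers-format model whose model_index.json triggers the diffusion server (the model ID is illustrative; no extra flags are needed because of auto-detection):

```shell
sglang serve stabilityai/stable-diffusion-3.5-large
```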
Advanced Configuration
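A fuller sketch combining options from the sections above (the API key is a placeholder; flag spellings follow SGLang's launch-server conventions and may vary by version):

```shell
sglang serve meta-llama/Llama-2-70b-hf \
  --tp-size 4 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --schedule-policy fcfs \
  --host 0.0.0.0 --port 30000 --api-key sk-local-example
```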
Output
When the server starts successfully, it logs model loading progress and the address it is listening on.
Help
To see all available options, run sglang serve --help.
Related Commands
- sglang generate - Run inference on a multimodal model
- sglang version - Show version information
