Overview
The sglang serve command launches a server for serving language models or diffusion models. The server type is automatically determined based on the model path, or can be explicitly specified using the --model-type flag.
Basic Usage
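A minimal invocation looks like the following (a sketch; the positional model argument form is an assumption, check sglang serve --help for your version):

```shell
# Launch a server; the server type (language model vs. diffusion)
# is detected automatically from the model path.
sglang serve meta-llama/Llama-2-7b-hf
```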
Required Arguments
Path or name of the model to serve. Can be:
- HuggingFace model ID (e.g., meta-llama/Llama-2-7b-hf)
- Local path to a model directory
- ModelScope model ID (when using SGLANG_USE_MODELSCOPE=1)
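The three model sources can be sketched as follows (the local path is hypothetical, and the ModelScope ID is illustrative; SGLANG_USE_MODELSCOPE comes from the text above):

```shell
# HuggingFace model ID
sglang serve meta-llama/Llama-2-7b-hf

# Local path to a model directory (hypothetical path)
sglang serve /data/models/llama-2-7b

# ModelScope model ID, with ModelScope lookup enabled via the environment
SGLANG_USE_MODELSCOPE=1 sglang serve qwen/Qwen2-7B-Instruct
```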
Server Type Selection
Override automatic model type detection. Options:
- auto: Automatically detect model type (default)
- llm: Force standard language model server
- diffusion: Force diffusion model server
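For example, to bypass detection and force a particular server (model path is hypothetical; the option values come from the list above):

```shell
# Force the diffusion server even if detection would pick the LLM server
sglang serve /data/models/my-model --model-type diffusion

# Explicitly request the standard language model server
sglang serve /data/models/my-model --model-type llm
```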
Model Auto-Detection
SGLang automatically detects whether to launch a standard language model server or a diffusion model server based on:
- For local directories: checks for model_index.json with a _diffusers_version field
- For remote models: attempts to download model_index.json from HuggingFace/ModelScope
- Falls back to the language model server on detection failure
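The local-directory rule can be sketched as a small shell function (an illustration of the documented check, not SGLang's actual implementation):

```shell
# Return "diffusion" if the directory looks like a diffusers pipeline
# (a model_index.json containing a _diffusers_version field), else "llm".
detect_model_type() {
  local dir="$1"
  if [ -f "$dir/model_index.json" ] && grep -q '"_diffusers_version"' "$dir/model_index.json"; then
    echo diffusion
  else
    echo llm
  fi
}
```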
Language Model Server Options
Model and Tokenizer
Path to the tokenizer. Defaults to --model-path if not specified.
Tokenizer mode. Options: auto, slow.
Trust remote code from HuggingFace.
Model loading format. Options: auto, pt, safetensors, npcache, dummy, sharded_state, gguf, bitsandbytes, layered, flash_rl, remote, remote_instance, fastsafetensors, private.
Model revision (branch/tag name or commit ID).
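Combining these options (paths are hypothetical; flag names follow SGLang's launch-server conventions and may differ by version):

```shell
# Serve a custom checkpoint with an explicit tokenizer and load format
sglang serve /data/models/my-llm \
  --tokenizer-path /data/models/my-llm-tokenizer \
  --tokenizer-mode auto \
  --load-format safetensors \
  --trust-remote-code \
  --revision main
```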
HTTP Server
Server host address.
Server port.
API key for authentication.
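For example (the API key is a placeholder; flag names follow SGLang's launch-server conventions):

```shell
# Bind to all interfaces on port 30000 and require an API key
sglang serve meta-llama/Llama-2-7b-hf \
  --host 0.0.0.0 --port 30000 --api-key sk-local-example

# Clients then send the key as a Bearer token
curl http://localhost:30000/v1/models -H "Authorization: Bearer sk-local-example"
```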
Quantization
Quantization method. Options: awq, fp8, mxfp8, gptq, marlin, gptq_marlin, awq_marlin, bitsandbytes, gguf, modelopt, modelopt_fp8, modelopt_fp4, petit_nvfp4, w8a8_int8, w8a8_fp8, moe_wna16, qoq, w4afp8, mxfp4, auto-round, compressed-tensors, modelslim, quark_int4fp8_moe.
Data type for model weights. Options: auto, float16, bfloat16, float32.
Data type for the KV cache. Options: auto, fp8_e4m3, fp8_e5m2, bfloat16.
Memory and Scheduling
Fraction of GPU memory to use for model weights and KV cache.
Maximum total number of tokens in the batch.
Chunk size for chunked prefill. Default varies by GPU memory (2048-16384).
Maximum number of tokens in a prefill batch.
Scheduling policy. Options: fcfs (first-come-first-served).
Parallelism
Tensor parallelism size.
Data parallelism size.
Pipeline parallelism size.
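For example, splitting an 8-GPU node (flag spellings --tp-size and --dp-size follow SGLang's launch-server conventions and may differ by version):

```shell
# Tensor-parallel across 4 GPUs, with 2 data-parallel replicas (4 x 2 = 8 GPUs)
sglang serve meta-llama/Llama-2-70b-hf --tp-size 4 --dp-size 2
```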
Attention Backend
Attention backend. Options: triton, torch_native, flex_attention, nsa, cutlass_mla, fa3, fa4, flashinfer, flashmla, trtllm_mla, trtllm_mha, dual_chunk_flash_attn, aiter, wave, intel_amx, ascend, intel_xpu.
LoRA
Enable LoRA adapters.
Maximum LoRA rank.
Paths to LoRA adapters.
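A sketch (the adapter name and path are hypothetical; flag spellings follow SGLang's LoRA options and may differ by version):

```shell
# Serve a base model with a named LoRA adapter
sglang serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-paths my-adapter=/data/loras/my-adapter \
  --max-lora-rank 64
```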
Speculative Decoding
Speculative decoding algorithm. Options: EAGLE, MEDUSA, STANDALONE, NGRAM.
Path to the draft model for speculative decoding.
Number of speculative decoding steps.
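For example, with EAGLE (the draft model ID is illustrative; flag spellings follow SGLang's speculative-decoding options):

```shell
# EAGLE speculative decoding with a dedicated draft model
sglang serve meta-llama/Llama-2-7b-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path yuhuili/EAGLE-llama2-chat-7B \
  --speculative-num-steps 5
```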
Logging
Logging level. Options: debug, info, warning, error.
Log all requests.
Enable Prometheus metrics.
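For example (the /metrics path assumes the usual Prometheus convention; flag spellings follow SGLang's launch-server options):

```shell
# Verbose logging with request logging and Prometheus metrics enabled
sglang serve meta-llama/Llama-2-7b-hf \
  --log-level debug --log-requests --enable-metrics

# Metrics are then scrapeable over HTTP (default port assumed to be 30000)
curl http://localhost:30000/metrics
```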
Diffusion Model Server Options
When serving diffusion models, additional options are available:
Parallelism
Number of GPUs to use.
Sequence parallelism degree.
Ulysses sequence parallelism degree.
Ring sequence parallelism degree.
Attention
Attention backend for diffusion models.
Cache-DIT configuration for diffusers.
Offloading
Offload DiT model to CPU.
Offload VAE to CPU.
Offload text encoder to CPU.
Backend
Model backend. Options: auto, sglang, diffusers.
Examples
Serve a Language Model
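A standard HuggingFace language model, served on all interfaces (a sketch; see the options sections above for flag details):

```shell
sglang serve meta-llama/Llama-2-7b-hf --host 0.0.0.0 --port 30000
```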
Serve a Diffusion Model
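A diffusers-format model whose model_index.json triggers the diffusion server (the model ID is illustrative; no extra flags are needed because of auto-detection):

```shell
sglang serve stabilityai/stable-diffusion-3.5-large
```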
Advanced Configuration
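A fuller sketch combining options from the sections above (the API key is a placeholder; flag spellings follow SGLang's launch-server conventions and may vary by version):

```shell
sglang serve meta-llama/Llama-2-70b-hf \
  --tp-size 4 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --schedule-policy fcfs \
  --host 0.0.0.0 --port 30000 --api-key sk-local-example
```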
Output
When the server starts successfully, it logs model loading progress and the address it is listening on.
Help
To see all available options, run sglang serve --help.
Related Commands
- sglang generate - Run inference on a multimodal model
- sglang version - Show version information
