Overview
SGLang provides a high-performance inference server that can be launched using the sglang serve command. The server supports various deployment modes, including HTTP, gRPC, and disaggregated prefill-decode architectures.
Basic Usage
Starting the Server
The simplest way to launch a server is a single sglang serve command.
Common Launch Options
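A hedged sketch of a minimal launch and a variant with commonly used options (the model path is a placeholder; flag names follow SGLang's standard launch arguments):

```shell
# Minimal launch (model path is a placeholder)
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct

# With commonly used options: bind address, port, and tensor parallelism
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tp-size 2
```

In earlier releases the equivalent entry point is python -m sglang.launch_server, which accepts the same options.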
Server Modes
HTTP Server (Default)
The default server mode provides OpenAI-compatible HTTP endpoints:
- /v1/chat/completions - Chat completions API
- /v1/completions - Text completions API
- /v1/embeddings - Embeddings generation
- /health - Health check endpoint
- /get_model_info - Model information
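For example, the chat completions endpoint can be exercised with curl once the server is up (default port assumed; the model field may be any served model name):

```shell
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```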
gRPC Server
For lower latency in high-throughput scenarios, the server can also run in gRPC mode.
Disaggregated Prefill-Decode
SGLang supports separating prefill and decode into different instances for optimized resource utilization; the prefill and decode servers are launched as separate processes.
Parallelism Options
Tensor Parallelism
Split the model across multiple GPUs with the --tp-size option.
Pipeline Parallelism
Distribute model layers across GPUs with the --pp-size option.
Data Parallelism
Run multiple replicas for increased throughput with the --dp-size option.
Performance Optimization
Memory Management
--mem-fraction-static: Fraction of GPU memory to allocate for static usage (model weights + KV cache).
The default is calculated automatically from GPU memory and model size.
--max-prefill-tokens: Maximum number of tokens to process in a single prefill batch. Larger values
improve throughput but require more memory.
--max-total-tokens: Maximum total number of tokens in the KV cache pool; limits memory usage.
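Put together, a memory-tuned launch might look like this (the numbers are illustrative, not recommendations):

```shell
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --max-prefill-tokens 16384 \
  --max-total-tokens 65536
```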
CUDA Graph Optimization
CUDA graphs reduce kernel launch overhead.
--cuda-graph-max-bs: Maximum batch size for CUDA graph capture. Higher values allow more requests
to be batched together but require more memory. Set to 0 to disable.
--disable-cuda-graph: Disable CUDA graph optimization entirely.
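A hedged sketch of both settings (the batch size value is illustrative):

```shell
# Capture CUDA graphs for batch sizes up to 32
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --cuda-graph-max-bs 32

# Disable CUDA graphs entirely, e.g. while debugging
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --disable-cuda-graph
```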
Radix Attention Cache
Accelerate requests with shared prefixes.
--disable-radix-cache: Disable the radix attention cache (prefix caching).
--enable-cache-report: Include cache hit rate statistics in API responses.
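For example (the cache-report flag name is an assumption based on SGLang's server arguments):

```shell
# Serve with prefix caching disabled
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

# Or keep the cache and report hit rates in API responses
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --enable-cache-report
```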
Multi-Node Deployment
For distributed inference across multiple machines, launch the server on every node with the --nnodes, --node-rank, and --dist-init-addr options.
Quantization
Reduce memory usage with quantization:
- fp8 - FP8 quantization for reduced memory
- awq - Activation-aware Weight Quantization
- gptq - GPTQ quantization
- marlin - Marlin sparse format
- bitsandbytes - 8-bit and 4-bit quantization
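For example, FP8 quantization might be enabled like this (the chosen method must be supported by the model and hardware):

```shell
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --quantization fp8
```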
Monitoring and Logging
Enable Metrics
Pass --enable-metrics to expose server metrics at http://localhost:30000/metrics in Prometheus format.
Request Logging
--log-requests: Log all incoming requests and responses.
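A combined monitoring launch might look like this (flag names assumed from SGLang's server arguments):

```shell
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-metrics \
  --log-requests
```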
--log-level: Set logging verbosity. Options: debug, info, warning, error.
Health Checks and Warmup
Server Warmup
By default, the server runs warmup requests to initialize CUDA graphs and caches.
--skip-server-warmup: Skip the warmup phase on server startup.
--warmups: Specify custom warmup functions (comma-separated) to run before the server starts.
Example:
--warmups=warmup_name1,warmup_name2
Health Endpoint
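A quick check once the server is up (default port assumed):

```shell
# Responds with HTTP 200 when the server is ready
curl -i http://localhost:30000/health
```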
Check server health with a simple GET request to the /health endpoint.
Environment Variables
SGLang respects several environment variables:
- CUDA_VISIBLE_DEVICES - Control which GPUs are used
- NCCL_SOCKET_IFNAME - Network interface for multi-node communication
- SGLANG_USE_MODELSCOPE - Download models from ModelScope instead of HuggingFace
- HF_TOKEN - HuggingFace authentication token for gated models
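For example, restricting the server to two GPUs while authenticating to HuggingFace (the token placeholder is left as-is):

```shell
CUDA_VISIBLE_DEVICES=0,1 HF_TOKEN=<your-token> \
  sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct --tp-size 2
```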
Python API
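A hedged sketch of a programmatic launch; the exact module paths for ServerArgs and launch_server vary across SGLang versions, so treat this as an outline rather than a drop-in script:

```python
# Sketch only: import paths differ between SGLang versions
from sglang.srt.server_args import ServerArgs
from sglang.srt.entrypoints.http_server import launch_server

if __name__ == "__main__":
    server_args = ServerArgs(
        model_path="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        host="0.0.0.0",
        port=30000,
    )
    # Blocks until the server process is stopped
    launch_server(server_args)
```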
You can also launch the server programmatically from Python.
See Also
- Server Arguments - Complete reference of all server arguments
- OpenAI Compatible API - HTTP API documentation
- Sampling Parameters - Control generation behavior
