
Overview

SGLang provides a high-performance inference server that can be launched using the sglang serve command. The server supports several deployment modes, including HTTP, gRPC, and disaggregated prefill-decode serving.

Basic Usage

Starting the Server

The simplest way to launch a server:
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct

Common Launch Options

sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tp-size 2 \
  --mem-fraction-static 0.8

Server Modes

HTTP Server (Default)

The default server mode provides OpenAI-compatible HTTP endpoints:
sglang serve --model-path meta-llama/Llama-3.1-8B-Instruct
Available Endpoints:
  • /v1/chat/completions - Chat completions API
  • /v1/completions - Text completions API
  • /v1/embeddings - Embeddings generation
  • /health - Health check endpoint
  • /get_model_info - Model information
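The /v1/chat/completions endpoint follows the OpenAI API schema, so any OpenAI-compatible client works. As a minimal sketch using only the standard library (the base URL, model name, and message content below are placeholders, not values required by SGLang):

```python
import json
from urllib import request

def chat_completion(base_url: str, payload: dict) -> dict:
    """POST a request to an OpenAI-compatible /v1/chat/completions endpoint."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Typical request payload shape for the chat API:
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.8,
}
# chat_completion("http://localhost:30000", payload)
```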

gRPC Server

For lower latency in high-throughput scenarios:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --grpc-mode

Disaggregated Prefill-Decode

SGLang supports separating prefill and decode into different instances for optimized resource utilization.

Prefill Server:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake
Decode Server:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake

Parallelism Options

Tensor Parallelism

Split model across multiple GPUs:
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 4

Pipeline Parallelism

Distribute model layers across GPUs:
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 2 \
  --pp-size 2

Data Parallelism

Run multiple replicas for increased throughput:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp-size 4 \
  --load-balance-method round_robin
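The round_robin policy cycles requests across replicas in order. A toy illustration of the idea (not SGLang's internal router implementation):

```python
from itertools import cycle

workers = ["replica-0", "replica-1", "replica-2", "replica-3"]  # dp_size = 4
rr = cycle(workers)

def dispatch() -> str:
    # Each incoming request goes to the next replica in turn.
    return next(rr)

assignments = [dispatch() for _ in range(6)]
# The first six requests land on replicas 0, 1, 2, 3, 0, 1 in order.
```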

Performance Optimization

Memory Management

mem-fraction-static
float
default:"auto"
Fraction of GPU memory to allocate for static usage (model weights + KV cache). Default is automatically calculated based on GPU memory and model size.
chunked-prefill-size
int
default:"auto"
Maximum number of tokens to process in a single prefill batch. Larger values improve throughput but require more memory.
max-total-tokens
int
default:"null"
Maximum total number of tokens in the KV cache pool. Limits memory usage.
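A back-of-envelope estimate helps when choosing max-total-tokens: each cached token stores a K and a V vector per layer. The model figures below are illustrative values for Llama-3.1-8B (32 layers, 8 KV heads with GQA, head dimension 128, fp16), not numbers reported by SGLang itself:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V are each [kv_heads, head_dim] per layer, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative figures for Llama-3.1-8B in fp16:
per_token = kv_bytes_per_token(32, 8, 128)   # 131072 bytes = 128 KiB per token
tokens_in_8gib = (8 * 1024**3) // per_token  # 65536 tokens fit in 8 GiB of KV cache
```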

CUDA Graph Optimization

CUDA graphs reduce kernel launch overhead:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --cuda-graph-max-bs 256
cuda-graph-max-bs
int
default:"auto"
Maximum batch size for CUDA graph capture. Higher values enable batching more requests but require more memory. Set to 0 to disable.
disable-cuda-graph
bool
default:"false"
Disable CUDA graph optimization entirely.

Radix Attention Cache

The radix cache, enabled by default, accelerates requests that share a common prefix. To include cache statistics in API responses:
sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-cache-report
disable-radix-cache
bool
default:"false"
Disable the radix attention cache (prefix caching).
enable-cache-report
bool
default:"false"
Include cache hit rate statistics in API responses.
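The benefit of prefix caching comes from skipping prefill for the shared leading tokens of a request. A toy longest-common-prefix computation illustrates what counts as a hit (SGLang's actual radix tree generalizes this across many cached requests):

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the longest common token prefix — the part whose KV cache can be reused."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6]       # token IDs of a previously served request
incoming = [1, 2, 3, 4, 9, 9, 9]  # new request sharing a 4-token prefix
hit = shared_prefix_len(cached, incoming)  # 4 tokens skip prefill
```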

Multi-Node Deployment

For distributed serving across multiple machines:
# Node 0 (rank 0)
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr 192.168.1.100:5000

# Node 1 (rank 1)
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp-size 8 \
  --nnodes 2 \
  --node-rank 1 \
  --dist-init-addr 192.168.1.100:5000

Quantization

Reduce memory usage with quantization:
# FP8 Quantization
sglang serve \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8

# AWQ 4-bit Quantization
sglang serve \
  --model-path TheBloke/Llama-2-13B-AWQ \
  --quantization awq

# GPTQ Quantization
sglang serve \
  --model-path TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq
Supported quantization methods:
  • fp8 - FP8 quantization for reduced memory
  • awq - Activation-aware Weight Quantization
  • gptq - GPTQ quantization
  • marlin - Marlin kernel format for 4-bit quantized weights
  • bitsandbytes - 8-bit and 4-bit quantization
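A rough estimate of weight memory under each method clarifies the trade-off. The bit widths below are the nominal ones (fp16 = 16, fp8 = 8, AWQ/GPTQ ≈ 4) and ignore per-group scales and activation memory, so treat the numbers as ballpark figures:

```python
def weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for a model with num_params parameters."""
    return num_params * bits / 8 / 1024**3

params_70b = 70e9
fp16 = weight_gib(params_70b, 16)  # ~130 GiB
fp8  = weight_gib(params_70b, 8)   # ~65 GiB
int4 = weight_gib(params_70b, 4)   # ~33 GiB
```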

Monitoring and Logging

Enable Metrics

sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-metrics
Metrics are exposed at http://localhost:30000/metrics in Prometheus format.
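The Prometheus text format is plain `name value` lines, optionally preceded by `# HELP` / `# TYPE` comments, so it is easy to scrape ad hoc. A minimal parser sketch (the metric names in the sample are made up for illustration; labels are not handled):

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple `name value` lines from Prometheus text format, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP example_running_requests Number of running requests
example_running_requests 3
example_cache_hit_rate 0.42
"""
m = parse_prometheus(sample)
```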

Request Logging

sglang serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --log-requests \
  --log-level info
log-requests
bool
default:"false"
Log all incoming requests and responses.
log-level
string
default:"info"
Set logging verbosity. Options: debug, info, warning, error.

Health Checks and Warmup

Server Warmup

By default, the server runs warmup requests to initialize CUDA graphs and caches:
skip-server-warmup
bool
default:"false"
Skip the warmup phase on server startup.
warmups
string
default:"null"
Specify custom warmup functions (comma-separated) to run before server starts. Example: --warmups=warmup_name1,warmup_name2

Health Endpoint

Check server health:
curl http://localhost:30000/health

Environment Variables

SGLang respects several environment variables:
  • CUDA_VISIBLE_DEVICES - Control which GPUs are used
  • NCCL_SOCKET_IFNAME - Network interface for multi-node communication
  • SGLANG_USE_MODELSCOPE - Download models from ModelScope instead of HuggingFace
  • HF_TOKEN - HuggingFace authentication token for gated models
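These variables can also be set programmatically before launching the server as a subprocess. A sketch (the GPU indices and token string are placeholders):

```python
import os

# Copy the current environment and customize it for the server process.
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0,1"  # restrict the server to the first two GPUs
env["HF_TOKEN"] = "<your-token>"     # placeholder, required only for gated models

# The customized environment would then be passed to the launch command, e.g.:
# subprocess.run(["sglang", "serve", "--model-path", model_path], env=env)
```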

Python API

You can also launch the server programmatically:
from sglang import Engine

engine = Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=2,
    mem_fraction_static=0.8
)

# Use the engine
response = engine.generate(
    prompt="Hello, how are you?",
    sampling_params={"temperature": 0.8, "max_new_tokens": 128}
)

print(response["text"])

See Also