This guide helps you migrate between major versions of SGLang and understand breaking changes.
## Overview
SGLang follows semantic versioning (MAJOR.MINOR.PATCH):
- Major versions: Breaking changes that require code modifications
- Minor versions: New features with backward compatibility
- Patch versions: Bug fixes with backward compatibility
## Migrating to v0.5.x

### Environment Variables

Several environment variables have been deprecated in favor of CLI flags. They will be removed in v0.5.7+, so migrate to the corresponding flags:

| Deprecated Env Var | Replacement CLI Flag |
|---|---|
| `SGLANG_ENABLE_FLASHINFER_FP8_GEMM` | `--fp8-gemm-backend=flashinfer_trtllm` |
| `SGLANG_ENABLE_FLASHINFER_GEMM` | `--fp8-gemm-backend=flashinfer_trtllm` |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | `--fp8-gemm-backend=cutlass` |
| `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` | `--fp4-gemm-backend` |
| `SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE` | `--enable-prefill-delayer` |
| `SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES` | `--prefill-delayer-max-delay-passes` |
| `SGLANG_PREFILL_DELAYER_TOKEN_USAGE_LOW_WATERMARK` | `--prefill-delayer-token-usage-low-watermark` |
Before:

```bash
export SGLANG_ENABLE_FLASHINFER_FP8_GEMM=true
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```

After:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --fp8-gemm-backend flashinfer_trtllm
```
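The table above can double as a preflight check before upgrading. A minimal sketch (the variable-to-flag mapping is copied from the table; this is not an official SGLang utility):

```python
import os

# Deprecated v0.5.x env vars and their replacement CLI flags,
# copied from the migration table above.
DEPRECATED_ENV_VARS = {
    "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": "--fp8-gemm-backend=flashinfer_trtllm",
    "SGLANG_ENABLE_FLASHINFER_GEMM": "--fp8-gemm-backend=flashinfer_trtllm",
    "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8": "--fp8-gemm-backend=cutlass",
    "SGLANG_FLASHINFER_FP4_GEMM_BACKEND": "--fp4-gemm-backend",
    "SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE": "--enable-prefill-delayer",
    "SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES": "--prefill-delayer-max-delay-passes",
    "SGLANG_PREFILL_DELAYER_TOKEN_USAGE_LOW_WATERMARK": "--prefill-delayer-token-usage-low-watermark",
}

def find_deprecated(environ=None):
    """Return (env_var, replacement_flag) pairs for deprecated vars that are set."""
    environ = os.environ if environ is None else environ
    return [(var, flag) for var, flag in DEPRECATED_ENV_VARS.items() if var in environ]

if __name__ == "__main__":
    for var, flag in find_deprecated():
        print(f"{var} is deprecated; pass {flag} to launch_server instead")
```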
### Timeout Configuration

Timeout environment variables have changed from milliseconds to seconds:

| Old (milliseconds) | New (seconds) |
|---|---|
| `SGLANG_QUEUED_TIMEOUT_MS` | `SGLANG_REQ_WAITING_TIMEOUT` |
| `SGLANG_FORWARD_TIMEOUT_MS` | `SGLANG_REQ_RUNNING_TIMEOUT` |

Before:

```bash
export SGLANG_QUEUED_TIMEOUT_MS=300000  # 5 minutes in ms
```

After:

```bash
export SGLANG_REQ_WAITING_TIMEOUT=300  # 5 minutes in seconds
```
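When converting existing values, rounding up avoids accidentally shortening a timeout. A small sketch of the conversion (the helper name is ours, not part of SGLang):

```python
import math

def ms_to_seconds(ms_value: str) -> str:
    """Convert a millisecond timeout string to whole seconds, rounding up
    so the migrated timeout is never shorter than the original."""
    return str(math.ceil(int(ms_value) / 1000))

# SGLANG_QUEUED_TIMEOUT_MS=300000 -> SGLANG_REQ_WAITING_TIMEOUT=300
```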
### Prefix Migration: SGL_ to SGLANG_

All `SGL_`-prefixed environment variables are deprecated in favor of `SGLANG_`:

Before:

```bash
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=true
```

After:

```bash
export SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK=false
```

Note that this particular variable was also renamed from DISABLE to ENABLE, so its value is inverted. The old `SGL_` prefix still works but emits deprecation warnings.
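Because the rename is largely mechanical, it can be sketched in a few lines (illustrative only; as the example above shows, some variables changed more than their prefix, so review the result manually):

```python
def migrate_sgl_prefix(environ):
    """Return a copy of `environ` with SGL_-prefixed keys renamed to SGLANG_.
    Keys that already use the new prefix win on conflict."""
    migrated = {}
    for key, value in environ.items():
        if key.startswith("SGL_"):
            # setdefault so an existing SGLANG_ key is never overwritten
            migrated.setdefault("SGLANG_" + key[len("SGL_"):], value)
        else:
            migrated[key] = value
    return migrated
```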
## Migrating to v0.4.x

### Deterministic Inference

A new deterministic inference mode was introduced. If you need reproducible results:

Before (v0.3.x):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache
```

After (v0.4.x):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-deterministic-inference
```

See the blog post for details.
### MoE Backend Changes

The `SGLANG_CUTLASS_MOE` environment variable is deprecated:

Before:

```bash
export SGLANG_CUTLASS_MOE=true
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3
```

After:

```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --moe-runner-backend cutlass
```
## Migrating from Other Frameworks

### From vLLM

SGLang provides a similar offline API to vLLM with enhanced performance:

vLLM:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Tell me a joke"],
    SamplingParams(temperature=0.7, max_tokens=100),
)
```

SGLang:

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Tell me a joke"],
    {"temperature": 0.7, "max_new_tokens": 100},  # sampling params are a plain dict
)
```
#### Key Differences from vLLM
- Prefix Caching: SGLang uses RadixAttention by default (more efficient)
- Chunked Prefill: Different default chunk sizes
- Memory Management: Different memory fraction defaults
- API Compatibility: SGLang is OpenAI-compatible but has additional features
### From Text Generation Inference (TGI)

TGI uses a Docker-based approach, while SGLang can run directly:

TGI:

```bash
docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```

SGLang:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 8080
```
### From LiteLLM

LiteLLM is a proxy/router, while SGLang is an inference engine. You can use LiteLLM with SGLang:

```python
import litellm

# Point LiteLLM at the SGLang OpenAI-compatible endpoint
response = litellm.completion(
    model="openai/meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:30000/v1",
)
```
## Breaking Changes by Version

### v0.5.0

- Environment variable prefix changes (SGL_ → SGLANG_)
- Timeout units changed from milliseconds to seconds
- Several FP8/quantization env vars deprecated in favor of CLI flags
- Memory pool configuration changes

### v0.4.0

- Introduction of deterministic inference mode
- MoE backend configuration moved to CLI flags
- FlashInfer becomes the default attention backend
- Changes to RadixAttention cache behavior

### v0.3.0

- Initial support for DeepSeek V3
- New multi-node deployment options
- Changes to expert parallelism configuration
## Best Practices for Migration

### 1. Test in Staging First

Always test new versions in a staging environment before production deployment.

### 2. Review Deprecation Warnings

Pay attention to deprecation warnings in logs:

```bash
python -m sglang.launch_server --model-path YOUR_MODEL 2>&1 | grep -i "deprecat"
```

### 3. Pin Versions in Production

Use exact versions in your requirements:

```
sglang==0.5.6  # Not sglang>=0.5.0
```

### 4. Check Release Notes

Always review release notes before upgrading.

### 5. Update Configuration Files

If you use configuration files, update them according to the new format:

```python
# config.py - Before
config = {
    "env": {
        "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": "true"
    }
}

# config.py - After
config = {
    "args": [
        "--fp8-gemm-backend", "flashinfer_trtllm"
    ]
}
```
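If you have many config files, the same translation can be scripted. A hedged sketch (the env-to-flag mapping is an assumption drawn from the table earlier in this guide; extend it for the variables you actually use):

```python
# Maps legacy boolean env vars to replacement CLI argument pairs
# (taken from the v0.5.x migration table earlier in this guide).
ENV_TO_ARGS = {
    "SGLANG_ENABLE_FLASHINFER_FP8_GEMM": ["--fp8-gemm-backend", "flashinfer_trtllm"],
    "SGLANG_SUPPORT_CUTLASS_BLOCK_FP8": ["--fp8-gemm-backend", "cutlass"],
}

def env_config_to_args(env: dict) -> list:
    """Translate a legacy {"env": {...}} config mapping into an "args" list."""
    args = []
    for var, value in env.items():
        if var in ENV_TO_ARGS and str(value).lower() in ("1", "true", "yes"):
            args.extend(ENV_TO_ARGS[var])
    return args
```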
### 6. Monitor Key Metrics

After migration, monitor key metrics:
- Throughput (requests/second)
- Latency (p50, p95, p99)
- GPU memory usage
- Error rates
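For the latency percentiles, Python's standard library is enough for a quick offline check of collected request timings (a sketch; production setups should rely on your metrics stack instead):

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from a list of per-request latencies (milliseconds)."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```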
See Observability for monitoring setup.
## Backward Compatibility
SGLang maintains backward compatibility within minor versions:
- 0.5.0 → 0.5.6: Fully compatible
- 0.4.x → 0.5.x: Deprecation warnings, but works
- 0.3.x → 0.5.x: May require configuration updates
## Getting Help with Migration
If you encounter issues during migration:
- Check migration issues: search GitHub Issues with the label `migration`
- Ask in Slack: Join https://slack.sglang.io/ and ask in #general or #help
- Consult documentation: Check version-specific docs
- Report problems: File an issue with your migration scenario
## See Also