SGLang
High-performance serving framework for large language and multimodal models
Achieve low-latency, high-throughput inference from a single GPU to large distributed clusters with RadixAttention, continuous batching, and advanced optimizations.
Quick start
Get SGLang running in minutes
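For a first taste of the API, here is a minimal offline-inference sketch using the Python Engine interface (the model path and sampling values are illustrative); the same model can also be served over HTTP with `python -m sglang.launch_server`.

```python
import sglang as sgl

# Load a model into the local runtime (model path is illustrative).
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "The capital of France is",
    "Explain continuous batching in one sentence:",
]
sampling_params = {"temperature": 0.7, "max_new_tokens": 64}

# generate() batches the prompts through the runtime and returns one
# result per prompt; the completion is under the "text" key.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```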
Key features
Everything you need for high-performance LLM serving
RadixAttention
Efficient prefix caching that automatically reuses KV cache across requests
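RadixAttention needs no user-side configuration. As a sketch of where it pays off (model path and file name are illustrative): two requests that share a long prefix, where the second prefill is largely served from the KV cache built by the first.

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative

document = open("report.txt").read()  # hypothetical long shared context

# Both prompts begin with the same document. The KV cache is kept in a
# radix tree keyed by token prefixes, so the second request reuses the
# first request's prefill instead of recomputing it.
out_a = llm.generate(document + "\n\nSummarize the report.", {"max_new_tokens": 128})
out_b = llm.generate(document + "\n\nList three risks it mentions.", {"max_new_tokens": 128})
```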
Continuous batching
Zero-overhead CPU scheduler for optimal GPU utilization
Distributed serving
Combine tensor, pipeline, expert, and data parallelism
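A hedged sketch of setting parallel degrees through the Engine constructor, assuming it forwards the corresponding server arguments (`tp_size`, `dp_size`); the degrees and model path are illustrative.

```python
import sglang as sgl

# Shard the model across 4 GPUs with tensor parallelism and run 2
# data-parallel replicas (8 GPUs total). The equivalent server flags
# are --tp-size 4 --dp-size 2.
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    tp_size=4,
    dp_size=2,
)
```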
Structured outputs
Fast JSON decoding with compressed finite state machines
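A sketch of schema-constrained decoding via the Engine API (schema, prompt, and model path are illustrative): passing a JSON schema in the sampling parameters constrains decoding so the output parses as an instance of the schema.

```python
import json
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["name", "year"],
}

# The compressed finite state machine built from the schema masks
# invalid tokens at each step, so the result is always valid JSON.
out = llm.generate(
    "Give me a JSON object describing the first Moon landing: ",
    {"json_schema": json.dumps(schema), "max_new_tokens": 128},
)
print(json.loads(out["text"]))
```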
Multi-LoRA batching
Serve multiple LoRA adapters in a single batch
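A sketch against the HTTP `/generate` endpoint, assuming a server launched with named adapters (adapter names and paths are illustrative); requests naming different adapters can still share a batch.

```python
import requests

# Assumed server launch (adapter names and paths are illustrative):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --lora-paths sql=./loras/sql chat=./loras/chat

# Each request names the adapter it wants; requests targeting different
# adapters are batched into the same forward pass.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Translate to SQL: all users created this week",
        "lora_path": "sql",
        "sampling_params": {"max_new_tokens": 64},
    },
)
print(resp.json()["text"])
```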
Quantization
Support for FP4, FP8, INT4, AWQ, and GPTQ quantization
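A sketch of selecting a quantization method at engine startup (method availability varies by model and hardware; the model path is illustrative).

```python
import sglang as sgl

# Quantize weights to FP8 at load time; the equivalent server flag is
# --quantization fp8.
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative
    quantization="fp8",
)
```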
Explore by topic
Deep dive into specific areas
Core concepts
Understand SGLang’s architecture and design
Backend runtime
Configure and run the serving backend
Frontend language
Build prompts with constrained generation
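A sketch of the frontend DSL, assuming a server is already running on port 30000: the regex constrains decoding, so the generated rating always matches the pattern.

```python
import sglang as sgl

# Point the frontend at a running SGLang server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def rate_dish(s, dish):
    s += "Rate the dish " + dish + " on a scale of 1 to 10.\n"
    # Constrained generation: the regex guarantees an integer in 1-10.
    s += "Rating: " + sgl.gen("rating", regex=r"(10|[1-9])")

state = rate_dish.run(dish="ramen")
print(state["rating"])
```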
Model support
100+ supported LLM and multimodal models
Advanced features
Structured outputs, LoRA, speculative decoding
Optimization
Performance tuning and optimization techniques
Hardware & deployment
Run SGLang on your infrastructure
Supported hardware
- NVIDIA GPUs (GB200, B300, H100, A100, Spark)
- AMD GPUs (MI355, MI300)
- Intel Xeon CPUs
- Google TPUs
- Ascend NPUs
Resources
Additional help and information
FAQ
Common questions and answers
Environment variables
Configuration reference
Troubleshooting
Debugging and common issues
Learn more
Blog posts, papers, and talks
Join the community
SGLang powers over 400,000 GPUs worldwide and is trusted by leading enterprises. Get help, share feedback, and connect with other users.
