
SGLang

High-performance serving framework for large language models and multimodal models

Achieve low-latency, high-throughput inference from a single GPU to large distributed clusters with RadixAttention, continuous batching, and advanced optimizations.

Quick start

Get SGLang running in minutes

1. Install SGLang

Install SGLang using pip:
pip install "sglang[all]"
For AMD ROCm GPUs, install from the ROCm wheel index:
pip install "sglang[all]" --index-url https://download.pytorch.org/whl/rocm6.2
2. Launch a server

Start the SGLang server with a model:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
Or use the CLI:
sglang serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
3. Send a request

Use the OpenAI-compatible API to send requests:
import openai

client = openai.Client(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
Expected output:
The capital of France is Paris.

Key features

Everything you need for high-performance LLM serving

RadixAttention

Efficient prefix caching that automatically reuses KV cache across requests
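The idea behind prefix caching can be sketched in a few lines of Python. This is a toy illustration of prefix reuse only, not SGLang's actual radix-tree or attention kernels: requests that share a prompt prefix (for example, a common system prompt) reuse the cached work for that prefix instead of recomputing it.

```python
# Toy sketch of prefix caching: a later request that shares a token
# prefix with an earlier one only "computes" the non-shared suffix.
# Illustrative only -- not SGLang's radix-tree implementation.

class ToyPrefixCache:
    def __init__(self):
        self.cache = {}            # token prefix (tuple) -> cached state
        self.recomputed_tokens = 0

    def prefill(self, tokens):
        """Reuse the longest cached prefix; return its length in tokens."""
        tokens = tuple(tokens)
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tokens[:i] in self.cache:
                hit = i
                break
        # Only the uncached suffix costs compute.
        self.recomputed_tokens += len(tokens) - hit
        for i in range(hit + 1, len(tokens) + 1):
            self.cache[tokens[:i]] = True
        return hit

cache = ToyPrefixCache()
system = ["You", "are", "helpful", "."]
cache.prefill(system + ["What", "is", "2+2?"])    # cold: 7 tokens computed
reused = cache.prefill(system + ["Hi", "there"])  # warm: 4-token prefix reused
print(reused, cache.recomputed_tokens)
```

The second request computes only 2 new tokens because the 4-token system prompt is served from the cache; SGLang does this automatically and at the KV-cache level.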

Continuous batching

Zero-overhead CPU scheduler that overlaps scheduling with GPU execution to keep GPUs fully utilized

Distributed serving

Multi-dimensional parallelism: tensor, pipeline, expert, and data parallelism

Structured outputs

Fast JSON decoding with compressed finite state machines

Multi-LoRA batching

Serve multiple LoRA adapters in a single batch
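Conceptually, each request applies a different low-rank delta W + B·A while every request shares the same base weight W. A minimal numeric sketch of that arithmetic (pure Python with hypothetical adapter names; real serving fuses this into batched GPU kernels):

```python
# Toy sketch of multi-LoRA serving: one shared base weight W, plus a
# per-request low-rank delta B @ A (rank 1 here). Adapter names are
# made up for illustration.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def vecadd(a, b):
    return [u + v for u, v in zip(a, b)]

W = [[1.0, 0.0],
     [0.0, 1.0]]  # shared base weight (identity, for clarity)

adapters = {
    "adapter_a": ([[1.0], [0.0]], [[0.0, 2.0]]),  # B (2x1), A (1x2)
    "adapter_b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def lora_forward(x, adapter):
    B, A = adapters[adapter]
    base = matvec(W, x)              # W @ x, shared across the batch
    delta = matvec(B, matvec(A, x))  # B @ (A @ x), adapter-specific
    return vecadd(base, delta)

# One batch, two requests, each routed to a different adapter:
print(lora_forward([1.0, 1.0], "adapter_a"))  # [3.0, 1.0]
print(lora_forward([1.0, 1.0], "adapter_b"))  # [1.0, 4.0]
```

Because the deltas are low-rank and the base weight is shared, many adapters fit in memory at once and can be mixed freely within a single batch.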

Quantization

Support for FP4, FP8, INT4, AWQ, and GPTQ quantization
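The general idea behind integer schemes like INT4 can be sketched with symmetric quantization: store weights as small integers plus one scale, and reconstruct q * scale at compute time. This is a generic illustration of the technique, not SGLang's FP4/FP8 kernels:

```python
# Toy sketch of symmetric integer quantization: weights become small
# integers in [-qmax, qmax] plus a single float scale. Illustrative
# of INT4-style schemes in general, not SGLang's internals.

def quantize(weights, bits=4):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.7, -0.35, 0.07, 0.0]
q, scale = quantize(w, bits=4)
approx = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(w, approx))
print(q)    # [7, -4, 1, 0] -- 4 bits each instead of 32
print(err)  # reconstruction error bounded by about scale / 2
```

The memory saving (4 bits per weight instead of 16 or 32) is what lets larger models fit on fewer GPUs, at the cost of the small per-weight rounding error shown above.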

Explore by topic

Deep dive into specific areas

Core concepts

Understand SGLang’s architecture and design

Backend runtime

Configure and run the serving backend

Frontend language

Build prompts with constrained generation

Model support

100+ supported LLM and multimodal models

Advanced features

Structured outputs, LoRA, speculative decoding

Optimization

Performance tuning and optimization techniques

Hardware & deployment

Run SGLang on your infrastructure

Supported hardware

  • NVIDIA GPUs (GB200, B300, H100, A100, Spark)
  • AMD GPUs (MI355, MI300)
  • Intel Xeon CPUs
  • Google TPUs
  • Ascend NPUs

Resources

Additional help and information

FAQ

Common questions and answers

Environment variables

Configuration reference

Troubleshooting

Debugging and common issues

Learn more

Blog posts, papers, and talks

Join the community

SGLang powers over 400,000 GPUs worldwide and is trusted by leading enterprises. Get help, share feedback, and connect with other users.