
SGLang

High-performance serving framework for large language models and multimodal models

Achieve low-latency, high-throughput inference from a single GPU to large distributed clusters with RadixAttention, continuous batching, and advanced optimizations.

Quick start

Get SGLang running in minutes

1. Install SGLang

Install SGLang using pip:
pip install "sglang[all]"
For AMD ROCm GPUs, install from the ROCm wheel index:
pip install "sglang[all]" --index-url https://download.pytorch.org/whl/rocm6.2
2. Launch a server

Start the SGLang server with a model:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
Or use the CLI:
sglang serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
3. Send a request

Use the OpenAI-compatible API to send requests:
import openai

client = openai.Client(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
Expected output:
The capital of France is Paris.

Key features

Everything you need for high-performance LLM serving

RadixAttention

Efficient prefix caching that automatically reuses KV cache across requests
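The idea behind prefix caching can be sketched in a few lines of Python. This is a toy illustration of prefix reuse only, not SGLang's actual radix-tree or attention kernels: requests that share a prompt prefix (for example, a common system prompt) reuse the cached work for that prefix instead of recomputing it.

```python
# Toy sketch of prefix caching: a later request that shares a token
# prefix with an earlier one only "computes" the non-shared suffix.
# Illustrative only -- not SGLang's radix-tree implementation.

class ToyPrefixCache:
    def __init__(self):
        self.cache = {}            # token prefix (tuple) -> cached state
        self.recomputed_tokens = 0

    def prefill(self, tokens):
        """Reuse the longest cached prefix; return its length in tokens."""
        tokens = tuple(tokens)
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tokens[:i] in self.cache:
                hit = i
                break
        # Only the uncached suffix costs compute.
        self.recomputed_tokens += len(tokens) - hit
        for i in range(hit + 1, len(tokens) + 1):
            self.cache[tokens[:i]] = True
        return hit

cache = ToyPrefixCache()
system = ["You", "are", "helpful", "."]
cache.prefill(system + ["What", "is", "2+2?"])    # cold: 7 tokens computed
reused = cache.prefill(system + ["Hi", "there"])  # warm: 4-token prefix reused
print(reused, cache.recomputed_tokens)
```

The second request computes only 2 new tokens because the 4-token system prompt is served from the cache; SGLang does this automatically and at the KV-cache level.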

Continuous batching

Zero-overhead CPU scheduler that overlaps scheduling with GPU execution to keep GPUs fully utilized

Distributed serving

Multi-dimensional parallelism: tensor, pipeline, expert, and data parallelism

Structured outputs

Fast JSON decoding with compressed finite state machines

Multi-LoRA batching

Serve multiple LoRA adapters in a single batch
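Conceptually, each request applies a different low-rank delta W + B·A while every request shares the same base weight W. A minimal numeric sketch of that arithmetic (pure Python with hypothetical adapter names; real serving fuses this into batched GPU kernels):

```python
# Toy sketch of multi-LoRA serving: one shared base weight W, plus a
# per-request low-rank delta B @ A (rank 1 here). Adapter names are
# made up for illustration.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def vecadd(a, b):
    return [u + v for u, v in zip(a, b)]

W = [[1.0, 0.0],
     [0.0, 1.0]]  # shared base weight (identity, for clarity)

adapters = {
    "adapter_a": ([[1.0], [0.0]], [[0.0, 2.0]]),  # B (2x1), A (1x2)
    "adapter_b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def lora_forward(x, adapter):
    B, A = adapters[adapter]
    base = matvec(W, x)              # W @ x, shared across the batch
    delta = matvec(B, matvec(A, x))  # B @ (A @ x), adapter-specific
    return vecadd(base, delta)

# One batch, two requests, each routed to a different adapter:
print(lora_forward([1.0, 1.0], "adapter_a"))  # [3.0, 1.0]
print(lora_forward([1.0, 1.0], "adapter_b"))  # [1.0, 4.0]
```

Because the deltas are low-rank and the base weight is shared, many adapters fit in memory at once and can be mixed freely within a single batch.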

Quantization

Support for FP4, FP8, INT4, AWQ, and GPTQ quantization
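The general idea behind integer schemes like INT4 can be sketched with symmetric quantization: store weights as small integers plus one scale, and reconstruct q * scale at compute time. This is a generic illustration of the technique, not SGLang's FP4/FP8 kernels:

```python
# Toy sketch of symmetric integer quantization: weights become small
# integers in [-qmax, qmax] plus a single float scale. Illustrative
# of INT4-style schemes in general, not SGLang's internals.

def quantize(weights, bits=4):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.7, -0.35, 0.07, 0.0]
q, scale = quantize(w, bits=4)
approx = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(w, approx))
print(q)    # [7, -4, 1, 0] -- 4 bits each instead of 32
print(err)  # reconstruction error bounded by about scale / 2
```

The memory saving (4 bits per weight instead of 16 or 32) is what lets larger models fit on fewer GPUs, at the cost of the small per-weight rounding error shown above.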

Explore by topic

Deep dive into specific areas

Core concepts

Understand SGLang’s architecture and design

Backend runtime

Configure and run the serving backend

Frontend language

Build prompts with constrained generation

Model support

100+ supported LLM and multimodal models

Advanced features

Structured outputs, LoRA, speculative decoding

Optimization

Performance tuning and optimization techniques

Hardware & deployment

Run SGLang on your infrastructure

Supported hardware

  • NVIDIA GPUs (GB200, B300, H100, A100, Spark)
  • AMD GPUs (MI355, MI300)
  • Intel Xeon CPUs
  • Google TPUs
  • Ascend NPUs

Resources

Additional help and information

FAQ

Common questions and answers

Environment variables

Configuration reference

Troubleshooting

Debugging and common issues

Learn more

Blog posts, papers, and talks

Join the community

SGLang powers over 400,000 GPUs worldwide and is trusted by leading enterprises. Get help, share feedback, and connect with other users.