You can install SGLang using one of the methods below. This page primarily applies to common NVIDIA GPU platforms. For other platforms, please refer to the dedicated pages for AMD GPUs, Intel Xeon CPUs, TPU, Ascend NPUs, and Intel XPU.

Quick Installation

Step 1: Install with pip or uv (Recommended)

It is recommended to use uv for faster installation:
pip install --upgrade pip
pip install uv
uv pip install sglang
This will install SGLang with CUDA 12.9 support by default, which is compatible with most recent NVIDIA GPUs.
Step 2: Verify Installation

After installation, verify that SGLang is working:
python -c "import sglang; print(sglang.__version__)"

Installation Methods

Method 1: Install with pip or uv

The simplest way to install SGLang:
pip install --upgrade pip
pip install uv
uv pip install sglang
For CUDA 13 support on B300/GB300 GPUs, Docker is recommended. If you don’t have Docker access:
Step 1: Install PyTorch with CUDA 13

# Replace X.Y.Z with the version required by your SGLang install
uv pip install torch==X.Y.Z torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Step 2: Install SGLang

uv pip install sglang
Step 3: Install sgl_kernel for CUDA 13

Install the appropriate wheel from sgl-project whl releases:
uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sgl_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl"
Replace X.Y.Z with the sgl_kernel version from uv pip show sgl_kernel.
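To make the wheel-naming pattern explicit, the URL above decomposes as shown below. X.Y.Z remains a placeholder for the version reported by uv pip show sgl_kernel; cu130 and cp310-abi3 identify the CUDA 13, Python 3.10+ abi3 build.

```python
# Illustration of the sgl_kernel wheel URL pattern used above.
# "X.Y.Z" stays a placeholder; substitute the version from `uv pip show sgl_kernel`.
version = "X.Y.Z"
url = (
    "https://github.com/sgl-project/whl/releases/download/"
    f"v{version}/sgl_kernel-{version}+cu130-cp310-abi3-manylinux2014_x86_64.whl"
)
print(url)
```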
CUDA_HOME not set: If you encounter OSError: CUDA_HOME environment variable is not set, try one of these solutions:
Set CUDA_HOME explicitly:
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
Or reinstall FlashInfer and clear its cache:
pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer
Blackwell GPU (B300/GB300) ptxas error: Point Triton at the CUDA ptxas:
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

Method 2: Install from Source

For development or to use the latest features:
# Use the last release branch
git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"
For development, use the dev docker image lmsysorg/sglang:dev. See the development guide for details.

Method 3: Using Docker

Docker images are available at lmsysorg/sglang.
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
Replace <secret> with your Hugging Face hub token.
The runtime variant is ~40% smaller than the full image by excluding build tools and development dependencies, making it ideal for production deployments.

Method 4: Using Kubernetes

Check out OME, a Kubernetes operator for enterprise-grade LLM serving.
For models that fit on one node:
kubectl apply -f docker/k8s-sglang-service.yaml
For large models requiring multiple GPU nodes:
# Modify the model path and arguments in the file first
kubectl apply -f docker/k8s-sglang-distributed-sts.yaml

Method 5: Using Docker Compose

For production, use the k8s-sglang-service.yaml instead.
Step 1: Copy compose.yml

Download the compose.yml to your local machine.
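As a sketch of what such a file contains, the settings below mirror the docker run command from Method 3 (image, port, Hugging Face cache mount, HF_TOKEN, and shared memory); the official compose.yml in the SGLang repository is authoritative, and the exact fields there may differ.

```yaml
# Sketch only — prefer the official compose.yml from the SGLang repo.
services:
  sglang:
    image: lmsysorg/sglang:latest
    ports:
      - "30000:30000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      HF_TOKEN: <secret>
    shm_size: 32gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```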
Step 2: Launch the service

docker compose up -d

Method 6: Using SkyPilot

Deploy on Kubernetes or 12+ clouds with SkyPilot.
Step 1: Install SkyPilot

Follow SkyPilot’s documentation to install and configure cloud access.
Step 2: Create sglang.yaml

sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
Step 3: Deploy

# Deploy on any cloud or Kubernetes
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
For autoscaling and failure recovery, check out the SkyServe + SGLang guide.

Method 7: AWS SageMaker

AWS provides SGLang DLCs (Deep Learning Containers) with routine security patching. To host with your own container:
Step 1: Build Docker container

Build with sagemaker.Dockerfile and the serve script.
Step 2: Push to AWS ECR

build-and-push.sh
#!/bin/bash
AWS_ACCOUNT="<YOUR_AWS_ACCOUNT>"
AWS_REGION="<YOUR_AWS_REGION>"
REPOSITORY_NAME="<YOUR_REPOSITORY_NAME>"
IMAGE_TAG="<YOUR_IMAGE_TAG>"

ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"
IMAGE_URI="${ECR_REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG}"

# Login to ECR
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY}

# Build and push
docker build -t ${IMAGE_URI} -f sagemaker.Dockerfile .
docker push ${IMAGE_URI}
Step 3: Deploy model

Use deploy_and_serve_endpoint.py to deploy. See sagemaker-python-sdk for more details.
Customize server parameters using environment variables with the SM_SGLANG_ prefix. For example, SM_SGLANG_MODEL_PATH=Qwen/Qwen3-0.6B and SM_SGLANG_REASONING_PARSER=qwen3.
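The naming convention can be sketched as follows. The mapping function here is hypothetical and for illustration only — the real translation from environment variables to server flags happens inside the serving container.

```python
# Hypothetical sketch of the SM_SGLANG_* naming convention: each variable
# maps to the server flag with the prefix stripped, lowercased, and
# underscores replaced by dashes.
def env_to_flag(name: str) -> str:
    return "--" + name.removeprefix("SM_SGLANG_").lower().replace("_", "-")

print(env_to_flag("SM_SGLANG_MODEL_PATH"))        # --model-path
print(env_to_flag("SM_SGLANG_REASONING_PARSER"))  # --reasoning-parser
```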

Hardware-Specific Installation

AMD GPUs (ROCm)

For AMD GPUs like MI300X:
git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install sglang
cd ..
rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
pip install -e "python[all_hip]"
See the AMD GPU documentation for system optimization and tuning guides.

Google TPU

SGLang supports TPUs through the SGLang-JAX backend:
pip install sglang-jax
See the TPU documentation for feature support and optimized models.

Intel CPUs

See the CPU Server documentation for Intel Xeon CPU deployment instructions.

Ascend NPUs

See the Ascend NPU documentation for Huawei Ascend NPU installation.

Dependencies

SGLang has the following core dependencies (from pyproject.toml):
  • Python: >=3.10
  • PyTorch: 2.9.1 (CUDA 12.9 by default)
  • FlashInfer: 0.6.4 (attention kernel backend)
  • Transformers: 4.57.1
  • FastAPI: Web server framework
  • OpenAI: 2.6.1 (API compatibility)
  • sgl-kernel: 0.3.21 (custom CUDA kernels)
Optional dependencies:
  • Diffusion: For image/video generation models
  • Tracing: OpenTelemetry integration
  • Test: Development and testing tools
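As a quick sanity check against the pins above, you can query what is actually installed in the current environment. The distribution names below are assumptions about how these packages are published; anything absent is simply flagged.

```python
# Sketch: report installed versions of a few of the core dependencies.
# Distribution names are assumptions; missing packages are flagged.
from importlib.metadata import PackageNotFoundError, version

for dist in ["torch", "flashinfer-python", "transformers", "sgl-kernel"]:
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")
```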

Common Notes

FlashInfer: Default attention kernel backend. It only supports sm75 and above (T4, A10, A100, L4, L40S, H100, B200, etc.). If you encounter FlashInfer issues on a supported GPU, switch to alternative backends:
python -m sglang.launch_server --model-path <model> --attention-backend triton --sampling-backend pytorch
Shared Memory: Docker and Kubernetes deployments require sufficient shared memory (--shm-size 32g for Docker, update /dev/shm size for Kubernetes).
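For Kubernetes, the usual fix is a memory-backed emptyDir volume mounted at /dev/shm. A sketch, with the volume name, size, and container name as placeholders:

```yaml
# Sketch (assumption): enlarge /dev/shm for a pod via a memory-backed emptyDir.
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 32Gi
containers:
  - name: sglang
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
```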

Next Steps

After installation:
  1. Launch your first server
  2. Send API requests
  3. Explore server arguments
  4. Learn about model support