Overview

SGLang supports deployment across major cloud platforms, leveraging managed services for Kubernetes, GPUs, and TPUs. This guide covers platform-specific configurations and best practices.

Amazon Web Services (AWS)

AWS SageMaker

AWS SageMaker provides managed inference with built-in SGLang container support.

Prerequisites

  • AWS account with SageMaker access
  • IAM role with SageMaker permissions
  • AWS CLI configured
  • SGLang container on Amazon ECR

Build and Push Container

# Set AWS configuration
export AWS_ACCOUNT="<YOUR_AWS_ACCOUNT>"
export AWS_REGION="<YOUR_AWS_REGION>"
export ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Build SageMaker container
docker build -f docker/sagemaker.Dockerfile -t sglang-sagemaker .

# Tag for ECR
docker tag sglang-sagemaker:latest ${ECR_REGISTRY}/sglang-sagemaker:latest

# Login to ECR
aws ecr get-login-password --region ${AWS_REGION} | \
  docker login --username AWS --password-stdin ${ECR_REGISTRY}

# Create repository if it doesn't exist
aws ecr create-repository --repository-name sglang-sagemaker --region ${AWS_REGION}

# Push image
docker push ${ECR_REGISTRY}/sglang-sagemaker:latest

Deploy Model Endpoint

Use the SageMaker Python SDK:
import os

import sagemaker
from sagemaker import Model

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# ECR registry exported earlier in the shell (export ECR_REGISTRY=...)
ecr_registry = os.environ["ECR_REGISTRY"]

# Define model
model = Model(
    image_uri=f"{ecr_registry}/sglang-sagemaker:latest",
    role=role,
    sagemaker_session=sagemaker_session,
    env={
        "SM_SGLANG_MODEL_PATH": "meta-llama/Llama-3.1-8B-Instruct",
        "SM_SGLANG_HOST": "0.0.0.0",
        "SM_SGLANG_PORT": "8080",
        "SM_SGLANG_TP": "1",
        "HF_TOKEN": "<your_hf_token>"
    }
)

# Deploy endpoint
predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
    endpoint_name="sglang-llama-endpoint"
)

SageMaker Environment Variables

The SageMaker container uses environment variables with the SM_SGLANG_ prefix:
SM_SGLANG_MODEL_PATH=/opt/ml/model  # Default model path
SM_SGLANG_HOST=0.0.0.0              # Server host
SM_SGLANG_PORT=8080                 # Server port
SM_SGLANG_TP=1                      # Tensor parallelism
All SGLang launch arguments can be set using this pattern:
SM_SGLANG_<ARGUMENT_NAME>=<value>
# Example: --max-running-requests becomes
SM_SGLANG_MAX_RUNNING_REQUESTS=32
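The flag-to-variable mapping can be sketched as a small helper (hypothetical, not part of the SGLang SDK), which is handy when generating the `env` dict for the `Model` definition:

```python
def to_sm_env(flag: str, value: str) -> tuple[str, str]:
    """Convert an SGLang launch flag (e.g. --max-running-requests)
    to its SM_SGLANG_-prefixed environment-variable form."""
    name = flag.lstrip("-").replace("-", "_").upper()
    return f"SM_SGLANG_{name}", str(value)

# to_sm_env("--max-running-requests", "32")
# -> ("SM_SGLANG_MAX_RUNNING_REQUESTS", "32")
```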

Query SageMaker Endpoint

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="sglang-llama-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100
    })
)

result = json.loads(response["Body"].read())
print(result)
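The endpoint returns an OpenAI-style chat-completion body, so the assistant's reply can be pulled out of the first choice (schema assumed; the mock below is illustrative):

```python
def extract_reply(result: dict) -> str:
    """Pull the assistant's message out of an OpenAI-style
    chat-completion response (schema assumed)."""
    return result["choices"][0]["message"]["content"]

# Minimal mock standing in for a real endpoint response:
mock = {"choices": [{"message": {"role": "assistant", "content": "Paris."}}]}
# extract_reply(mock) -> "Paris."
```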

AWS Deep Learning Containers

AWS maintains official SGLang containers with security patches:
# Check available images
aws ecr describe-images \
  --repository-name sglang \
  --registry-id <AWS_DLC_ACCOUNT> \
  --region ${AWS_REGION}
See AWS SGLang DLCs for the latest images.

Amazon EKS

Deploy SGLang on Elastic Kubernetes Service:

Create EKS Cluster

# Install eksctl
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Create cluster with GPU nodes
eksctl create cluster \
  --name sglang-cluster \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.2xlarge \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4

Install NVIDIA Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Deploy SGLang

Follow the Kubernetes deployment guide with EKS-specific configurations:
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: sglang
  ports:
    - port: 80
      targetPort: 30000

AWS EC2

Direct deployment on EC2 GPU instances:

Launch GPU Instance

# Launch a p3.2xlarge instance (AMI IDs are region-specific; use a current Deep Learning AMI for your region)
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type p3.2xlarge \
  --key-name your-key-pair \
  --security-groups sglang-sg \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=200}'

Install and Run SGLang

# SSH into instance
ssh -i your-key.pem ubuntu@<instance-ip>

# Pull and run Docker container
docker pull lmsysorg/sglang:latest
docker run -d --gpus all -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000
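Once the server is up, it can be queried over its OpenAI-compatible HTTP API. A stdlib-only sketch (the IP below is a placeholder; payload values are illustrative):

```python
import json
import urllib.request

# 203.0.113.10 is a placeholder — use your instance's public IP or DNS name.
base_url = "http://203.0.113.10:30000"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build a POST against SGLang's OpenAI-compatible chat endpoint."""
    return urllib.request.Request(
        f"{url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request(base_url, payload)
# Uncomment to send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```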

Google Cloud Platform (GCP)

Google Kubernetes Engine (GKE)

Create GKE Cluster with GPUs

# Create cluster
gcloud container clusters create sglang-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 4

# Install NVIDIA driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

Deploy SGLang on GKE

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-gke
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        command:
        - python3
        - -m
        - sglang.launch_server
        - --model-path
        - meta-llama/Llama-3.1-8B-Instruct
        - --host
        - 0.0.0.0
        - --port
        - "30000"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: cache
          mountPath: /root/.cache
      volumes:
      - name: cache
        emptyDir: {}
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-v100

Google Cloud TPU

SGLang supports TPU inference through the JAX backend:

Prerequisites

  • TPU v5e, v6e, or v7 instance
  • SGLang-JAX installation

Using SkyPilot

# sky-tpu.yaml
resources:
  cloud: gcp
  accelerators: tpu-v5e-8
  disk_size: 256

setup: |
  pip install "sglang-jax[tpu]"

run: |
  python -m sglang_jax.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
Deploy with SkyPilot:
# Install SkyPilot
pip install "skypilot-nightly[gcp]"

# Configure GCP access
gcloud auth application-default login

# Launch
sky launch -c tpu-cluster sky-tpu.yaml

# Check status
sky status tpu-cluster

Direct TPU VM Setup

# Create TPU VM
gcloud compute tpus tpu-vm create sglang-tpu \
  --zone=us-central2-b \
  --accelerator-type=v5litepod-8 \
  --version=tpu-ubuntu2204-base

# SSH into TPU VM
gcloud compute tpus tpu-vm ssh sglang-tpu --zone=us-central2-b

# Install SGLang-JAX
pip install "sglang-jax[tpu]"

# Launch server
python -m sglang_jax.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

Google Compute Engine

# Create GPU instance
gcloud compute instances create sglang-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-v100,count=1 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB \
  --maintenance-policy=TERMINATE

# Install NVIDIA drivers
gcloud compute ssh sglang-vm -- 'curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py && sudo python3 install_gpu_driver.py'

# Install Docker and run SGLang
gcloud compute ssh sglang-vm -- 'sudo apt-get update && sudo apt-get install -y docker.io nvidia-container-toolkit && sudo docker run -d --gpus all -p 30000:30000 lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000'

Microsoft Azure

Azure Kubernetes Service (AKS)

Create AKS Cluster

# Create resource group
az group create --name sglang-rg --location eastus

# Create AKS cluster with GPU nodes
az aks create \
  --resource-group sglang-rg \
  --name sglang-aks \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3 \
  --generate-ssh-keys

# Get credentials
az aks get-credentials --resource-group sglang-rg --name sglang-aks

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Deploy SGLang

Use standard Kubernetes manifests from the Kubernetes guide.

Azure VM

# Create GPU VM
az vm create \
  --resource-group sglang-rg \
  --name sglang-vm \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

# Open port 30000
az vm open-port --port 30000 --resource-group sglang-rg --name sglang-vm

# SSH and install
az vm run-command invoke \
  --resource-group sglang-rg \
  --name sglang-vm \
  --command-id RunShellScript \
  --scripts "curl -fsSL https://get.docker.com | sh && sudo apt-get install -y nvidia-container-toolkit && sudo docker run -d --gpus all -p 30000:30000 lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000"

Azure Container Instances

az container create \
  --resource-group sglang-rg \
  --name sglang-aci \
  --image lmsysorg/sglang:latest \
  --cpu 4 \
  --memory 16 \
  --gpu-count 1 \
  --gpu-sku V100 \
  --ports 30000 \
  --command-line "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000"

Other Cloud Providers

Oracle Cloud Infrastructure (OCI)

# Launch GPU instance
oci compute instance launch \
  --availability-domain <AD> \
  --compartment-id <COMPARTMENT_ID> \
  --shape VM.GPU3.1 \
  --image-id <UBUNTU_IMAGE_ID> \
  --subnet-id <SUBNET_ID>

Alibaba Cloud

# Create ECS instance with GPU
aliyun ecs CreateInstance \
  --RegionId cn-hangzhou \
  --InstanceType ecs.gn6i-c4g1.xlarge \
  --ImageId ubuntu_22_04_x64

Lambda Labs

Lambda Labs provides a cost-effective GPU cloud:
# Launch instance via Lambda Cloud dashboard
# SSH into instance
ssh ubuntu@<instance-ip>

# Install SGLang
pip install "sglang[all]"

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

Cloud Storage Integration

AWS S3 for Models

# Sync the model directory from S3 (a model repo contains many files,
# so copy the whole prefix rather than a single object)
aws s3 sync s3://my-bucket/models/llama-8b /models/llama-8b

# Launch with local model
python -m sglang.launch_server --model-path /models/llama-8b

Google Cloud Storage

# Download model
gsutil -m cp -r gs://my-bucket/models/llama-8b /models/

# Launch
python -m sglang.launch_server --model-path /models/llama-8b

Azure Blob Storage

# Download model from Blob Storage
az storage blob download-batch \
  --account-name mystorageaccount \
  --source models \
  --destination /models/

Cost Optimization

Use Spot/Preemptible Instances

AWS Spot Instances:
aws ec2 run-instances \
  --instance-type p3.2xlarge \
  --instance-market-options MarketType=spot
GCP Preemptible VMs:
gcloud compute instances create sglang-vm \
  --preemptible \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-v100,count=1
Azure Spot VMs:
az vm create \
  --priority Spot \
  --max-price -1 \
  --size Standard_NC6s_v3

Auto-Scaling

Implement cluster autoscaling to scale down during low usage:
# AWS EKS
aws eks update-nodegroup-config \
  --cluster-name sglang-cluster \
  --nodegroup-name gpu-nodes \
  --scaling-config minSize=0,maxSize=4,desiredSize=1

# GKE
gcloud container clusters update sglang-cluster \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 4

Security Best Practices

Network Security

  1. Use private subnets for compute instances
  2. Implement VPC peering for multi-region deployments
  3. Configure security groups to restrict access:
# AWS security group
aws ec2 create-security-group \
  --group-name sglang-sg \
  --description "SGLang security group"

aws ec2 authorize-security-group-ingress \
  --group-name sglang-sg \
  --protocol tcp \
  --port 30000 \
  --cidr 10.0.0.0/16
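The effect of that CIDR rule can be sanity-checked with Python's `ipaddress` module (example addresses are from documentation ranges):

```python
import ipaddress

# The security group only admits traffic from inside the VPC range.
allowed = ipaddress.ip_network("10.0.0.0/16")

# An address inside the VPC range is permitted by the rule...
print(ipaddress.ip_address("10.0.42.7") in allowed)    # True
# ...while a public address is not.
print(ipaddress.ip_address("203.0.113.9") in allowed)  # False
```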

Secrets Management

AWS Secrets Manager:
import boto3
client = boto3.client('secretsmanager')
response = client.get_secret_value(SecretId='hf-token')
token = response['SecretString']
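Secrets are frequently stored as JSON blobs rather than bare strings; if yours is, parse `SecretString` before use (the key name here is illustrative):

```python
import json

# Stands in for response['SecretString'] when the secret is a JSON blob
secret_string = '{"hf_token": "hf_xxx"}'
token = json.loads(secret_string)["hf_token"]
# token -> "hf_xxx"
```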
GCP Secret Manager:
echo -n "your-token" | gcloud secrets create hf-token --data-file=-
kubectl create secret generic hf-token \
  --from-literal=token=$(gcloud secrets versions access latest --secret=hf-token)
Azure Key Vault:
az keyvault secret set \
  --vault-name sglang-vault \
  --name hf-token \
  --value "your-token"

Monitoring and Logging

AWS CloudWatch

import boto3
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='SGLang',
    MetricData=[
        {
            'MetricName': 'Requests',
            'Value': 100,
            'Unit': 'Count'
        },
    ]
)

GCP Cloud Logging

gcloud logging read "resource.type=k8s_container AND resource.labels.container_name=sglang" \
  --limit 50 \
  --format json

Azure Monitor

az monitor metrics list \
  --resource /subscriptions/<sub-id>/resourceGroups/sglang-rg/providers/Microsoft.ContainerService/managedClusters/sglang-aks \
  --metric CPUUsagePercentage

Next Steps