This guide will help you launch an SGLang server and send your first requests using both the OpenAI-compatible API and the native SGLang API.

Prerequisites

1. Install SGLang

First, install SGLang using pip or uv:
pip install --upgrade pip
pip install uv
uv pip install sglang
See the Installation Guide for other installation methods.
2. Verify Installation

python -c "import sglang; print(sglang.__version__)"

Launch Your First Server

1. Start the SGLang server

Launch a server with a small model for testing:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 30000
The server will download the model from Hugging Face on first launch. Set the HF_TOKEN environment variable if you need to access gated models:
export HF_TOKEN=your_huggingface_token
For multi-GPU machines, add --tp to shard the model across GPUs with tensor parallelism:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp 2
2. Wait for the server to be ready

Look for the following message in the logs:
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000
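
Instead of watching the logs, you can poll the server's /health endpoint (listed under API Endpoints below) to detect readiness programmatically. A minimal sketch using only the standard library; the base URL matches the launch command above:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll base_url/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # server not accepting connections yet; retry
    return False


# Usage once the server is launching:
#   wait_for_server("http://127.0.0.1:30000")
```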

Send Your First Request

Using OpenAI-Compatible API

SGLang provides OpenAI-compatible endpoints, making it easy to integrate with existing applications.
from openai import OpenAI

# Create an OpenAI client pointing to SGLang server
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require authentication by default
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
Expected Output:
The capital of France is Paris. It is one of the most famous and beautiful cities in the world, known for its iconic landmarks like the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.

Using Native SGLang API

The native SGLang API provides a more Pythonic interface with advanced features.
import sglang as sgl
from sglang.srt.server_args import ServerArgs
import dataclasses

# Create an offline engine
server_args = ServerArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct"
)
llm = sgl.Engine(**dataclasses.asdict(server_args))

# Generate responses
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)

# Print outputs
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated: {output['text']}")
    print("=" * 50)
Expected Output:
Prompt: Hello, my name is
Generated: John, and I'm excited to share my story with you today. I grew up in a small town in the Midwest
==================================================
Prompt: The president of the United States is
Generated: the head of state and head of government of the United States of America. The president directs the executive branch
==================================================

Complete Working Example

Here’s a full end-to-end example you can run:
1. Create a Python script

Save this as quickstart.py:
from openai import OpenAI

# Initialize client
client = OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="EMPTY"
)

# Example 1: Simple chat completion
print("=" * 50)
print("Example 1: Simple Chat")
print("=" * 50)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    max_tokens=100
)
print(response.choices[0].message.content)

# Example 2: Multi-turn conversation
print("\n" + "=" * 50)
print("Example 2: Multi-turn Conversation")
print("=" * 50)
messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 15 * 23?"}
]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    max_tokens=50
)
assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}")

# Continue the conversation
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Now multiply that by 2."})
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    max_tokens=50
)
print(f"Assistant: {response.choices[0].message.content}")

# Example 3: Streaming response
print("\n" + "=" * 50)
print("Example 3: Streaming")
print("=" * 50)
print("Assistant: ", end="")
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    max_tokens=50,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

# Example 4: Batch requests
print("=" * 50)
print("Example 4: Batch Processing")
print("=" * 50)
questions = [
    "What is the capital of Japan?",
    "What is the capital of Germany?",
    "What is the capital of Brazil?"
]

for question in questions:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": question}],
        max_tokens=50,
        temperature=0.0  # Deterministic output
    )
    print(f"Q: {question}")
    print(f"A: {response.choices[0].message.content}")
    print()
2. Ensure the server is running

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 30000
3. Run the example

python quickstart.py
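
The batch example in quickstart.py sends its questions one at a time. Since the server batches concurrent requests together, issuing them in parallel is usually faster. Here is a sketch using a thread pool; ask() is a hypothetical stand-in for the client.chat.completions.create call from the script:

```python
from concurrent.futures import ThreadPoolExecutor


def ask(question: str) -> str:
    # Hypothetical stand-in: replace this body with the
    # client.chat.completions.create(...) call from quickstart.py.
    return f"(answer to: {question})"


questions = [
    "What is the capital of Japan?",
    "What is the capital of Germany?",
    "What is the capital of Brazil?",
]

# The thread pool issues the requests concurrently; the server batches them.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(ask, questions))

for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```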

API Endpoints

SGLang provides several API endpoints:
| Endpoint | Description | OpenAI Compatible |
|---|---|---|
| /v1/chat/completions | Chat completions | Yes |
| /v1/completions | Text completions | Yes |
| /v1/embeddings | Generate embeddings | Yes |
| /generate | Native SGLang generation | No |
| /get_model_info | Get model metadata | No |
| /health | Health check | No |
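
The native /generate endpoint takes a JSON body containing the prompt and sampling parameters. A minimal sketch of assembling such a request with the standard library; the payload shape ({"text": ..., "sampling_params": ...}) is an assumption based on SGLang's native API and may vary across versions:

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, sampling_params: dict) -> urllib.request.Request:
    """Build a POST request for SGLang's native /generate endpoint."""
    body = json.dumps({"text": prompt, "sampling_params": sampling_params}).encode()
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request(
    "http://127.0.0.1:30000",
    "The capital of France is",
    {"temperature": 0.0, "max_new_tokens": 32},
)
# Send with urllib.request.urlopen(req) while the server is running.
```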

Sampling Parameters

Control generation behavior with these common parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| temperature | float | 1.0 | Controls randomness (0.0 = deterministic, higher = more creative) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | int | 128 | Maximum tokens to generate |
| frequency_penalty | float | 0.0 | Penalize tokens by frequency (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalize tokens already present (-2.0 to 2.0) |
| stop | str/list | None | Stop sequences |
| n | int | 1 | Number of completions to generate |
| stream | bool | false | Enable streaming responses |
See Sampling Parameters for the complete list.
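
As a sanity check on the defaults above, here is a small helper that merges your overrides onto the documented defaults and rejects unknown names. The helper itself is illustrative, not part of SGLang:

```python
# Defaults taken from the sampling-parameters table above.
DEFAULTS = {
    "temperature": 1.0,
    "top_p": 1.0,
    "max_tokens": 128,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "stop": None,
    "n": 1,
    "stream": False,
}


def sampling_kwargs(**overrides):
    """Merge user overrides onto the documented defaults, rejecting typos."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown sampling parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


params = sampling_kwargs(temperature=0.0, max_tokens=50)
# Pass as: client.chat.completions.create(model=..., messages=..., **params)
```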

Common Use Cases

Text completion:
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8
)
print(response.choices[0].text)
Structured JSON output:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 colors in JSON format"}
    ],
    response_format={"type": "json_object"},
    max_tokens=100
)
print(response.choices[0].message.content)
Offline batch generation with the native engine:
import sglang as sgl
from sglang.srt.server_args import ServerArgs
import dataclasses

server_args = ServerArgs(model_path="meta-llama/Llama-3.1-8B-Instruct")
llm = sgl.Engine(**dataclasses.asdict(server_args))

prompts = [f"Question {i}: What is 2+{i}?" for i in range(10)]
outputs = llm.generate(prompts, {"temperature": 0.0})

for prompt, output in zip(prompts, outputs):
    print(f"{prompt} -> {output['text']}")

Troubleshooting

Out of memory error: reduce the fraction of GPU memory reserved for model weights and the KV cache:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.7
CUDA errors:
  • Verify GPU availability: nvidia-smi
  • Check CUDA version: nvcc --version
  • Ensure PyTorch can see GPUs: python -c "import torch; print(torch.cuda.is_available())"
Performance and memory tips:
  • Use tensor parallelism for multi-GPU: --tp 2
  • Enable FP8 quantization: --quantization fp8
  • Reduce context length: --context-length 4096
  • See Performance Tuning for more optimizations
Connection errors:
  • Ensure the server is running: look for the "Uvicorn running" message
  • Verify the port is not already in use: lsof -i :30000
  • Check firewall settings
  • Use the correct host and port in the client
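
As an alternative to lsof, you can check from Python whether anything is already listening on the port. A small standard-library sketch:

```python
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0


# True once the SGLang server is listening on its default port:
print(port_in_use("127.0.0.1", 30000))
```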

Next Steps

  • Server Arguments: learn about all available server configuration options
  • Sampling Parameters: control generation behavior with sampling parameters
  • Model Support: browse supported models and architectures
  • Production Deployment: deploy SGLang in production with monitoring