Prerequisites
Install SGLang
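SGLang can be installed with pip; the `[all]` extra pulls in the serving dependencies (check the install docs for the variant matching your CUDA setup):

```bash
# Install SGLang with serving dependencies
pip install "sglang[all]"
```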
Launch Your First Server
Start the SGLang server
Launch a server with a small model for testing:
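For example (any supported model path works; `Qwen/Qwen2.5-0.5B-Instruct` is used here as a small model suitable for testing):

```bash
# Launch an OpenAI-compatible server on the default port 30000
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-0.5B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```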
The server will download the model from Hugging Face on first launch. Set the `HF_TOKEN` environment variable if you need to access gated models:
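For example, in the shell that will launch the server:

```bash
# Make a Hugging Face access token available to the server process
export HF_TOKEN=<your-token>
```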
Common Server Launch Options
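Some frequently used flags (not exhaustive, and names may vary by release; run the launcher with `--help` for the authoritative list):

```bash
# Frequently used launch flags:
#   --model-path <id-or-path>   model to serve (Hugging Face ID or local path)
#   --host / --port             bind address (port 30000 by default)
#   --tp <n>                    tensor parallelism across n GPUs
#   --quantization fp8          weight quantization
#   --context-length <n>        maximum context window
python -m sglang.launch_server --help
```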
Send Your First Request
Using OpenAI-Compatible API
SGLang provides OpenAI-compatible endpoints, making it easy to integrate with existing applications.
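A minimal sketch using only the Python standard library (it assumes the server from above is running on `localhost:30000`; the official `openai` client also works by pointing its `base_url` at the server):

```python
import json
from urllib import request

BASE_URL = "http://localhost:30000"  # default SGLang port

def build_chat_payload(prompt: str, **sampling) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

def chat(prompt: str, **sampling) -> str:
    """POST to /v1/chat/completions and return the reply text
    (requires a running server)."""
    body = json.dumps(build_chat_payload(prompt, **sampling)).encode()
    req = request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

payload = build_chat_payload("What is SGLang?", temperature=0.7, max_tokens=64)
print(json.dumps(payload, indent=2))
# With a running server: print(chat("What is SGLang?", max_tokens=64))
```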
Using Native SGLang API

The native SGLang API provides a more Pythonic interface with advanced features.
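A stdlib-only sketch of a native `/generate` call; note that the native API uses `max_new_tokens` rather than `max_tokens` (verify parameter names against your SGLang version):

```python
import json
from urllib import request

BASE_URL = "http://localhost:30000"

def build_generate_payload(text: str, **sampling_params) -> dict:
    """Build a request body for the native /generate endpoint."""
    return {"text": text, "sampling_params": sampling_params}

def generate(text: str, **sampling_params) -> dict:
    """POST to /generate (requires a running server)."""
    body = json.dumps(build_generate_payload(text, **sampling_params)).encode()
    req = request.Request(BASE_URL + "/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_generate_payload("The capital of France is",
                                 temperature=0.0, max_new_tokens=16)
print(payload["sampling_params"])
```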
Complete Working Example

Here’s a full end-to-end example you can run:
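The script below checks server health first, so it degrades gracefully when no server is running (endpoints and port as configured above; adjust `BASE_URL` if needed):

```python
import json
from urllib import error, request

BASE_URL = "http://localhost:30000"

def server_alive(timeout: float = 2.0) -> bool:
    """Return True if the server's /health endpoint responds with 200."""
    try:
        with request.urlopen(BASE_URL + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        return False

def chat(prompt: str, **sampling) -> str:
    """Send one chat-completions request and return the reply text."""
    body = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }).encode()
    req = request.Request(BASE_URL + "/v1/chat/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if server_alive():
    print(chat("Explain KV caching in one sentence.", max_tokens=64))
else:
    print(f"No server reachable at {BASE_URL}; launch one first.")
```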
API Endpoints

SGLang provides several API endpoints:

| Endpoint | Description | OpenAI Compatible |
|---|---|---|
| `/v1/chat/completions` | Chat completions | ✅ |
| `/v1/completions` | Text completions | ✅ |
| `/v1/embeddings` | Generate embeddings | ✅ |
| `/generate` | Native SGLang generation | ❌ |
| `/get_model_info` | Get model metadata | ❌ |
| `/health` | Health check | ❌ |
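A quick way to probe the non-OpenAI endpoints, sketched with the standard library:

```python
import json
from urllib import error, request

def get_model_info(base_url: str = "http://localhost:30000"):
    """Fetch model metadata from /get_model_info, or None if the
    server is unreachable."""
    try:
        with request.urlopen(base_url + "/get_model_info", timeout=2) as resp:
            return json.loads(resp.read())
    except (error.URLError, OSError):
        return None

print(get_model_info())  # None unless a local server is running
```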
Sampling Parameters
Control generation behavior with these common parameters:

Common Sampling Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `temperature` | float | 1.0 | Controls randomness (0.0 = deterministic, higher = more creative) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `max_tokens` | int | 128 | Maximum number of tokens to generate |
| `frequency_penalty` | float | 0.0 | Penalizes frequent tokens (-2.0 to 2.0) |
| `presence_penalty` | float | 0.0 | Penalizes tokens that have already appeared (-2.0 to 2.0) |
| `stop` | str / list | None | Stop sequences |
| `n` | int | 1 | Number of completions to generate |
| `stream` | bool | false | Enable streaming responses |
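For instance, a chat request body combining several of these parameters (values are illustrative):

```python
import json

# Example request body for /v1/chat/completions with explicit sampling
# parameters; defaults are listed in the table above.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "temperature": 0.8,       # more creative than the deterministic 0.0
    "top_p": 0.95,            # nucleus sampling threshold
    "max_tokens": 64,
    "frequency_penalty": 0.2,
    "stop": ["\n\n"],         # stop at the first blank line
    "n": 1,
    "stream": False,
}
print(json.dumps(payload, indent=2))
```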
Common Use Cases
Text Completion
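A sketch of a plain text-completion request against `/v1/completions` (stdlib-only; assumes the server configured above):

```python
import json
from urllib import request

def build_completion_payload(prompt: str, **sampling) -> dict:
    """Request body for the /v1/completions (text completion) endpoint."""
    return {"model": "default", "prompt": prompt, **sampling}

def complete(prompt: str, base: str = "http://localhost:30000", **sampling) -> str:
    """POST the completion request (requires a running server)."""
    body = json.dumps(build_completion_payload(prompt, **sampling)).encode()
    req = request.Request(base + "/v1/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

print(build_completion_payload("Once upon a time", max_tokens=32))
```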
JSON Mode / Structured Output
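One way to request schema-constrained output is through the native API's `sampling_params`; the `json_schema` key is assumed here, so confirm the exact field name in the structured-output docs for your SGLang version:

```python
import json

# A JSON schema describing the desired output shape
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

# Native /generate request constrained to that schema; the schema is
# passed as a JSON string inside sampling_params.
payload = {
    "text": "Give me information about the capital of France as JSON.",
    "sampling_params": {
        "temperature": 0.0,
        "max_new_tokens": 128,
        "json_schema": json.dumps(schema),
    },
}
print(payload["sampling_params"]["json_schema"])
```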
Batch Inference
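The native `/generate` endpoint accepts a list of prompts, so a whole batch can go out in one request (assumed behavior; verify against your version):

```python
import json
from urllib import request

def build_batch_payload(prompts, **sampling_params) -> dict:
    """Native /generate request body with a list of prompts for
    batched inference."""
    return {"text": list(prompts), "sampling_params": sampling_params}

def generate_batch(prompts, base: str = "http://localhost:30000", **sampling_params):
    """POST the batch (requires a running server)."""
    body = json.dumps(build_batch_payload(prompts, **sampling_params)).encode()
    req = request.Request(base + "/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_batch_payload(
    ["Translate 'hello' to French:", "Translate 'hello' to German:"],
    temperature=0.0, max_new_tokens=8,
)
print(len(payload["text"]))  # → 2
```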
Troubleshooting
Server won't start
Out of memory error: try a smaller model or a shorter `--context-length`.

CUDA errors:
- Verify GPU availability: `nvidia-smi`
- Check CUDA version: `nvcc --version`
- Ensure PyTorch can see GPUs: `python -c "import torch; print(torch.cuda.is_available())"`
Slow inference
- Use tensor parallelism for multi-GPU: `--tp 2`
- Enable FP8 quantization: `--quantization fp8`
- Reduce context length: `--context-length 4096`
- See Performance Tuning for optimization tips
Connection refused
- Ensure server is running: check for “Uvicorn running” message
- Verify the port is not in use: `lsof -i :30000`
- Check firewall settings
- Use correct host/port in client
Next Steps
Server Arguments
Learn about all available server configuration options
Sampling Parameters
Control generation behavior with sampling parameters
Model Support
Browse supported models and architectures
Production Deployment
Deploy SGLang in production with monitoring
