DeepSeek is a series of advanced reasoning-optimized models featuring Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) architectures. SGLang provides extensive optimizations specifically designed for DeepSeek models.
SGLang is the inference engine officially recommended by the DeepSeek team for DeepSeek-V3/R1.
Overview
Supported DeepSeek Models
- DeepSeek R1 (0528, 0730) - Latest reasoning models with RL
- DeepSeek V3.1/V3 - 671B MoE (37B active) with MLA
- DeepSeek V2 - Previous generation MLA+MoE
- DeepSeek-VL2 - Vision-language model
- DeepSeek-OCR / OCR-2 - Document understanding
- DeepSeek-Janus-Pro - Image understanding & generation
Key Features
- Multi-head Latent Attention (MLA): Compressed KV cache for efficiency
- Mixture-of-Experts (MoE): 671B total, 37B active parameters
- FP8 Native: Official models already in FP8 format
- Advanced Reasoning: Trained with reinforcement learning
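The compressed KV cache is what makes serving a 671B-parameter model practical. A back-of-the-envelope sketch of the savings, using values from the public DeepSeek-V3 config (num_hidden_layers=61, num_attention_heads=128, head_dim=128, kv_lora_rank=512, qk_rope_head_dim=64); treat the numbers as illustrative:

```python
# Back-of-the-envelope KV-cache comparison: standard MHA vs MLA.
# Config values below are from the public DeepSeek-V3 config; the
# comparison counts cached elements per token, ignoring dtype width.

num_layers = 61
num_heads = 128
head_dim = 128          # qk_nope_head_dim / v_head_dim
kv_lora_rank = 512      # dimension of the compressed latent c_KV
qk_rope_head_dim = 64   # decoupled RoPE key, cached alongside the latent

# Elements cached per token, summed over layers:
mha_per_token = num_layers * 2 * num_heads * head_dim          # full K and V
mla_per_token = num_layers * (kv_lora_rank + qk_rope_head_dim)  # latent + RoPE key

ratio = mha_per_token / mla_per_token
print(f"MHA: {mha_per_token} elements/token")
print(f"MLA: {mla_per_token} elements/token")
print(f"Compression: {ratio:.1f}x")   # roughly 57x fewer cached elements
```

This ratio is why MLA models can sustain much larger batch sizes at a given memory budget.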
Quick Start
Single Node (8×H200)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--trust-remote-code
Multi-Node Example (2×8 H100)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 16 \
--trust-remote-code
Hardware Requirements
Recommended configurations for DeepSeek V3/R1:
| Weight Type | Hardware Configuration |
|---|---|
| FP8 (recommended) | 8×H200, 8×B200, 8×MI300X, 2×8×H100/H800/H20, Xeon 6980P CPU |
| BF16 (upcast) | 2×8×H200, 2×8×MI300X, 4×8×H100/H800/H20, 4×8×A100/A800 |
| INT8 Quantized | 16×A100/A800, 32×L40S, 4×Atlas 800I A3 |
| W4A8 Quantized | 8×H20/H100, 4×H200 |
| AWQ Quantized | 8×H100/H800/H20, 8×A100/A800 |
| MXFP4 | 8×MI355X/350X, 4×MI355X/350X |
| NVFP4 | 8×B200, 4×B200 |
The official DeepSeek V3/R1 models are already in FP8 format. Do NOT use --quantization fp8 when loading them.
SGLang Optimizations for DeepSeek
SGLang provides several model-specific optimizations:
1. Multi-head Latent Attention (MLA)
Description: MLA compresses KV cache for improved efficiency. SGLang implements:
- Weight Absorption: Reordered computation for balanced memory access
- Multiple MLA Backends: FlashAttention3, Flashinfer, FlashMLA, CutlassMLA, TRTLLM MLA (Blackwell), Triton
- FP8 Quantization: W8A8 FP8 and KV Cache FP8
- CUDA Graph & Torch.compile: Reduced latency for small batches
- Chunked Prefix Cache: Long sequence optimization (FlashAttention3 only)
These optimizations achieve up to 7× higher output throughput.
Usage: MLA optimization is enabled by default.
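The weight-absorption trick above can be sketched in a few lines. The shapes here are illustrative, not the real model's: instead of up-projecting every cached latent back to a full key before scoring, the up-projection matrix is folded into the query once per step, since q·(W_uk c) = (W_uk^T q)·c.

```python
import numpy as np

# Sketch of MLA weight absorption (toy shapes, not the real model).
# Naive path: decompress each cached latent c to a key k = W_uk @ c,
# then score against the query. Absorbed path: fold W_uk into the query
# once, then score directly against the cached latents.

rng = np.random.default_rng(0)
d_head, d_latent = 128, 512
W_uk = rng.standard_normal((d_head, d_latent))  # latent -> key up-projection
q = rng.standard_normal(d_head)                 # query vector
c = rng.standard_normal(d_latent)               # compressed KV latent from cache

naive = q @ (W_uk @ c)        # per-token decompression, then dot product
absorbed = (W_uk.T @ q) @ c   # one projection of q, reused for every token

assert np.allclose(naive, absorbed)
print("attention scores match:", naive)
```

The absorbed form avoids materializing full keys for every cached token, which is where the memory-access savings come from.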
2. Data Parallelism Attention (DP Attention)
Description: Reduces KV cache size by distributing attention across DP workers, enabling larger batch sizes. KV cache is stored per DP rank instead of duplicating across all TP ranks.
Performance: Up to 1.9× throughput improvement in high batch size scenarios.
Usage:
# Single node with DP attention
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--enable-dp-attention \
--tp 8 \
--dp 8 \
--trust-remote-code
# Multi-node: 2 nodes, 8 H100 each
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--enable-dp-attention \
--tp 16 \
--dp 2 \
--trust-remote-code
DP attention is optimized for high-throughput scenarios with large batch sizes. Not recommended for low-latency, small-batch use cases.
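The capacity gain is easy to see with some toy arithmetic. MLA's latent cache has no head dimension to shard, so under pure TP every rank holds a full copy of every request's cache; with DP attention each request lives on one DP rank only. The memory budget and per-token cost below are made-up round numbers:

```python
# Illustrative KV-capacity math for DP attention with MLA.
# The budget and per-token cache size are assumptions for the sketch.

num_gpus = 8
kv_budget_per_gpu_gb = 60      # assumed free memory per GPU for KV cache
per_token_cache_mb = 0.035     # assumed MLA cache per token (FP8)

# Pure TP: the latent cache is duplicated on all ranks, so total
# capacity equals what a single GPU can hold.
tp_capacity_tokens = kv_budget_per_gpu_gb * 1024 / per_token_cache_mb

# DP attention (dp=8): each rank holds a disjoint shard of the requests.
dp_capacity_tokens = num_gpus * tp_capacity_tokens

print(f"pure TP : {tp_capacity_tokens:,.0f} cacheable tokens")
print(f"DP attn : {dp_capacity_tokens:,.0f} cacheable tokens ({num_gpus}x)")
```

More cacheable tokens translate directly into larger running batches, which is why the benefit shows up in high-throughput scenarios.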
3. Block-wise FP8 Quantization
Description: Optimized FP8 quantization with:
- Activation: E4M3 format with per-token-per-128-channel sub-vector scales
- Weight: Per-128×128-block quantization for numerical stability
- DeepGEMM: Kernel library optimized for FP8 matrix multiplications
Usage: Enabled by default on Hopper/Blackwell GPUs.
To precompile DeepGEMM kernels (recommended, ~10 minutes):
python3 -m sglang.compile_deep_gemm \
--model deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code
To disable DeepGEMM:
SGLANG_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server ...
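The per-128×128-block weight scaling can be sketched as follows. This only simulates the scale bookkeeping and E4M3 range clipping, not real FP8 bit packing or rounding:

```python
import numpy as np

# Sketch of per-128x128-block weight scaling, the idea behind block-wise
# FP8. One scale per block keeps an outlier in one block from crushing
# the dynamic range of every other block.

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 128

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 384)).astype(np.float32)

n_bi, n_bj = W.shape[0] // BLOCK, W.shape[1] // BLOCK
scales = np.zeros((n_bi, n_bj), np.float32)
W_q = np.zeros_like(W)
for i in range(n_bi):
    for j in range(n_bj):
        blk = W[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK]
        s = np.abs(blk).max() / E4M3_MAX     # per-block scale
        scales[i, j] = s
        W_q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = blk / s

# Dequantize; error is near zero here because we skip the actual
# FP8 mantissa rounding and only model the per-block scaling.
W_dq = np.zeros_like(W)
for i in range(n_bi):
    for j in range(n_bj):
        W_dq[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = \
            W_q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] * scales[i, j]

print("max |quantized value|:", np.abs(W_q).max())   # within E4M3 range
print("max roundtrip error  :", np.abs(W - W_dq).max())
```

Activations use the same idea at finer granularity: per-token, per-128-channel sub-vector scales.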
4. Multi-token Prediction (MTP)
Description: EAGLE-based speculative decoding for DeepSeek models.
Performance:
- 1.8× speedup for batch size 1
- 1.5× speedup for batch size 32
Usage:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--speculative-algorithm EAGLE \
--trust-remote-code \
--tp 8
Optional parameters (defaults shown):
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
For large batch sizes (>48), adjust:
--max-running-requests 64 \
--cuda-graph-bs 1,2,4,8,16,32,64
The first flag raises the default cap of 48 concurrent requests; the second customizes the CUDA graph batch sizes.
Enable the experimental overlap scheduler with SGLANG_ENABLE_SPEC_V2=1 for improved performance.
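The mechanism behind the speedup can be illustrated with a toy verification loop: the target model checks a block of draft tokens in one pass and keeps the longest correct prefix, so each step can emit more than one token. This is a greedy-acceptance sketch only; real EAGLE verification is tree-based and operates on logits:

```python
# Toy illustration of speculative-decoding verification.

def verify(draft, target_next, prefix):
    """Accept draft tokens greedily; append the target's own next token."""
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            return accepted + [expected]  # bonus token from the verify pass
        accepted.append(tok)
    return accepted + [target_next(prefix + accepted)]

# Hypothetical "target model": next token is (last token + 1) mod 10.
target_next = lambda seq: (seq[-1] + 1) % 10 if seq else 0

print(verify([1, 2, 3], target_next, prefix=[0]))  # all drafts correct -> 4 tokens
print(verify([1, 2, 9], target_next, prefix=[0]))  # third draft wrong  -> 3 tokens
```

Even on a rejection, the step still emits at least one correct token, so speculative decoding never produces wrong output, only variable speedup.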
5. Multi-Node Tensor Parallelism
Deploy DeepSeek across multiple nodes for models that don’t fit in single-node memory.
For a 2-node example, see the Multi-Node Example under Quick Start above.
Reasoning Content (DeepSeek R1 & V3.1)
DeepSeek R1 and V3.1 models can separate reasoning tokens from final answers.
Enable Reasoning Parser
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--reasoning-parser deepseek-r1 \
--trust-remote-code
Using Reasoning in Requests
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="-")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1",
messages=[{"role": "user", "content": "Solve: 2x + 5 = 13"}],
max_tokens=1024
)
# Access reasoning separately
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
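Under the hood, R1-style models emit their chain of thought between <think> and </think> tags before the final answer, and the reasoning parser splits the two apart. A simplified sketch of that split (the real parser also handles streaming and unclosed tags):

```python
import re

# Simplified sketch of what a deepseek-r1-style reasoning parser does:
# separate the <think>...</think> block from the final answer.

def split_reasoning(text: str):
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, text  # no reasoning block found

raw = "<think>2x = 13 - 5 = 8, so x = 4.</think>x = 4"
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)  # -> 2x = 13 - 5 = 8, so x = 4.
print("Answer:", answer)        # -> x = 4
```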
Thinking Budget
Control reasoning token budget with custom logit processors:
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--tp 8 \
--reasoning-parser deepseek-r1 \
--enable-custom-logit-processor
import openai
from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1",
messages=[{"role": "user", "content": "Question: Is Paris the capital of France?"}],
max_tokens=1024,
extra_body={
"custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
"custom_params": {"thinking_budget": 512},
},
)
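Conceptually, the budget processor caps how many tokens the model may spend inside the reasoning block before the end-of-thinking token is forced. A toy string-level version of that cutoff (illustrative only; the real processor operates on logits during sampling):

```python
# Toy thinking-budget cutoff: once the budget is spent inside the
# reasoning block, force the end-of-thinking sentinel so the answer begins.

END_THINK = "</think>"  # stand-in for the model's end-of-thinking token

def next_token(proposed, tokens_in_think, budget):
    if tokens_in_think >= budget:
        return END_THINK   # override whatever the model proposed
    return proposed

stream, in_think = [], 0
for proposed in ["step1", "step2", "step3", "step4"]:
    tok = next_token(proposed, in_think, budget=2)
    stream.append(tok)
    if tok == END_THINK:
        break
    in_think += 1

print(stream)  # -> ['step1', 'step2', '</think>']
```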
Function Calling
Enable tool calling for DeepSeek models:
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3-0324 \
--tp 8 \
--tool-call-parser deepseekv3 \
--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
Example Request
curl "http://127.0.0.1:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"temperature": 0,
"max_tokens": 100,
"model": "deepseek-ai/DeepSeek-V3-0324",
"tools": [{
"type": "function",
"function": {
"name": "query_weather",
"description": "Get weather of a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "The city name"}
},
"required": ["city"]
}
}
}],
"messages": [{"role": "user", "content": "How is the weather in Beijing?"}]
}'
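On the client side, a tool-call response is handled by reading message.tool_calls, parsing each call's JSON arguments, and dispatching to a local implementation. The response dict below is a hand-written stand-in for what the server would return, in the OpenAI chat-completions shape:

```python
import json

# Sketch of dispatching tool calls from a chat-completions response.

def dispatch(message, tools):
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append(tools[fn["name"]](**args))
    return results

# Hypothetical local implementation of the advertised query_weather tool.
tools = {"query_weather": lambda city: f"Sunny in {city}"}

message = {  # stand-in for response["choices"][0]["message"]
    "tool_calls": [{
        "function": {"name": "query_weather",
                     "arguments": '{"city": "Beijing"}'}
    }]
}
print(dispatch(message, tools))  # -> ['Sunny in Beijing']
```

The tool results would then be appended to the conversation as "tool" role messages for a follow-up completion.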
Multimodal DeepSeek Models
DeepSeek-VL2
Vision-language model for image understanding:
python3 -m sglang.launch_server \
--model-path deepseek-ai/deepseek-vl2 \
--tp 2 \
--trust-remote-code
DeepSeek-OCR / OCR-2
Document understanding and text extraction:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-OCR-2 \
--trust-remote-code
Example request:
import requests
url = "http://localhost:30000/v1/chat/completions"
data = {
"model": "deepseek-ai/DeepSeek-OCR-2",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "<image>\n<|grounding|>Convert the document to markdown."},
{"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}},
],
}
],
"max_tokens": 512,
}
response = requests.post(url, json=data)
print(response.text)
DeepSeek-Janus-Pro
Image understanding & generation:
python3 -m sglang.launch_server \
--model-path deepseek-ai/Janus-Pro-7B \
--trust-remote-code
AMD GPUs (MI300X)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--attention-backend triton \
--trust-remote-code
See: AMD GPU Guide
CPU (Xeon 6980P)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--device cpu \
--trust-remote-code
Ascend NPU (Atlas 800I A3)
See: Ascend NPU Guide
Quantization Options
INT8 Quantization
python3 -m sglang.launch_server \
--model-path meituan/DeepSeek-R1-Channel-INT8 \
--quantization int8 \
--tp 16
AWQ Quantization
python3 -m sglang.launch_server \
--model-path QuixiAI/DeepSeek-R1-0528-AWQ \
--quantization awq \
--tp 8
W4A8 Quantization
python3 -m sglang.launch_server \
--model-path novita/Deepseek-R1-0528-W4AFP8 \
--tp 8
Download Weights First
Ensure weights are fully downloaded before starting:
huggingface-cli download deepseek-ai/DeepSeek-R1
Increase Timeout for Large Models
--dist-timeout 3600 # 1 hour timeout
Parallel Weight Loading
--model-loader-extra-config '{"enable_multithread_load": true}'
Memory Optimization
--mem-fraction-static 0.9 # Adjust based on available memory
Troubleshooting
NCCL Timeout During Loading
Increase the distributed timeout:
--dist-timeout 3600
Out of Memory
Reduce memory fraction:
--mem-fraction-static 0.85
Or use quantized models (INT8/AWQ/W4A8).
Slow First Request
Precompile DeepGEMM kernels:
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8