DeepSeek is a series of advanced reasoning-optimized models featuring Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) architectures. SGLang provides extensive optimizations specifically designed for DeepSeek models.
SGLang is the inference engine officially recommended by the DeepSeek team for DeepSeek-V3/R1.

Overview

Supported DeepSeek Models

  • DeepSeek R1 (0528, 0730) - Latest reasoning models with RL
  • DeepSeek V3.1/V3 - 671B MoE (37B active) with MLA
  • DeepSeek V2 - Previous generation MLA+MoE
  • DeepSeek-VL2 - Vision-language model
  • DeepSeek-OCR / OCR-2 - Document understanding
  • DeepSeek-Janus-Pro - Image understanding & generation

Key Features

  • Multi-head Latent Attention (MLA): Compressed KV cache for efficiency
  • Mixture-of-Experts (MoE): 671B total, 37B active parameters
  • FP8 Native: Official models already in FP8 format
  • Advanced Reasoning: Trained with reinforcement learning
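To see why MLA matters for memory, here is a back-of-envelope comparison of per-token KV cache size. The dimensions (128 heads of size 128, latent dimension 512, decoupled RoPE dimension 64) follow the DeepSeek-V2/V3 papers and are illustrative assumptions, not SGLang internals:

```python
# Rough per-token KV cache comparison: standard MHA vs. MLA.
num_heads = 128
head_dim = 128
kv_lora_rank = 512    # compressed latent dimension
rope_head_dim = 64    # decoupled RoPE key dimension

# Standard multi-head attention caches full K and V for every head.
mha_entries = 2 * num_heads * head_dim       # 32768 values/token

# MLA caches only the compressed latent plus the shared RoPE key.
mla_entries = kv_lora_rank + rope_head_dim   # 576 values/token

print(f"MHA: {mha_entries} values/token")
print(f"MLA: {mla_entries} values/token")
print(f"compression: {mha_entries / mla_entries:.1f}x")  # ~56.9x
```

The compressed cache is what lets SGLang hold far more concurrent requests in GPU memory at the same sequence length.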

Quick Start

Single Node (8×H200)

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --trust-remote-code

Multi-Node Example (2×8 H100)

Run one command per node, with both pointing at node 0's address (192.168.1.1:5000 below is a placeholder):

# Node 0
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code

# Node 1
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code

Hardware Requirements

Recommended configurations for DeepSeek V3/R1:
| Weight Type | Hardware Configuration |
| --- | --- |
| FP8 (recommended) | 8×H200; 8×B200; 8×MI300X; 2×8×H100/H800/H20; Xeon 6980P CPU |
| BF16 (upcast) | 2×8×H200; 2×8×MI300X; 4×8×H100/H800/H20; 4×8×A100/A800 |
| INT8 Quantized | 16×A100/A800; 32×L40S; 4×Atlas 800I A3 |
| W4A8 Quantized | 8×H20/H100; 4×H200 |
| AWQ Quantized | 8×H100/H800/H20; 8×A100/A800 |
| MXFP4 | 8×MI355X/350X; 4×MI355X/350X |
| NVFP4 | 8×B200; 4×B200 |
The official DeepSeek V3/R1 models are already in FP8 format. Do NOT use --quantization fp8 when loading them.

SGLang Optimizations for DeepSeek

SGLang provides several model-specific optimizations:

1. Multi-head Latent Attention (MLA)

Description: MLA compresses KV cache for improved efficiency. SGLang implements:
  • Weight Absorption: Reordered computation for balanced memory access
  • Multiple MLA Backends: FlashAttention3, FlashInfer, FlashMLA, CutlassMLA, TRTLLM MLA (Blackwell), Triton
  • FP8 Quantization: W8A8 FP8 and KV Cache FP8
  • CUDA Graph & Torch.compile: Reduced latency for small batches
  • Chunked Prefix Cache: Long sequence optimization (FlashAttention3 only)
These optimizations deliver up to 7× higher output throughput. Usage: MLA optimization is enabled by default; no extra flag is required.

2. Data Parallelism Attention (DP Attention)

Description: Distributes attention across data-parallel workers, storing the KV cache once per DP rank instead of duplicating it across all TP ranks. This reduces KV cache memory per request and enables larger batch sizes. Performance: up to 1.9× higher throughput at high batch sizes. Usage:
# Single node with DP attention
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --enable-dp-attention \
  --tp 8 \
  --dp 8 \
  --trust-remote-code

# Multi-node: 2 nodes, 8 H100 each (run on both nodes; use --node-rank 1
# on the second node; 192.168.1.1:5000 is a placeholder for node 0's address)
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --enable-dp-attention \
  --tp 16 \
  --dp 2 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code
DP attention is optimized for high-throughput scenarios with large batch sizes. Not recommended for low-latency, small-batch use cases.
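A quick capacity estimate shows why per-rank storage helps. This sketch assumes the MLA cache would otherwise be replicated on every TP rank (as described above) and uses a hypothetical 40 GB free-memory pool per GPU:

```python
# Illustrative KV-capacity comparison: TP-replicated cache vs. DP attention.
# The 40 GB pool size is an assumption for illustration only.
num_gpus = 8
kv_pool_per_gpu_gb = 40

# Pure TP with MLA: every rank stores the same cache, so total
# capacity is bounded by a single GPU's pool.
tp_capacity_gb = kv_pool_per_gpu_gb

# DP attention: each DP rank stores only its own requests' cache,
# so aggregate capacity scales with the number of ranks.
dp_capacity_gb = kv_pool_per_gpu_gb * num_gpus

print(tp_capacity_gb, dp_capacity_gb)  # 40 320
```

The 8× larger aggregate cache is what enables the larger batch sizes that drive the throughput gain.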

3. Block-wise FP8 Quantization

Description: Optimized FP8 quantization with:
  • Activation: E4M3 format with per-token-per-128-channel sub-vector scales
  • Weight: Per-128×128-block quantization for numerical stability
  • DeepGEMM: Kernel library optimized for FP8 matrix multiplications
Usage: Enabled by default on Hopper/Blackwell GPUs. To precompile DeepGEMM kernels (recommended, ~10 minutes):
python3 -m sglang.compile_deep_gemm \
  --model deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code
To disable DeepGEMM:
SGLANG_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server ...
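The block-wise weight scheme described above can be sketched in plain NumPy. This is a simulation of per-128×128-block scaling (using 448, the largest finite E4M3 value, as the scale target), not the actual DeepGEMM kernels:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def blockwise_fp8_quant(w: np.ndarray, block: int = 128):
    """Simulate per-(block x block) FP8 weight quantization.

    Each tile gets its own scale = amax(tile) / 448, so an outlier in
    one tile cannot crush the precision of every other tile.
    """
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.float32)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / E4M3_MAX
            scales[i // block, j // block] = s
            # Integer rounding here is a coarse stand-in for FP8 rounding.
            q[i:i + block, j:j + block] = np.round(tile / s) * s
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scales = blockwise_fp8_quant(w)
print(scales.shape)                # (2, 2): one scale per 128x128 tile
print(float(np.abs(q - w).max())) # small per-element quantization error
```

The per-token sub-vector scaling on activations follows the same idea, just along 128-channel groups instead of square blocks.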

4. Multi-token Prediction (MTP)

Description: EAGLE-based speculative decoding for DeepSeek models. Performance:
  • 1.8× speedup for batch size 1
  • 1.5× speedup for batch size 32
Usage:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --speculative-algorithm EAGLE \
  --trust-remote-code \
  --tp 8
Optional parameters (defaults shown):
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
For large batch sizes (>48), adjust:
--max-running-requests 64         # increase from the default 48
--cuda-graph-bs 1,2,4,8,16,32,64  # customize CUDA graph batch sizes
Set SGLANG_ENABLE_SPEC_V2=1 to enable the experimental overlap scheduler for additional speedup.

5. Multi-Node Tensor Parallelism

Deploy DeepSeek across multiple nodes when the model does not fit in a single node's memory.
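A typical two-node tensor-parallel launch looks like the following. The address 192.168.1.1:5000 is a placeholder for node 0; run one command per node:

```shell
# Node 0
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code

# Node 1 (same command, different rank)
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code
```

All nodes must be able to reach the --dist-init-addr endpoint; --tp is the total number of GPUs across all nodes.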

Reasoning Content (DeepSeek R1 & V3.1)

DeepSeek R1 and V3.1 models can separate reasoning tokens from final answers.

Enable Reasoning Parser

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --reasoning-parser deepseek-r1 \
  --trust-remote-code

Using Reasoning in Requests

import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="-")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Solve: 2x + 5 = 13"}],
    max_tokens=1024
)

# Access reasoning separately
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

Thinking Budget

Control the reasoning token budget with a custom logit processor:
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --reasoning-parser deepseek-r1 \
  --enable-custom-logit-processor
import openai
from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Question: Is Paris the capital of France?"}],
    max_tokens=1024,
    extra_body={
        "custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {"thinking_budget": 512},
    },
)

Function Calling

Enable tool calling for DeepSeek models:
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3-0324 \
  --tp 8 \
  --tool-call-parser deepseekv3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja

Example Request

curl "http://127.0.0.1:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0,
    "max_tokens": 100,
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "tools": [{
      "type": "function",
      "function": {
        "name": "query_weather",
        "description": "Get weather of a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "The city name"}
          },
          "required": ["city"]
        }
      }
    }],
    "messages": [{"role": "user", "content": "How is the weather in Beijing?"}]
  }'

Multimodal DeepSeek Models

DeepSeek-VL2

Vision-language model for image understanding:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/deepseek-vl2 \
  --tp 2 \
  --trust-remote-code

DeepSeek-OCR / OCR-2

Document understanding and text extraction:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR-2 \
  --trust-remote-code
Example request:
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "deepseek-ai/DeepSeek-OCR-2",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "<image>\n<|grounding|>Convert the document to markdown."},
                {"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}},
            ],
        }
    ],
    "max_tokens": 512,
}

response = requests.post(url, json=data)
print(response.text)

DeepSeek-Janus-Pro

Image understanding & generation:
python3 -m sglang.launch_server \
  --model-path deepseek-ai/Janus-Pro-7B \
  --trust-remote-code

Platform-Specific Deployment

AMD GPUs (MI300X)

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --attention-backend triton \
  --trust-remote-code
See: AMD GPU Guide

CPU (Xeon 6980P)

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 \
  --device cpu \
  --trust-remote-code

Ascend NPU (Atlas 800I A3)

See: Ascend NPU Guide

Quantization Options

INT8 Quantization

python3 -m sglang.launch_server \
  --model-path meituan/DeepSeek-R1-Channel-INT8 \
  --quantization int8 \
  --tp 16

AWQ Quantization

python3 -m sglang.launch_server \
  --model-path QuixiAI/DeepSeek-R1-0528-AWQ \
  --quantization awq \
  --tp 8

W4A8 Quantization

python3 -m sglang.launch_server \
  --model-path novita/Deepseek-R1-0528-W4AFP8 \
  --tp 8

Performance Tips

Download Weights First

Ensure weights are fully downloaded before starting:
huggingface-cli download deepseek-ai/DeepSeek-R1

Increase Timeout for Large Models

--dist-timeout 3600  # 1 hour timeout

Parallel Weight Loading

--model-loader-extra-config '{"enable_multithread_load": true}'

Memory Optimization

--mem-fraction-static 0.9  # Adjust based on available memory
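As a rough illustration of what this fraction budgets, here is a back-of-envelope calculation for FP8 DeepSeek-R1 on 8×H200. All figures (~671B parameters at 1 byte each, 141 GB of HBM per GPU) are illustrative assumptions, not measurements:

```python
# Back-of-envelope GPU memory budget for DeepSeek-R1 FP8 on 8xH200.
gpu_mem_gb = 141          # H200 HBM capacity (approximate)
num_gpus = 8
params_b = 671            # total parameters, in billions
bytes_per_param = 1       # FP8

weights_per_gpu = params_b * bytes_per_param / num_gpus  # ~83.9 GB
static_pool = gpu_mem_gb * 0.9                           # --mem-fraction-static 0.9
kv_budget = static_pool - weights_per_gpu                # left for the KV cache pool

print(f"weights/GPU: {weights_per_gpu:.1f} GB")  # ~83.9 GB
print(f"KV budget:   {kv_budget:.1f} GB")        # ~43.0 GB
```

If the remaining 10% is too small for activations and CUDA graphs, the server can run out of memory; lowering --mem-fraction-static trades KV cache capacity for that headroom.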

Troubleshooting

NCCL Timeout During Loading

Increase distributed timeout:
--dist-timeout 3600

Out of Memory

Reduce memory fraction:
--mem-fraction-static 0.85
Or use quantized models (INT8/AWQ/W4A8).

Slow First Request

Precompile DeepGEMM kernels:
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8