DeepSeek is a series of advanced reasoning-optimized models featuring Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) architectures. SGLang provides extensive optimizations specifically designed for DeepSeek models.
SGLang is the inference engine officially recommended by the DeepSeek team for DeepSeek-V3/R1.
Overview
Supported DeepSeek Models
- DeepSeek R1 (0528, 0730) - Latest reasoning models with RL
- DeepSeek V3.1/V3 - 671B MoE (37B active) with MLA
- DeepSeek V2 - Previous generation MLA+MoE
- DeepSeek-VL2 - Vision-language model
- DeepSeek-OCR / OCR-2 - Document understanding
- DeepSeek-Janus-Pro - Image understanding & generation
Key Features
- Multi-head Latent Attention (MLA): Compressed KV cache for efficiency
- Mixture-of-Experts (MoE): 671B total, 37B active parameters
- FP8 Native: Official models already in FP8 format
- Advanced Reasoning: Trained with reinforcement learning
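The compressed KV cache is what makes serving a 671B-parameter model practical. A back-of-the-envelope sketch of the savings, using values from the public DeepSeek-V3 config (num_hidden_layers=61, num_attention_heads=128, head_dim=128, kv_lora_rank=512, qk_rope_head_dim=64); treat the numbers as illustrative:

```python
# Back-of-the-envelope KV-cache comparison: standard MHA vs MLA.
# Config values below are from the public DeepSeek-V3 config; the
# comparison counts cached elements per token, ignoring dtype width.

num_layers = 61
num_heads = 128
head_dim = 128          # qk_nope_head_dim / v_head_dim
kv_lora_rank = 512      # dimension of the compressed latent c_KV
qk_rope_head_dim = 64   # decoupled RoPE key, cached alongside the latent

# Elements cached per token, summed over layers:
mha_per_token = num_layers * 2 * num_heads * head_dim          # full K and V
mla_per_token = num_layers * (kv_lora_rank + qk_rope_head_dim)  # latent + RoPE key

ratio = mha_per_token / mla_per_token
print(f"MHA: {mha_per_token} elements/token")
print(f"MLA: {mla_per_token} elements/token")
print(f"Compression: {ratio:.1f}x")   # roughly 57x fewer cached elements
```

This ratio is why MLA models can sustain much larger batch sizes at a given memory budget.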
Quick Start
Single Node (8×H200)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--trust-remote-code
Multi-Node Example (2×8 H100)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 16 \
--trust-remote-code
Hardware Requirements
Recommended configurations for DeepSeek V3/R1:
| Weight Type | Hardware Configuration |
|---|---|
| FP8 (recommended) | 8×H200, 8×B200, 8×MI300X, 2×8×H100/H800/H20, Xeon 6980P CPU |
| BF16 (upcast) | 2×8×H200, 2×8×MI300X, 4×8×H100/H800/H20, 4×8×A100/A800 |
| INT8 Quantized | 16×A100/A800, 32×L40S, 4×Atlas 800I A3 |
| W4A8 Quantized | 8×H20/H100, 4×H200 |
| AWQ Quantized | 8×H100/H800/H20, 8×A100/A800 |
| MXFP4 | 8×MI355X/350X, 4×MI355X/350X |
| NVFP4 | 8×B200, 4×B200 |
The official DeepSeek V3/R1 models are already in FP8 format. Do NOT use --quantization fp8 when loading them.
SGLang Optimizations for DeepSeek
SGLang provides several model-specific optimizations:
1. Multi-head Latent Attention (MLA)
Description: MLA compresses KV cache for improved efficiency. SGLang implements:
- Weight Absorption: Reordered computation for balanced memory access
- Multiple MLA Backends: FlashAttention3, Flashinfer, FlashMLA, CutlassMLA, TRTLLM MLA (Blackwell), Triton
- FP8 Quantization: W8A8 FP8 and KV Cache FP8
- CUDA Graph & Torch.compile: Reduced latency for small batches
- Chunked Prefix Cache: Long sequence optimization (FlashAttention3 only)
These optimizations achieve up to 7× higher output throughput.
Usage: MLA optimization is enabled by default.
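The weight-absorption trick above can be sketched in a few lines. The shapes here are illustrative, not the real model's: instead of up-projecting every cached latent back to a full key before scoring, the up-projection matrix is folded into the query once per step, since q·(W_uk c) = (W_uk^T q)·c.

```python
import numpy as np

# Sketch of MLA weight absorption (toy shapes, not the real model).
# Naive path: decompress each cached latent c to a key k = W_uk @ c,
# then score against the query. Absorbed path: fold W_uk into the query
# once, then score directly against the cached latents.

rng = np.random.default_rng(0)
d_head, d_latent = 128, 512
W_uk = rng.standard_normal((d_head, d_latent))  # latent -> key up-projection
q = rng.standard_normal(d_head)                 # query vector
c = rng.standard_normal(d_latent)               # compressed KV latent from cache

naive = q @ (W_uk @ c)        # per-token decompression, then dot product
absorbed = (W_uk.T @ q) @ c   # one projection of q, reused for every token

assert np.allclose(naive, absorbed)
print("attention scores match:", naive)
```

The absorbed form avoids materializing full keys for every cached token, which is where the memory-access savings come from.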
2. Data Parallelism Attention (DP Attention)
Description: Reduces KV cache size by distributing attention across DP workers, enabling larger batch sizes. KV cache is stored per DP rank instead of duplicating across all TP ranks.
Performance: Up to 1.9× throughput improvement in high batch size scenarios.
Usage:
# Single node with DP attention
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--enable-dp-attention \
--tp 8 \
--dp 8 \
--trust-remote-code
# Multi-node: 2 nodes, 8 H100 each
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--enable-dp-attention \
--tp 16 \
--dp 2 \
--trust-remote-code
DP attention is optimized for high-throughput scenarios with large batch sizes. Not recommended for low-latency, small-batch use cases.
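The capacity gain is easy to see with some toy arithmetic. MLA's latent cache has no head dimension to shard, so under pure TP every rank holds a full copy of every request's cache; with DP attention each request lives on one DP rank only. The memory budget and per-token cost below are made-up round numbers:

```python
# Illustrative KV-capacity math for DP attention with MLA.
# The budget and per-token cache size are assumptions for the sketch.

num_gpus = 8
kv_budget_per_gpu_gb = 60      # assumed free memory per GPU for KV cache
per_token_cache_mb = 0.035     # assumed MLA cache per token (FP8)

# Pure TP: the latent cache is duplicated on all ranks, so total
# capacity equals what a single GPU can hold.
tp_capacity_tokens = kv_budget_per_gpu_gb * 1024 / per_token_cache_mb

# DP attention (dp=8): each rank holds a disjoint shard of the requests.
dp_capacity_tokens = num_gpus * tp_capacity_tokens

print(f"pure TP : {tp_capacity_tokens:,.0f} cacheable tokens")
print(f"DP attn : {dp_capacity_tokens:,.0f} cacheable tokens ({num_gpus}x)")
```

More cacheable tokens translate directly into larger running batches, which is why the benefit shows up in high-throughput scenarios.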
3. Block-wise FP8 Quantization
Description: Optimized FP8 quantization with:
- Activation: E4M3 format with per-token-per-128-channel sub-vector scales
- Weight: Per-128×128-block quantization for numerical stability
- DeepGEMM: Kernel library optimized for FP8 matrix multiplications
Usage: Enabled by default on Hopper/Blackwell GPUs.
To precompile DeepGEMM kernels (recommended, ~10 minutes):
python3 -m sglang.compile_deep_gemm \
--model deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code
To disable DeepGEMM:
SGLANG_ENABLE_JIT_DEEPGEMM=0 python3 -m sglang.launch_server ...
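The per-128×128-block weight scaling can be sketched as follows. This only simulates the scale bookkeeping and E4M3 range clipping, not real FP8 bit packing or rounding:

```python
import numpy as np

# Sketch of per-128x128-block weight scaling, the idea behind block-wise
# FP8. One scale per block keeps an outlier in one block from crushing
# the dynamic range of every other block.

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 128

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 384)).astype(np.float32)

n_bi, n_bj = W.shape[0] // BLOCK, W.shape[1] // BLOCK
scales = np.zeros((n_bi, n_bj), np.float32)
W_q = np.zeros_like(W)
for i in range(n_bi):
    for j in range(n_bj):
        blk = W[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK]
        s = np.abs(blk).max() / E4M3_MAX     # per-block scale
        scales[i, j] = s
        W_q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = blk / s

# Dequantize; error is near zero here because we skip the actual
# FP8 mantissa rounding and only model the per-block scaling.
W_dq = np.zeros_like(W)
for i in range(n_bi):
    for j in range(n_bj):
        W_dq[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] = \
            W_q[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK] * scales[i, j]

print("max |quantized value|:", np.abs(W_q).max())   # within E4M3 range
print("max roundtrip error  :", np.abs(W - W_dq).max())
```

Activations use the same idea at finer granularity: per-token, per-128-channel sub-vector scales.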
4. Multi-token Prediction (MTP)
Description: EAGLE-based speculative decoding for DeepSeek models.
Performance:
- 1.8× speedup for batch size 1
- 1.5× speedup for batch size 32
Usage:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3-0324 \
--speculative-algorithm EAGLE \
--trust-remote-code \
--tp 8
Optional parameters (defaults shown):
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
For large batch sizes (>48), adjust:
--max-running-requests 64 \
--cuda-graph-bs 1,2,4,8,16,32,64
The first flag raises the default cap of 48 concurrent requests; the second customizes the CUDA graph batch sizes.
Enable the experimental overlap scheduler with SGLANG_ENABLE_SPEC_V2=1 for improved performance.
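The mechanism behind the speedup can be illustrated with a toy verification loop: the target model checks a block of draft tokens in one pass and keeps the longest correct prefix, so each step can emit more than one token. This is a greedy-acceptance sketch only; real EAGLE verification is tree-based and operates on logits:

```python
# Toy illustration of speculative-decoding verification.

def verify(draft, target_next, prefix):
    """Accept draft tokens greedily; append the target's own next token."""
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            return accepted + [expected]  # bonus token from the verify pass
        accepted.append(tok)
    return accepted + [target_next(prefix + accepted)]

# Hypothetical "target model": next token is (last token + 1) mod 10.
target_next = lambda seq: (seq[-1] + 1) % 10 if seq else 0

print(verify([1, 2, 3], target_next, prefix=[0]))  # all drafts correct -> 4 tokens
print(verify([1, 2, 9], target_next, prefix=[0]))  # third draft wrong  -> 3 tokens
```

Even on a rejection, the step still emits at least one correct token, so speculative decoding never produces wrong output, only variable speedup.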
5. Multi-Node Tensor Parallelism
Deploy DeepSeek across multiple nodes for models that don’t fit in single-node memory.
For a 2-node example, see the Multi-Node Example under Quick Start above.
Reasoning Content (DeepSeek R1 & V3.1)
DeepSeek R1 and V3.1 models can separate reasoning tokens from final answers.
Enable Reasoning Parser
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--reasoning-parser deepseek-r1 \
--trust-remote-code
Using Reasoning in Requests
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="-")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1",
messages=[{"role": "user", "content": "Solve: 2x + 5 = 13"}],
max_tokens=1024
)
# Access reasoning separately
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
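Under the hood, R1-style models emit their chain of thought between <think> and </think> tags before the final answer, and the reasoning parser splits the two apart. A simplified sketch of that split (the real parser also handles streaming and unclosed tags):

```python
import re

# Simplified sketch of what a deepseek-r1-style reasoning parser does:
# separate the <think>...</think> block from the final answer.

def split_reasoning(text: str):
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, text  # no reasoning block found

raw = "<think>2x = 13 - 5 = 8, so x = 4.</think>x = 4"
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)  # -> 2x = 13 - 5 = 8, so x = 4.
print("Answer:", answer)        # -> x = 4
```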
Thinking Budget
Control reasoning token budget with custom logit processors:
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--tp 8 \
--reasoning-parser deepseek-r1 \
--enable-custom-logit-processor
import openai
from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1",
messages=[{"role": "user", "content": "Question: Is Paris the capital of France?"}],
max_tokens=1024,
extra_body={
"custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
"custom_params": {"thinking_budget": 512},
},
)
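Conceptually, the budget processor caps how many tokens the model may spend inside the reasoning block before the end-of-thinking token is forced. A toy string-level version of that cutoff (illustrative only; the real processor operates on logits during sampling):

```python
# Toy thinking-budget cutoff: once the budget is spent inside the
# reasoning block, force the end-of-thinking sentinel so the answer begins.

END_THINK = "</think>"  # stand-in for the model's end-of-thinking token

def next_token(proposed, tokens_in_think, budget):
    if tokens_in_think >= budget:
        return END_THINK   # override whatever the model proposed
    return proposed

stream, in_think = [], 0
for proposed in ["step1", "step2", "step3", "step4"]:
    tok = next_token(proposed, in_think, budget=2)
    stream.append(tok)
    if tok == END_THINK:
        break
    in_think += 1

print(stream)  # -> ['step1', 'step2', '</think>']
```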
Function Calling
Enable tool calling for DeepSeek models:
python3 -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3-0324 \
--tp 8 \
--tool-call-parser deepseekv3 \
--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja
Example Request
curl "http://127.0.0.1:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"temperature": 0,
"max_tokens": 100,
"model": "deepseek-ai/DeepSeek-V3-0324",
"tools": [{
"type": "function",
"function": {
"name": "query_weather",
"description": "Get weather of a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "The city name"}
},
"required": ["city"]
}
}
}],
"messages": [{"role": "user", "content": "How is the weather in Beijing?"}]
}'
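On the client side, a tool-call response is handled by reading message.tool_calls, parsing each call's JSON arguments, and dispatching to a local implementation. The response dict below is a hand-written stand-in for what the server would return, in the OpenAI chat-completions shape:

```python
import json

# Sketch of dispatching tool calls from a chat-completions response.

def dispatch(message, tools):
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append(tools[fn["name"]](**args))
    return results

# Hypothetical local implementation of the advertised query_weather tool.
tools = {"query_weather": lambda city: f"Sunny in {city}"}

message = {  # stand-in for response["choices"][0]["message"]
    "tool_calls": [{
        "function": {"name": "query_weather",
                     "arguments": '{"city": "Beijing"}'}
    }]
}
print(dispatch(message, tools))  # -> ['Sunny in Beijing']
```

The tool results would then be appended to the conversation as "tool" role messages for a follow-up completion.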
Multimodal DeepSeek Models
DeepSeek-VL2
Vision-language model for image understanding:
python3 -m sglang.launch_server \
--model-path deepseek-ai/deepseek-vl2 \
--tp 2 \
--trust-remote-code
DeepSeek-OCR / OCR-2
Document understanding and text extraction:
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-OCR-2 \
--trust-remote-code
Example request:
import requests
url = "http://localhost:30000/v1/chat/completions"
data = {
"model": "deepseek-ai/DeepSeek-OCR-2",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "<image>\n<|grounding|>Convert the document to markdown."},
{"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}},
],
}
],
"max_tokens": 512,
}
response = requests.post(url, json=data)
print(response.text)
DeepSeek-Janus-Pro
Image understanding & generation:
python3 -m sglang.launch_server \
--model-path deepseek-ai/Janus-Pro-7B \
--trust-remote-code
AMD GPUs (MI300X)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--tp 8 \
--attention-backend triton \
--trust-remote-code
See: AMD GPU Guide
CPU (Xeon 6980P)
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1 \
--device cpu \
--trust-remote-code
Ascend NPU (Atlas 800I A3)
See: Ascend NPU Guide
Quantization Options
INT8 Quantization
python3 -m sglang.launch_server \
--model-path meituan/DeepSeek-R1-Channel-INT8 \
--quantization int8 \
--tp 16
AWQ Quantization
python3 -m sglang.launch_server \
--model-path QuixiAI/DeepSeek-R1-0528-AWQ \
--quantization awq \
--tp 8
W4A8 Quantization
python3 -m sglang.launch_server \
--model-path novita/Deepseek-R1-0528-W4AFP8 \
--tp 8
Download Weights First
Ensure weights are fully downloaded before starting:
huggingface-cli download deepseek-ai/DeepSeek-R1
Increase Timeout for Large Models
--dist-timeout 3600 # 1 hour timeout
Parallel Weight Loading
--model-loader-extra-config '{"enable_multithread_load": true}'
Memory Optimization
--mem-fraction-static 0.9 # Adjust based on available memory
Troubleshooting
NCCL Timeout During Loading
Increase the distributed timeout:
--dist-timeout 3600
Out of Memory
Reduce memory fraction:
--mem-fraction-static 0.85
Or use quantized models (INT8/AWQ/W4A8).
Slow First Request
Precompile DeepGEMM kernels:
python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8