Qwen (通义千问) is Alibaba Cloud’s series of large language models and multimodal models, ranging from compact 0.6B models to massive 397B MoE architectures.

Overview

The Qwen family includes:
  • Qwen 3.5 - Latest generation with hybrid attention and MoE
  • Qwen 3 - Dense and MoE variants with reasoning capabilities
  • Qwen 2.5 - Previous generation, highly capable
  • Qwen 2 - Foundation models
  • Qwen-VL - Vision-language multimodal models
  • Qwen-Audio - Audio-enabled models

Quick Start

Basic Dense Model

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-0.6B-Instruct \
  --host 0.0.0.0 \
  --port 30000
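
Once the server is up, it speaks the OpenAI-compatible API. A minimal sketch of a request against the launch command above (the `build_chat_payload` helper is illustrative; `chat` needs the server actually running):

```python
import json
import urllib.request

BASE = "http://localhost:30000"  # matches the --port in the launch command above

def build_chat_payload(prompt: str, max_tokens: int = 64) -> dict:
    """OpenAI-style chat payload for the /v1/chat/completions endpoint."""
    return {
        "model": "Qwen/Qwen3-0.6B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply (requires a running server)."""
    req = urllib.request.Request(
        f"{BASE}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (with the server running):
# print(chat("Hello!"))
print(build_chat_payload("Hello!")["messages"][0]["role"])  # → user
```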

Large MoE Model (Qwen 3.5)

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code

Qwen 3.5 Architecture

Qwen 3.5 features cutting-edge architectural innovations:

Key Features

  • Hybrid Attention: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer
  • MoE with Shared Experts: Top-8 active out of 64 routed experts plus a dedicated shared expert
  • Multimodal: DeepStack Vision Transformer with Conv3d for native image and video understanding
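
The routed-expert selection above can be illustrated with a toy sketch (plain Python, illustrative only; the real router operates on learned gate logits per token, and the shared expert runs for every token in addition to the routed top-8):

```python
import math
import random

def topk_route(gate_logits, k=8):
    """Select the top-k experts for one token and softmax-normalize their gate weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(64)]  # one token's scores over 64 routed experts
routing = topk_route(logits)                          # only 8 of the 64 experts fire for this token
print(len(routing), round(sum(w for _, w in routing), 6))  # → 8 1.0
```

Only the selected experts' weights are touched per token, which is why a 397B-parameter model can run with just 17B active parameters.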

AMD GPU Support (MI300X / MI325X / MI35X)

On AMD Instinct GPUs, use the Triton attention backend:
SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --attention-backend triton \
  --trust-remote-code
Tip: Set SGLANG_USE_AITER=1 to enable AMD’s optimized aiter kernels for MoE and GEMM operations.

Configuration Tips for Large Models

# --watchdog-timeout: allow extra time for large model weight loading
# enable_multithread_load: load weights with multiple threads in parallel
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code \
  --watchdog-timeout 1200 \
  --model-loader-extra-config '{"enable_multithread_load": true}'

Qwen 3 Models

Qwen 3 offers a range of sizes from 0.6B to 235B (MoE):

Available Models

Model            Parameters              Type   Use Case
Qwen3-0.6B       0.6B                    Dense  Edge/mobile devices
Qwen3-1.7B       1.7B                    Dense  Lightweight deployment
Qwen3-4B         4B                      Dense  Balanced performance
Qwen3-8B         8B                      Dense  General purpose
Qwen3-14B        14B                     Dense  Advanced tasks
Qwen3-32B        32B                     Dense  Demanding workloads
Qwen3-30B-A3B    30B total, 3B active    MoE    Efficient large model
Qwen3-235B-A22B  235B total, 22B active  MoE    Largest Qwen 3

Launch Examples

# Lightweight model (0.6B)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-0.6B-Instruct \
  --port 30000

# Mid-size model (8B)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B-Instruct \
  --port 30000

# MoE model (30B total, 3B active)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-30B-A3B-Instruct \
  --tp 2 \
  --trust-remote-code

Reasoning and Tool Calling

Qwen models support advanced reasoning and tool calling capabilities:

Enable Reasoning Parser

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder

Using Reasoning in Requests

With the reasoning parser enabled, the model can separate reasoning tokens from the final answer:
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="-")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=512
)

# Access reasoning content separately
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
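
With `--tool-call-parser` enabled, requests can also advertise tools in the standard OpenAI format. A sketch (the `get_weather` tool here is hypothetical, purely for illustration):

```python
def build_tool_call_request(question: str) -> dict:
    """Chat payload advertising one (hypothetical) tool the model may call."""
    return {
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for illustration
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "max_tokens": 256,
    }

# POST this payload to /v1/chat/completions; when the model decides to call a
# tool, the parsed call arrives in choices[0].message.tool_calls with
# JSON-encoded arguments.
payload = build_tool_call_request("What's the weather in Paris?")
print(payload["tools"][0]["function"]["name"])  # → get_weather
```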

Qwen 2.5 & Qwen 2 Models

Previous generation Qwen models are also fully supported:
# Qwen 2.5 models
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --port 30000

# Qwen 2 MoE
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2-57B-A14B-Instruct \
  --tp 4 \
  --port 30000

Qwen-VL (Vision-Language Models)

Qwen-VL models process both images and text. See the Multimodal Models guide for complete details.

Quick Launch

# Qwen3-VL (latest)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --tp 2 \
  --ep 2 \
  --host 0.0.0.0 \
  --port 30000

# Qwen2.5-VL
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --port 30000

FP8 Mode (Memory Efficient)

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tp 8 \
  --ep 8 \
  --keep-mm-feature-on-device

Image Request Example

import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.json())
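
Remote URLs are not the only option: a local image can be inlined in the same `image_url` field as a base64 data URL. A minimal stdlib helper:

```python
import base64

def to_data_url(data: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in image_url.url."""
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")

# Typical use with a local file:
# with open("photo.jpg", "rb") as f:
#     data["messages"][0]["content"][1]["image_url"]["url"] = to_data_url(f.read())

demo = to_data_url(b"\xff\xd8\xff\xe0")  # fake JPEG header bytes, for illustration
print(demo)  # → data:image/jpeg;base64,/9j/4A==
```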

Video Input Support

import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/video.mp4"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print(response.json())

Qwen-Audio Models

Qwen2-Audio processes audio input alongside text:
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2-Audio-7B-Instruct \
  --port 30000
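
Audio requests mirror the image and video requests shown above. A sketch assuming an `audio_url` content type (check the SGLang multimodal docs for the exact schema your version accepts):

```python
def build_audio_request(question: str, audio_url: str) -> dict:
    """Chat payload pairing a text question with an audio clip
    (assumes an audio_url content part, by analogy with image_url/video_url)."""
    return {
        "model": "Qwen/Qwen2-Audio-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ],
        "max_tokens": 300,
    }

# POST to http://localhost:30000/v1/chat/completions as in the examples above.
payload = build_audio_request("What is said in this clip?", "https://example.com/clip.wav")
print(payload["messages"][0]["content"][1]["type"])  # → audio_url
```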

Qwen Classification & Reward Models

SGLang supports specialized Qwen variants:

Classification Models

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2-7B-Classification \
  --port 30000

Reward Models

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2-7B-Reward \
  --port 30000
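
Reward models score a full conversation rather than generate text. A payload sketch, assuming a `/classify`-style scoring endpoint that takes chat-template-rendered text (the endpoint path and field names are assumptions; consult the SGLang server docs for the exact schema):

```python
def build_reward_request(conversation_text: str) -> dict:
    """Payload sketch for scoring a rendered conversation with a reward model."""
    return {
        "model": "Qwen/Qwen2-7B-Reward",
        "text": conversation_text,  # conversation rendered with the model's chat template
    }

# POST to e.g. http://localhost:30000/classify; the scalar reward for the
# conversation comes back in the response body.
payload = build_reward_request(
    "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\nHello!<|im_end|>"
)
print(payload["model"])  # → Qwen/Qwen2-7B-Reward
```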

Qwen3-Omni (Omnimodal)

Qwen3-Omni is an omni-modal MoE model supporting text, images, audio, and video:
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --tp 2 \
  --ep 2 \
  --port 30000
Note: Currently supports the Thinker component (multimodal understanding) only. Audio generation (Talker) is not yet supported.

Performance Optimization

Expert Parallelism (EP)

For large MoE models, use expert parallelism:
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct \
  --tp 8 \
  --ep 8 \
  --trust-remote-code

Quantization

Reduce memory usage with quantization:
# FP8 quantization
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B-Instruct \
  --quantization fp8 \
  --port 30000

# AWQ quantization
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B-Instruct-AWQ \
  --quantization awq \
  --port 30000

Chunked Prefill

For long-context scenarios:
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B-Instruct \
  --chunked-prefill-size 8192 \
  --port 30000

Accuracy Evaluation

Evaluate model accuracy using lm-eval:
pip install "lm-eval[api]"

lm_eval --model local-completions \
  --model_args '{"base_url": "http://localhost:30000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
  --tasks gsm8k \
  --batch_size auto \
  --num_fewshot 5 \
  --trust_remote_code

Supported Qwen Architectures

SGLang supports the following Qwen model architectures:
  • Qwen3ForCausalLM - Qwen 3 dense models
  • Qwen3_5ForCausalLM - Qwen 3.5 dense models
  • Qwen3NextForCausalLM - Qwen 3 Next generation
  • Qwen3MoeForCausalLM - Qwen 3 MoE models
  • Qwen3OmniMoeForCausalLM - Qwen 3 Omni models
  • Qwen2ForCausalLM - Qwen 2 dense models
  • Qwen2MoeForCausalLM - Qwen 2 MoE models
  • Qwen2_5_VLForConditionalGeneration - Qwen 2.5 VL
  • Qwen3VLForConditionalGeneration - Qwen 3 VL
  • Qwen3VLMoeForConditionalGeneration - Qwen 3 VL MoE
  • Qwen2AudioForConditionalGeneration - Qwen 2 Audio
  • Qwen2ForSequenceClassification - Classification
  • Qwen3ForSequenceClassification - Classification
  • Qwen2ForRewardModel - Reward models
  • Qwen3ForRewardModel - Reward models

Troubleshooting

Large Model Loading Timeout

Increase watchdog timeout:
--watchdog-timeout 1200  # 20 minutes

Memory Issues with MoE

Adjust memory fraction:
--mem-fraction-static 0.85  # Reduce from default 0.9

AMD GPU Specific

Ensure AITER is enabled:
SGLANG_USE_AITER=1 python3 -m sglang.launch_server ...