
Chat-Aligned Models

Qwen-Chat models are fine-tuned versions of the base Qwen models, aligned with human preferences for conversational interactions. These models are optimized for chatbot applications, content generation, and interactive AI assistants.

Overview

Chat models are built on top of base models through supervised fine-tuning (SFT) using the ChatML format:
  • Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, Qwen-72B-Chat
  • Aligned with human intent through curated instruction data
  • Enhanced safety and service-oriented capabilities
  • Support for tool usage, code interpretation, and agent behavior

Fine-tuning Process

Training Data

The alignment dataset includes three major categories:

General Capabilities

Covers broad capabilities for practical applications:
  • Writing: Content creation, story generation, copywriting
  • Question Answering: Factual queries, explanations, knowledge retrieval
  • Brainstorming & Planning: Idea generation, task planning
  • Content Understanding: Summarization, analysis, interpretation
  • Natural Language Processing: Text manipulation, extraction, transformation
  • Coding: Code generation, debugging, explanation

Safety

Prevents harmful and inappropriate content generation:
  • Refusal of harmful requests
  • Bias mitigation
  • Safety-aligned responses
  • Content filtering

Agent & Tool Use

Enables specific conversation patterns for external system integration:
  • Tool invocation protocols
  • API calling patterns
  • Search integration
  • Multi-step reasoning (ReAct)

ChatML Format

Conversations are formatted using ChatML, a meta language for structured dialogue:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, who are you?<|im_end|>
<|im_start|>assistant
I am a language model called Qwen, created by Alibaba Cloud.<|im_end|>
Roles:
  • system: Sets behavior and context
  • user: Human input
  • assistant: Model responses
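The turns above can be assembled programmatically. The helper below is a minimal illustrative sketch of the ChatML layout, not Qwen's official API (the model's tokenizer builds this prompt internally):

```python
def build_chatml(messages):
    """Format a list of {role, content} dicts into a ChatML prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who are you?"},
])
print(prompt)
```

Generation then continues from the open `<|im_start|>assistant` marker and stops at `<|im_end|>`.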

Training Configuration

  • Objective: Causal language modeling (user content tokens excluded from loss)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁶)
  • Sequence Length: 2048 tokens
  • Batch Size: 128
  • Training Steps: 4000
  • Learning Rate: Peak 1×10⁻⁵ with 1430-step warm-up
  • Regularization: Weight decay 0.1, dropout 0.1, gradient clipping 1.0
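As an illustration, the warm-up portion of the learning-rate schedule can be sketched in a few lines. The source specifies only the peak (1×10⁻⁵) and the warm-up length (1430 steps); the linear ramp and the post-warm-up hold below are assumptions:

```python
PEAK_LR = 1e-5
WARMUP_STEPS = 1430

def lr_at(step):
    """Linear warm-up to the peak, then hold (decay schedule assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR

print(lr_at(715))   # halfway through warm-up: 5e-06
print(lr_at(2000))  # past warm-up: 1e-05
```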

Benchmark Performance

Chinese Language Understanding

C-Eval (Zero-shot, generative) - Validation set:
| Model | Average Accuracy |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
C-Eval Test Set (Zero-shot):
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-7B-Chat | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |

English Language Understanding

MMLU (Zero-shot):
| Model | Average Accuracy |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |

Coding

HumanEval (Zero-shot Pass@1):
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |

Mathematical Reasoning

GSM8K (Math word problems):
| Model | Zero-shot | 4-shot |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |

Tool Usage

Qwen-Chat excels at tool invocation through ReAct prompting.

Custom Tool Usage Benchmark:

| Model | Tool Selection (Acc.) | Tool Input (Rouge-L) | False Positive Rate |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B-Chat | 99% | 0.89 | 9.7% |
Evaluation plugins do not appear in Qwen’s training data, demonstrating genuine generalization.
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B-Chat | 90.74 | 92.59 | 74.07 |

Core Capabilities

Conversational AI

Qwen-Chat models excel at multi-turn conversations with context awareness:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# First turn
response, history = model.chat(tokenizer, "你好", history=None)  # "Hello"
print(response)
# Output: 你好!很高兴为你提供帮助。 ("Hello! Happy to help you.")

# Second turn (with history)
response, history = model.chat(
    tokenizer,
    # "Tell me a story about a young person who worked hard, started a
    # business, and ultimately succeeded."
    "给我讲一个年轻人奋斗创业最终取得成功的故事。",
    history=history
)
print(response)
# Output: 这是一个关于一个年轻人奋斗创业最终取得成功的故事...
# ("This is a story about a young person who worked hard, started a business,
#  and ultimately succeeded...")

# Third turn
response, history = model.chat(
    tokenizer,
    "给这个故事起一个标题",  # "Give this story a title"
    history=history
)
print(response)
# Output: 《奋斗创业:一个年轻人的成功之路》
# ("Striving to Build a Business: A Young Person's Road to Success")

Tool Integration

Qwen-Chat supports tool usage through ReAct prompting. The model reasons about which tools to use and generates the corresponding calls:
# Example: Using search and calculator tools
prompt = """
Answer the following questions as best you can. You have access to the following tools:

search: useful for searching information
calculator: useful for mathematical calculations

Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of [search, calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: What is the population of Tokyo multiplied by 2?
"""

response, _ = model.chat(tokenizer, prompt, history=None)
# Model will generate structured tool calls

Code Interpretation

Chat models can generate, explain, and debug code:
response, _ = model.chat(
    tokenizer,
    "Write a Python function to calculate the Fibonacci sequence",
    history=None
)
# Model generates complete, working Python code

System Prompt Enhancement

Qwen-1.8B-Chat and Qwen-72B-Chat have strengthened system prompt capabilities:
# Using a system prompt for behavior control: the system message is passed
# via the `system` keyword of model.chat, not as a messages list
response, _ = model.chat(
    tokenizer,
    "Tell me about the weather",
    history=None,
    system="You are a helpful assistant that speaks like a pirate."
)
# Model will respond in pirate-speak style

Quantized Variants

Chat models are available in quantized formats (GPTQ Int8 and Int4) for efficient deployment.

Performance Comparison

Qwen-7B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
| Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
| Int4 | 55.1 | 59.2 | 49.7 | 29.9 |

Qwen-14B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
| Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
| Int4 | 63.3 | 69.0 | 59.8 | 45.7 |

Qwen-72B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Int4 | 73.4 | 80.1 | 75.3 | 61.6 |
Quantization causes minimal performance degradation while significantly reducing memory requirements.
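To put a number on "minimal degradation": taking the Qwen-7B-Chat MMLU scores from the table above, the BF16-to-Int4 drop works out to roughly 1.25% in relative terms:

```python
# Qwen-7B-Chat MMLU scores from the table above
bf16_mmlu, int4_mmlu = 55.8, 55.1

relative_drop = (bf16_mmlu - int4_mmlu) / bf16_mmlu * 100
print(f"{relative_drop:.2f}% relative drop")  # 1.25% relative drop
```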

Batch Inference

Chat models support batch inference for improved throughput:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from qwen_generation_utils import make_context, decode_tokens

tokenizer = AutoTokenizer.from_pretrained(
    'Qwen/Qwen-7B-Chat',
    pad_token='<|extra_0|>',
    eos_token='<|endoftext|>',
    padding_side='left',
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen-7B-Chat',
    pad_token_id=tokenizer.pad_token_id,
    device_map="auto",
    trust_remote_code=True
).eval()

questions = [
    "What is the capital of France?",
    "How do I make pancakes?",
    "Explain quantum computing"
]

# Prepare batch
batch_raw_text = []
for q in questions:
    raw_text, _ = make_context(
        tokenizer, q,
        system="You are a helpful assistant.",
        max_window_size=model.generation_config.max_window_size,
        chat_format=model.generation_config.chat_format,
    )
    batch_raw_text.append(raw_text)

# Batch generation
batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)

batch_out_ids = model.generate(
    batch_input_ids,
    generation_config=model.generation_config
)

# Decode responses with decode_tokens, skipping left padding and the prompt
padding_lens = [
    batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item()
    for i in range(batch_input_ids.size(0))
]
for i in range(len(questions)):
    response = decode_tokens(
        batch_out_ids[i][padding_lens[i]:],
        tokenizer,
        raw_text_len=len(batch_raw_text[i]),
        context_length=(batch_input_ids[i].size(0) - padding_lens[i]),
        chat_format="chatml",
        verbose=False,
        errors='replace',
    )
    print(f"Q{i+1}: {response}")
With Flash Attention enabled, batch inference provides ~40% speedup over sequential processing.

Streaming Responses

Chat models support streaming for real-time response generation:
# Using chat_stream for incremental generation.
# chat_stream yields the cumulative response so far, so print only the new suffix.
printed = 0
for response in model.chat_stream(
    tokenizer,
    "Tell me a story",
    history=None
):
    print(response[printed:], end="", flush=True)
    printed = len(response)
print()

Hardware Requirements

Inference Memory (Generating 2048 tokens)

| Precision | GPU Memory | Speed (tokens/s) |
|---|---|---|
| BF16 | 4.23 GB | 54.09 |
| Int8 | 3.48 GB | 55.56 |
| Int4 | 2.91 GB | 71.07 |
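As a back-of-the-envelope check, the speeds above translate into rough wall-clock times for a full 2048-token generation (decode only; prefill time is ignored):

```python
# Decode speeds (tokens/s) from the table above
speeds = {"BF16": 54.09, "Int8": 55.56, "Int4": 71.07}

gen_seconds = {precision: 2048 / tps for precision, tps in speeds.items()}
for precision, seconds in gen_seconds.items():
    print(f"{precision}: ~{seconds:.1f} s to generate 2048 tokens")
```

Int4 is the fastest here as well as the smallest, roughly a 24% reduction in wall-clock time versus BF16.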

Fine-tuning Memory (Q-LoRA, batch_size=1, gradient_accumulation=8)

| Model Size | Min GPU Memory |
|---|---|
| 1.8B | 5.8 GB |
| 7B | 11.5 GB |
| 14B | 18.7 GB |
| 72B | 61.4 GB |

Model Downloads

Qwen-1.8B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-7B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-14B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-72B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Safety Considerations

While Qwen-Chat models include safety alignment, they may still generate inappropriate content in some cases. Developers should:
  • Perform red teaming before deployment
  • Implement content filtering for production use
  • Monitor outputs for harmful content
  • Comply with local regulations and policies

Next Steps

Model Selection

Choose the right chat model for your needs

Tool Usage

Learn to integrate external tools

Fine-tuning Chat

Customize chat models for your domain

Deployment

Deploy chat models to production
