
Chat-Aligned Models

Qwen-Chat models are fine-tuned versions of the base Qwen models, aligned with human preferences for conversational interactions. These models are optimized for chatbot applications, content generation, and interactive AI assistants.

Overview

Chat models are built on top of base models through supervised fine-tuning (SFT) using the ChatML format:
  • Qwen-1.8B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, Qwen-72B-Chat
  • Aligned with human intent through curated instruction data
  • Enhanced safety and service-oriented capabilities
  • Support for tool usage, code interpretation, and agent behavior

Fine-tuning Process

Training Data

The alignment dataset includes three major categories:

General Capabilities

Covers broad capabilities for practical applications:
  • Writing: Content creation, story generation, copywriting
  • Question Answering: Factual queries, explanations, knowledge retrieval
  • Brainstorming & Planning: Idea generation, task planning
  • Content Understanding: Summarization, analysis, interpretation
  • Natural Language Processing: Text manipulation, extraction, transformation
  • Coding: Code generation, debugging, explanation

Safety

Prevents harmful and inappropriate content generation:
  • Refusal of harmful requests
  • Bias mitigation
  • Safety-aligned responses
  • Content filtering

Agent & Tool Use

Enables specific conversation patterns for external system integration:
  • Tool invocation protocols
  • API calling patterns
  • Search integration
  • Multi-step reasoning (ReAct)

ChatML Format

Conversations are formatted using ChatML, a meta language for structured dialogue:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, who are you?<|im_end|>
<|im_start|>assistant
I am a language model called Qwen, created by Alibaba Cloud.<|im_end|>
Roles:
  • system: Sets behavior and context
  • user: Human input
  • assistant: Model responses
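The turns above can be assembled programmatically. The helper below is a minimal illustrative sketch of the ChatML layout, not Qwen's official API (the model's tokenizer builds this prompt internally):

```python
def build_chatml(messages):
    """Format a list of {role, content} dicts into a ChatML prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who are you?"},
])
print(prompt)
```

Generation then continues from the open `<|im_start|>assistant` marker and stops at `<|im_end|>`.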

Training Configuration

  • Objective: Causal language modeling (user content tokens excluded from loss)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁶)
  • Sequence Length: 2048 tokens
  • Batch Size: 128
  • Training Steps: 4000
  • Learning Rate: Peak 1×10⁻⁵ with 1430-step warm-up
  • Regularization: Weight decay 0.1, dropout 0.1, gradient clipping 1.0
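As an illustration, the warm-up portion of the learning-rate schedule can be sketched in a few lines. The source specifies only the peak (1×10⁻⁵) and the warm-up length (1430 steps); the linear ramp and the post-warm-up hold below are assumptions:

```python
PEAK_LR = 1e-5
WARMUP_STEPS = 1430

def lr_at(step):
    """Linear warm-up to the peak, then hold (decay schedule assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR

print(lr_at(715))   # halfway through warm-up: 5e-06
print(lr_at(2000))  # past warm-up: 1e-05
```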

Benchmark Performance

Chinese Language Understanding

C-Eval (Zero-shot, generative) - Validation set:
| Model | Average Accuracy |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
C-Eval Test Set (Zero-shot):
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-7B-Chat | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |

English Language Understanding

MMLU (Zero-shot):
| Model | Average Accuracy |
|---|---|
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |

Coding

HumanEval (Zero-shot Pass@1):
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |

Mathematical Reasoning

GSM8K (Math word problems):
| Model | Zero-shot | 4-shot |
|---|---|---|
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |

Tool Usage

Qwen-Chat excels at tool invocation through ReAct prompting.

Custom Tool Usage Benchmark:

| Model | Tool Selection (Acc.) | Tool Input (Rouge-L) | False Positive Rate |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B-Chat | 99% | 0.89 | 9.7% |
Evaluation plugins do not appear in Qwen’s training data, demonstrating genuine generalization.
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
|---|---|---|---|
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B-Chat | 90.74 | 92.59 | 74.07 |

Core Capabilities

Conversational AI

Qwen-Chat models excel at multi-turn conversations with context awareness:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# First turn
response, history = model.chat(tokenizer, "你好", history=None)  # "Hello"
print(response)
# Output: 你好!很高兴为你提供帮助。 ("Hello! Happy to help you.")

# Second turn (with history)
response, history = model.chat(
    tokenizer,
    # "Tell me a story about a young person who worked hard, started a
    # business, and ultimately succeeded."
    "给我讲一个年轻人奋斗创业最终取得成功的故事。",
    history=history
)
print(response)
# Output: 这是一个关于一个年轻人奋斗创业最终取得成功的故事...
# ("This is a story about a young person who worked hard, started a business,
#  and ultimately succeeded...")

# Third turn
response, history = model.chat(
    tokenizer,
    "给这个故事起一个标题",  # "Give this story a title"
    history=history
)
print(response)
# Output: 《奋斗创业:一个年轻人的成功之路》
# ("Striving to Build a Business: A Young Person's Road to Success")

Tool Integration

Qwen-Chat supports tool usage through ReAct prompting. The model reasons about which tools to use and generates the corresponding calls:
# Example: Using search and calculator tools
prompt = """
Answer the following questions as best you can. You have access to the following tools:

search: useful for searching information
calculator: useful for mathematical calculations

Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of [search, calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: What is the population of Tokyo multiplied by 2?
"""

response, _ = model.chat(tokenizer, prompt, history=None)
# Model will generate structured tool calls

Code Interpretation

Chat models can generate, explain, and debug code:
response, _ = model.chat(
    tokenizer,
    "Write a Python function to calculate the Fibonacci sequence",
    history=None
)
# Model generates complete, working Python code

System Prompt Enhancement

Qwen-1.8B-Chat and Qwen-72B-Chat have strengthened system prompt capabilities:
# Using a system prompt for behavior control: the system message is passed
# via the `system` keyword of model.chat, not as a messages list
response, _ = model.chat(
    tokenizer,
    "Tell me about the weather",
    history=None,
    system="You are a helpful assistant that speaks like a pirate."
)
# Model will respond in pirate-speak style

Quantized Variants

Chat models are available in quantized formats (GPTQ Int8 and Int4) for efficient deployment.

Performance Comparison

Qwen-7B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
| Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
| Int4 | 55.1 | 59.2 | 49.7 | 29.9 |

Qwen-14B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 64.6 | 69.8 | 60.1 | 43.9 |
| Int8 | 63.6 | 68.6 | 60.0 | 48.2 |
| Int4 | 63.3 | 69.0 | 59.8 | 45.7 |

Qwen-72B-Chat:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Int4 | 73.4 | 80.1 | 75.3 | 61.6 |
Quantization causes minimal performance degradation while significantly reducing memory requirements.
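To put a number on "minimal degradation": taking the Qwen-7B-Chat MMLU scores from the table above, the BF16-to-Int4 drop works out to roughly 1.25% in relative terms:

```python
# Qwen-7B-Chat MMLU scores from the table above
bf16_mmlu, int4_mmlu = 55.8, 55.1

relative_drop = (bf16_mmlu - int4_mmlu) / bf16_mmlu * 100
print(f"{relative_drop:.2f}% relative drop")  # 1.25% relative drop
```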

Batch Inference

Chat models support batch inference for improved throughput:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from qwen_generation_utils import make_context, decode_tokens

tokenizer = AutoTokenizer.from_pretrained(
    'Qwen/Qwen-7B-Chat',
    pad_token='<|extra_0|>',
    eos_token='<|endoftext|>',
    padding_side='left',
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen-7B-Chat',
    pad_token_id=tokenizer.pad_token_id,
    device_map="auto",
    trust_remote_code=True
).eval()

questions = [
    "What is the capital of France?",
    "How do I make pancakes?",
    "Explain quantum computing"
]

# Prepare batch
batch_raw_text = []
for q in questions:
    raw_text, _ = make_context(
        tokenizer, q,
        system="You are a helpful assistant.",
        max_window_size=model.generation_config.max_window_size,
        chat_format=model.generation_config.chat_format,
    )
    batch_raw_text.append(raw_text)

# Batch generation
batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)

batch_out_ids = model.generate(
    batch_input_ids,
    generation_config=model.generation_config
)

# Decode responses with decode_tokens, skipping left padding and the prompt
padding_lens = [
    batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item()
    for i in range(batch_input_ids.size(0))
]
for i in range(len(questions)):
    response = decode_tokens(
        batch_out_ids[i][padding_lens[i]:],
        tokenizer,
        raw_text_len=len(batch_raw_text[i]),
        context_length=(batch_input_ids[i].size(0) - padding_lens[i]),
        chat_format="chatml",
        verbose=False,
        errors='replace',
    )
    print(f"Q{i+1}: {response}")
With Flash Attention enabled, batch inference provides ~40% speedup over sequential processing.

Streaming Responses

Chat models support streaming for real-time response generation:
# Using chat_stream for incremental generation.
# chat_stream yields the cumulative response so far, so print only the new suffix.
printed = 0
for response in model.chat_stream(
    tokenizer,
    "Tell me a story",
    history=None
):
    print(response[printed:], end="", flush=True)
    printed = len(response)
print()

Hardware Requirements

Inference Memory (Generating 2048 tokens)

| Precision | GPU Memory | Speed (tokens/s) |
|---|---|---|
| BF16 | 4.23 GB | 54.09 |
| Int8 | 3.48 GB | 55.56 |
| Int4 | 2.91 GB | 71.07 |
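As a back-of-the-envelope check, the speeds above translate into rough wall-clock times for a full 2048-token generation (decode only; prefill time is ignored):

```python
# Decode speeds (tokens/s) from the table above
speeds = {"BF16": 54.09, "Int8": 55.56, "Int4": 71.07}

gen_seconds = {precision: 2048 / tps for precision, tps in speeds.items()}
for precision, seconds in gen_seconds.items():
    print(f"{precision}: ~{seconds:.1f} s to generate 2048 tokens")
```

Int4 is the fastest here as well as the smallest, roughly a 24% reduction in wall-clock time versus BF16.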

Fine-tuning Memory (Q-LoRA, batch_size=1, gradient_accumulation=8)

| Model Size | Min GPU Memory |
|---|---|
| 1.8B | 5.8 GB |
| 7B | 11.5 GB |
| 14B | 18.7 GB |
| 72B | 61.4 GB |

Model Downloads

Qwen-1.8B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-7B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-14B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Qwen-72B-Chat

🤗 HF | 🤖 MS | Int4 | Int8

Safety Considerations

While Qwen-Chat models include safety alignment, they may still generate inappropriate content in some cases. Developers should:
  • Perform red teaming before deployment
  • Implement content filtering for production use
  • Monitor outputs for harmful content
  • Comply with local regulations and policies

Next Steps

Model Selection

Choose the right chat model for your needs

Tool Usage

Learn to integrate external tools

Fine-tuning Chat

Customize chat models for your domain

Deployment

Deploy chat models to production
