
Overview

Qwen models are fully compatible with the Hugging Face Transformers library, which provides a standardized interface for loading models and running inference. This guide covers model loading, chat and base-model inference, generation parameters, quantization, and device mapping with Transformers.

Installation

Install the required packages:
pip install "transformers>=4.32.0"
pip install "torch>=2.0.0"
pip install accelerate
Transformers 4.32.0 or later is required for Qwen models; for best performance, use PyTorch 2.0 or later. Quote the version specifiers so the shell does not treat ">" as output redirection.

Loading Models from Hugging Face

Qwen models are available on Hugging Face Hub with various configurations:

Step 1: Initialize Tokenizer

Load the tokenizer with trust_remote_code=True:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)

Step 2: Load Model

Load the model with automatic device mapping:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

Step 3: Run Inference

Use the chat interface for conversational models:
response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)

Precision Options

Qwen supports multiple precision formats, including FP32, BF16, FP16, and pre-quantized Int8/Int4 variants, to balance performance and memory usage.
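
As a rough rule of thumb, weight memory is parameter count times bytes per parameter. The sketch below gives illustrative estimates for a 7B-parameter model; real usage is higher once activations and the KV cache are included:

```python
# Rough weight-memory estimates for a 7B-parameter model.
# Actual usage is larger (activations, KV cache, framework overhead).
PARAMS = 7_000_000_000

bytes_per_param = {
    "fp32": 4.0,   # full precision
    "bf16": 2.0,   # recommended on recent GPUs
    "fp16": 2.0,
    "int4": 0.5,   # e.g. Qwen-7B-Chat-Int4
}

for fmt, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{fmt}: ~{gib:.1f} GiB")
```

This is why a 7B model that needs roughly 26 GiB in FP32 fits comfortably on a single 24 GB GPU in BF16 or Int4.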

Chat Interface

The .chat() method provides a convenient interface for conversational models:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Single-turn conversation ("你好" = "Hello")
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# Output: 你好!很高兴为你提供帮助。 ("Hello! Happy to help.")

# Multi-turn: ask for a story about a young entrepreneur who succeeds
response, history = model.chat(
    tokenizer,
    "给我讲一个年轻人奋斗创业最终取得成功的故事。",
    history=history
)
print(response)

# Follow-up in the same conversation: ask for a title for the story
response, history = model.chat(
    tokenizer,
    "给这个故事起一个标题",
    history=history
)
print(response)
# Output: 《奋斗创业:一个年轻人的成功之路》
# ("Striving in Business: A Young Person's Road to Success")

History Format

The history is a list of tuples containing (query, response) pairs:
# Example history structure: a list of (query, response) tuples
history = [
    ("你好", "你好!很高兴为你提供帮助。"),      # "Hello" / "Hello! Happy to help."
    ("给我讲一个故事", "这是一个关于...的故事")  # "Tell me a story" / "This is a story about..."
]

# Pass existing history to continue the conversation ("继续" = "Continue")
response, history = model.chat(tokenizer, "继续", history=history)
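
If you need to interoperate with the role-based message format used elsewhere in the ecosystem, a small helper can flatten the tuple history into messages. `history_to_messages` below is a hypothetical, illustrative helper, not part of the Qwen API:

```python
def history_to_messages(history, system=None):
    """Convert Qwen-style (query, response) history tuples into a
    role-based message list. Illustrative helper, not part of the API."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    for query, response in history or []:
        messages.append({"role": "user", "content": query})
        messages.append({"role": "assistant", "content": response})
    return messages

history = [("Hello", "Hi! How can I help?")]
print(history_to_messages(history))
```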

Generation Parameters

Customize text generation with GenerationConfig:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Create custom generation config
generation_config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)

# Customize parameters
generation_config.top_p = 0.8
generation_config.temperature = 0.7
generation_config.top_k = 20
generation_config.max_new_tokens = 512
generation_config.repetition_penalty = 1.1

# Use in inference
response, history = model.chat(
    tokenizer,
    "Write a creative story",
    history=None,
    generation_config=generation_config
)
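
To build intuition for what repetition_penalty does, here is a simplified sketch of the CTRL-style penalty that Transformers applies during generation: logits of tokens that have already appeared are dampened before sampling. This is illustrative only, not the library's actual implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Simplified sketch of a CTRL-style repetition penalty:
    dampen the logits of tokens that were already generated."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits
        else:
            out[tok] *= penalty   # push negative logits further down
    return out

logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 were already generated, so their logits are penalized.
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.1))
```

A penalty of 1.0 leaves logits unchanged; values around 1.1 mildly discourage repetition.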

Common Parameters

| Parameter          | Default | Description                                      |
|--------------------|---------|--------------------------------------------------|
| max_new_tokens     | 512     | Maximum number of tokens to generate             |
| temperature        | 1.0     | Controls randomness (lower = more deterministic) |
| top_p              | 0.8     | Nucleus sampling threshold                       |
| top_k              | 0       | Top-k sampling (0 = disabled)                    |
| repetition_penalty | 1.0     | Penalty for repeating tokens                     |
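
The top_p parameter can be illustrated with a small, self-contained sketch of nucleus sampling: keep the most probable tokens until their cumulative probability reaches top_p, then renormalize:

```python
def top_p_filter(probs, top_p=0.8):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p (nucleus sampling), then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Token 2 dominates; with top_p=0.8 the low-probability tail is dropped.
print(top_p_filter([0.1, 0.2, 0.6, 0.1], top_p=0.8))
```

Lower top_p values restrict sampling to fewer, higher-probability tokens, which is why they produce more focused output.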

Base Model Usage

For non-chat models, use the standard .generate() method:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Few-shot prompt (in Chinese): "The capital of Mongolia is Ulaanbaatar /
# The capital of Iceland is Reykjavik / The capital of Ethiopia is ..."
prompt = "蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是"
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to(model.device)

# Generate
output = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(output.cpu()[0], skip_special_tokens=True)
print(result)

Quantized Models

Load pre-quantized models for reduced memory usage:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
Quantized models provide significant memory savings with minimal performance degradation. See the quantization guide for more details.

KV Cache Quantization

Enable KV cache quantization for higher throughput:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False  # Cannot use with cache quantization
).eval()

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)
KV cache quantization and flash attention cannot be enabled simultaneously. If both are set to True, flash attention will be automatically disabled.
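
To see why cache quantization matters, here is a back-of-the-envelope estimate of KV cache size, assuming Qwen-7B's published architecture (32 layers, 32 attention heads, head dimension 128). Treat these as illustrative approximations:

```python
# Illustrative KV-cache size estimate, assuming Qwen-7B's architecture
# (32 layers, 32 heads, head dim 128). The cache stores one key vector
# and one value vector per layer, per head, per position.
layers, heads, head_dim = 32, 32, 128
seq_len, batch = 2048, 1

def kv_cache_gib(bytes_per_elem):
    elems = 2 * batch * layers * heads * head_dim * seq_len  # 2 = key + value
    return elems * bytes_per_elem / 1024**3

print(f"fp16 cache: ~{kv_cache_gib(2):.2f} GiB")  # unquantized
print(f"int8 cache: ~{kv_cache_gib(1):.2f} GiB")  # quantized
```

Halving cache memory lets the same GPU serve longer contexts or larger batches, which is where the throughput gain comes from.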

Device Mapping

Control which devices the model uses:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# Automatic device mapping (recommended)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specific GPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cuda:0",
    trust_remote_code=True
).eval()

# CPU only
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True
).eval()

Performance Tips

Install flash-attention for 30-40% speedup:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
No code changes needed - Qwen automatically detects and uses flash attention.
On supported hardware, BF16 provides the best balance:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    bf16=True
).eval()
Adjust generation parameters for your use case:
# Faster, more focused generation
config.temperature = 0.5
config.top_p = 0.7
config.max_new_tokens = 256

# More creative, diverse generation
config.temperature = 1.0
config.top_p = 0.95
config.max_new_tokens = 1024
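
The effect of temperature can be seen in a minimal sketch: the sampling distribution is softmax(logits / T), so lowering T sharpens the distribution toward the top token, while raising T flattens it:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Sampling distribution after temperature scaling: softmax(logits / T).
    Lower T sharpens the distribution; higher T flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 1.0))  # flatter: more diverse sampling
```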

Next Steps

ModelScope Integration: Use Qwen with the ModelScope platform.

Batch Inference: Process multiple requests efficiently.

Streaming: Stream tokens as they are generated.

Multi-GPU: Scale across multiple GPUs.
