Get up and running with Qwen models in just a few minutes. This guide will help you load a model and start generating text or having conversations.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher
  • PyTorch 1.12 or higher (2.0+ recommended)
  • CUDA 11.4 or higher (for GPU users)
  • At least 8GB GPU memory for Qwen-7B (16GB+ recommended)
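A quick way to sanity-check these prerequisites is to compare installed versions against the minimums. The sketch below uses only the standard library plus an optional torch import; meets_minimum is an illustrative helper, not part of any Qwen tooling:

```python
import sys

def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically (e.g. '2.1.0' >= '1.12')."""
    def to_tuple(v):
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return to_tuple(installed) >= to_tuple(required)

# Check the Python interpreter itself
py_version = ".".join(map(str, sys.version_info[:3]))
assert meets_minimum(py_version, "3.8"), "Python 3.8+ required"

# Check PyTorch and CUDA availability, if torch is already installed
try:
    import torch
    ok = meets_minimum(torch.__version__, "1.12")
    print("PyTorch:", torch.__version__, "OK" if ok else "too old")
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet")
```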

Installation

Install the required dependencies:
pip install "transformers>=4.32.0" accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
For detailed installation instructions including flash-attention setup, see the Installation guide.

Quick Start with Qwen-Chat

The fastest way to start using Qwen is with the chat models. Here’s a simple example:
Step 1: Import Libraries

Import the necessary libraries from Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
Step 2: Load Model and Tokenizer

Load the Qwen-7B-Chat model and tokenizer:
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Load model with automatic precision selection
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
The device_map="auto" parameter automatically handles device placement and multi-GPU distribution.
Step 3: Start Chatting

Use the chat interface to have a conversation:
# First conversation turn
response, history = model.chat(
    tokenizer,
    "Hello! Can you introduce yourself?",
    history=None
)
print(response)
# Output: 你好!很高兴为你提供帮助。 ("Hello! I'm happy to help you.")

# Continue the conversation
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history
)
print(response)
The history parameter maintains conversation context across multiple turns.
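With this chat interface, history is a list of (query, response) pairs. If a long conversation starts to exceed the context window, you can trim older turns before passing history back in. This is a hedged sketch: keep_last_turns is an illustrative helper, not part of the Qwen API:

```python
def keep_last_turns(history, max_turns=5):
    """Keep only the most recent (query, response) pairs to bound context length."""
    if history is None:
        return None
    return history[-max_turns:]

# Example: a conversation that has grown to 8 turns
history = [(f"question {i}", f"answer {i}") for i in range(8)]
trimmed = keep_last_turns(history, max_turns=5)
print(len(trimmed))  # → 5
print(trimmed[0])    # → ('question 3', 'answer 3')
```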

Available Models

Qwen provides models in various sizes to suit different needs:

Qwen-1.8B-Chat

Smallest model, fastest inference, lowest memory requirements (2.9GB GPU)

Qwen-7B-Chat

Balanced performance and efficiency (8.2GB GPU)

Qwen-14B-Chat

High performance for complex tasks (13.0GB GPU)

Qwen-72B-Chat

Best performance, highest capabilities (requires 2xA100 or 48.9GB with Int4)
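The size tiers above can be turned into a simple selection rule: pick the largest model whose listed footprint fits your GPU. The helper below is purely illustrative (the thresholds mirror the figures above; it is not part of any Qwen tooling, and the Int4 fallback name is an assumption):

```python
def suggest_chat_model(gpu_memory_gb: float) -> str:
    """Pick the largest Qwen chat model whose listed memory requirement fits."""
    if gpu_memory_gb >= 144:   # 2xA100-class setups
        return "Qwen/Qwen-72B-Chat"
    if gpu_memory_gb >= 13.0:
        return "Qwen/Qwen-14B-Chat"
    if gpu_memory_gb >= 8.2:
        return "Qwen/Qwen-7B-Chat"
    if gpu_memory_gb >= 2.9:
        return "Qwen/Qwen-1.8B-Chat"
    # Below the smallest full-precision tier, fall back to a quantized variant
    return "Qwen/Qwen-1.8B-Chat-Int4"

print(suggest_chat_model(24))  # → Qwen/Qwen-14B-Chat
```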

Model Precision Options

Qwen supports multiple precision formats for different hardware and performance needs. For example, pass bf16=True to load the model in BF16; on GPUs without BF16 support, use fp16=True instead:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    bf16=True
).eval()
CPU-only inference is significantly slower than GPU inference. For production use, we recommend GPU acceleration or consider using qwen.cpp for optimized CPU inference.

Using Base Models

If you need the base language model (without chat alignment) for completion tasks:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Generate a text completion. The prompt lists capital cities and ends
# mid-sentence: "Mongolia's capital is Ulaanbaatar / Iceland's capital is
# Reykjavik / Ethiopia's capital is ..."
inputs = tokenizer(
    '蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是',
    return_tensors='pt'
)
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Using ModelScope (Alternative)

For users in regions with better access to ModelScope:
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    fp16=True
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Running the CLI Demo

Qwen includes a command-line interactive chat demo:
python cli_demo.py -c Qwen/Qwen-7B-Chat
The CLI demo supports several commands:
  • :help or :h - Show help message
  • :exit or :q - Exit the demo
  • :clear or :cl - Clear screen
  • :clear-his or :clh - Clear conversation history
  • :history or :his - Show conversation history
  • :seed <N> - Set random seed
  • :conf - Show current generation config
  • :conf <key>=<value> - Change generation config
  • :reset-conf - Reset generation config
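The :conf <key>=<value> command maps a string like top_p=0.9 onto a generation-config attribute. A minimal sketch of how such a command could be parsed, coercing values to a sensible type (illustrative; the actual cli_demo.py implementation may differ):

```python
def parse_conf_command(arg):
    """Parse 'key=value', coercing the value to int/float/bool where possible."""
    key, _, raw = arg.partition("=")
    if not key or not raw:
        raise ValueError(f"expected key=value, got {arg!r}")
    # Try numeric types first, narrowest to widest
    for cast in (int, float):
        try:
            return key, cast(raw)
        except ValueError:
            pass
    # Then booleans, then fall back to the raw string
    if raw.lower() in ("true", "false"):
        return key, raw.lower() == "true"
    return key, raw

print(parse_conf_command("top_p=0.9"))           # → ('top_p', 0.9)
print(parse_conf_command("max_new_tokens=512"))  # → ('max_new_tokens', 512)
```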

CLI Demo Options

# Use CPU only
python cli_demo.py -c Qwen/Qwen-7B-Chat --cpu-only

# Set random seed
python cli_demo.py -c Qwen/Qwen-7B-Chat -s 42

# Use local checkpoint
python cli_demo.py -c /path/to/local/checkpoint

Using Quantized Models

For lower memory requirements and faster inference, use quantized models:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
Quantized models show minimal quality degradation while significantly reducing memory requirements. See the Quantization section for detailed performance comparisons.

Memory Requirements

Here’s a quick reference for GPU memory requirements when generating 2048 tokens:
Model Size    BF16                 Int8                 Int4
Qwen-1.8B     4.23GB               3.48GB               2.91GB
Qwen-7B       16.99GB              11.20GB              8.21GB
Qwen-14B      30.15GB              18.81GB              13.01GB
Qwen-72B      144.69GB (2xA100)    81.27GB (2xA100)     48.86GB
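These figures are roughly consistent with a back-of-the-envelope estimate: the weights alone take parameter_count × bytes_per_parameter, with the remainder going to the KV cache and activations for the 2048 generated tokens. A lower-bound sketch (the byte widths are standard for these dtypes; the numbers are approximations, not measurements):

```python
# Bytes per parameter for common precisions (Int4 packs two params per byte)
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params, precision):
    """Lower bound: memory for the weights alone, in GiB.
    Excludes the KV cache and activations, so real usage is higher."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

# Qwen-7B in BF16: ~13 GiB of weights; the table's 16.99GB
# additionally covers the KV cache for 2048 generated tokens.
print(round(weight_memory_gib(7e9, "bf16"), 2))  # → 13.04
```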

Troubleshooting

If you see errors about unrecognized model types or missing modules, make sure you’re using transformers>=4.32.0. Update with:
pip install --upgrade transformers

If you run out of GPU memory, try these solutions:
  1. Use a smaller model (e.g., Qwen-1.8B or Qwen-7B)
  2. Use quantized models (Int4 or Int8)
  3. Reduce the maximum sequence length
  4. Enable gradient checkpointing for training

If generation is slow, try these steps to improve inference speed:
  1. Install flash-attention (see Installation)
  2. Use quantized models for faster generation
  3. Consider using vLLM for production deployments
  4. Use GPU instead of CPU

If you have trouble downloading from Hugging Face, download the checkpoint through ModelScope instead:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download to local directory
model_dir = snapshot_download('qwen/Qwen-7B-Chat')

# Load from local directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()

Next Steps

Installation

Complete installation guide with flash-attention and Docker setup

Model Selection

Choose the right model for your use case

Inference Guide

Learn advanced inference techniques

Fine-tuning

Customize Qwen for your specific tasks
