Get up and running with Qwen models in just a few minutes. This guide will help you load a model and start generating text or having conversations.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher
  • PyTorch 1.12 or higher (2.0+ recommended)
  • CUDA 11.4 or higher (for GPU users)
  • At least 8GB GPU memory for Qwen-7B (16GB+ recommended)
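A quick way to sanity-check these prerequisites is to compare installed versions against the minimums. The sketch below uses only the standard library plus an optional torch import; meets_minimum is an illustrative helper, not part of any Qwen tooling:

```python
import sys

def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically (e.g. '2.1.0' >= '1.12')."""
    def to_tuple(v):
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return to_tuple(installed) >= to_tuple(required)

# Check the Python interpreter itself
py_version = ".".join(map(str, sys.version_info[:3]))
assert meets_minimum(py_version, "3.8"), "Python 3.8+ required"

# Check PyTorch and CUDA availability, if torch is already installed
try:
    import torch
    ok = meets_minimum(torch.__version__, "1.12")
    print("PyTorch:", torch.__version__, "OK" if ok else "too old")
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet")
```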

Installation

Install the required dependencies:
pip install "transformers>=4.32.0" accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
For detailed installation instructions including flash-attention setup, see the Installation guide.

Quick Start with Qwen-Chat

The fastest way to start using Qwen is with the chat models. Here’s a simple example:
Step 1: Import Libraries

Import the necessary libraries from Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
Step 2: Load Model and Tokenizer

Load the Qwen-7B-Chat model and tokenizer:
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Load model with automatic precision selection
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
The device_map="auto" parameter automatically handles device placement and multi-GPU distribution.
Step 3: Start Chatting

Use the chat interface to have a conversation:
# First conversation turn
response, history = model.chat(
    tokenizer,
    "Hello! Can you introduce yourself?",
    history=None
)
print(response)
# Output: 你好!很高兴为你提供帮助。 ("Hello! I'm happy to help you.")

# Continue the conversation
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history
)
print(response)
The history parameter maintains conversation context across multiple turns.
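With this chat interface, history is a list of (query, response) pairs. If a long conversation starts to exceed the context window, you can trim older turns before passing history back in. This is a hedged sketch: keep_last_turns is an illustrative helper, not part of the Qwen API:

```python
def keep_last_turns(history, max_turns=5):
    """Keep only the most recent (query, response) pairs to bound context length."""
    if history is None:
        return None
    return history[-max_turns:]

# Example: a conversation that has grown to 8 turns
history = [(f"question {i}", f"answer {i}") for i in range(8)]
trimmed = keep_last_turns(history, max_turns=5)
print(len(trimmed))  # → 5
print(trimmed[0])    # → ('question 3', 'answer 3')
```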

Available Models

Qwen provides models in various sizes to suit different needs:

Qwen-1.8B-Chat

Smallest model, fastest inference, lowest memory requirements (2.9GB GPU)

Qwen-7B-Chat

Balanced performance and efficiency (8.2GB GPU)

Qwen-14B-Chat

High performance for complex tasks (13.0GB GPU)

Qwen-72B-Chat

Best performance, highest capabilities (requires 2xA100 or 48.9GB with Int4)
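The size tiers above can be turned into a simple selection rule: pick the largest model whose listed footprint fits your GPU. The helper below is purely illustrative (the thresholds mirror the figures above; it is not part of any Qwen tooling, and the Int4 fallback name is an assumption):

```python
def suggest_chat_model(gpu_memory_gb: float) -> str:
    """Pick the largest Qwen chat model whose listed memory requirement fits."""
    if gpu_memory_gb >= 144:   # 2xA100-class setups
        return "Qwen/Qwen-72B-Chat"
    if gpu_memory_gb >= 13.0:
        return "Qwen/Qwen-14B-Chat"
    if gpu_memory_gb >= 8.2:
        return "Qwen/Qwen-7B-Chat"
    if gpu_memory_gb >= 2.9:
        return "Qwen/Qwen-1.8B-Chat"
    # Below the smallest full-precision tier, fall back to a quantized variant
    return "Qwen/Qwen-1.8B-Chat-Int4"

print(suggest_chat_model(24))  # → Qwen/Qwen-14B-Chat
```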

Model Precision Options

Qwen supports multiple precision formats for different hardware and performance needs. For example, pass bf16=True to load the model in BF16; on GPUs without BF16 support, use fp16=True instead:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    bf16=True
).eval()
CPU-only inference is significantly slower than GPU inference. For production use, we recommend GPU acceleration or consider using qwen.cpp for optimized CPU inference.

Using Base Models

If you need the base language model (without chat alignment) for completion tasks:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Generate a text completion. The prompt lists capital cities and ends
# mid-sentence: "Mongolia's capital is Ulaanbaatar / Iceland's capital is
# Reykjavik / Ethiopia's capital is ..."
inputs = tokenizer(
    '蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是',
    return_tensors='pt'
)
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Using ModelScope (Alternative)

For users in regions with better access to ModelScope:
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    fp16=True
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Running the CLI Demo

Qwen includes a command-line interactive chat demo:
python cli_demo.py -c Qwen/Qwen-7B-Chat
The CLI demo supports several commands:
  • :help or :h - Show help message
  • :exit or :q - Exit the demo
  • :clear or :cl - Clear screen
  • :clear-his or :clh - Clear conversation history
  • :history or :his - Show conversation history
  • :seed <N> - Set random seed
  • :conf - Show current generation config
  • :conf <key>=<value> - Change generation config
  • :reset-conf - Reset generation config
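The :conf <key>=<value> command maps a string like top_p=0.9 onto a generation-config attribute. A minimal sketch of how such a command could be parsed, coercing values to a sensible type (illustrative; the actual cli_demo.py implementation may differ):

```python
def parse_conf_command(arg):
    """Parse 'key=value', coercing the value to int/float/bool where possible."""
    key, _, raw = arg.partition("=")
    if not key or not raw:
        raise ValueError(f"expected key=value, got {arg!r}")
    # Try numeric types first, narrowest to widest
    for cast in (int, float):
        try:
            return key, cast(raw)
        except ValueError:
            pass
    # Then booleans, then fall back to the raw string
    if raw.lower() in ("true", "false"):
        return key, raw.lower() == "true"
    return key, raw

print(parse_conf_command("top_p=0.9"))           # → ('top_p', 0.9)
print(parse_conf_command("max_new_tokens=512"))  # → ('max_new_tokens', 512)
```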

CLI Demo Options

# Use CPU only
python cli_demo.py -c Qwen/Qwen-7B-Chat --cpu-only

# Set random seed
python cli_demo.py -c Qwen/Qwen-7B-Chat -s 42

# Use local checkpoint
python cli_demo.py -c /path/to/local/checkpoint

Using Quantized Models

For lower memory requirements and faster inference, use quantized models:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

response, history = model.chat(tokenizer, "Hi", history=None)
print(response)
Quantized models show minimal quality degradation while significantly reducing memory requirements. See the Quantization section for detailed performance comparisons.

Memory Requirements

Here’s a quick reference for GPU memory requirements when generating 2048 tokens:
Model Size    BF16                 Int8                 Int4
Qwen-1.8B     4.23GB               3.48GB               2.91GB
Qwen-7B       16.99GB              11.20GB              8.21GB
Qwen-14B      30.15GB              18.81GB              13.01GB
Qwen-72B      144.69GB (2xA100)    81.27GB (2xA100)     48.86GB
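These figures are roughly consistent with a back-of-the-envelope estimate: the weights alone take parameter_count × bytes_per_parameter, with the remainder going to the KV cache and activations for the 2048 generated tokens. A lower-bound sketch (the byte widths are standard for these dtypes; the numbers are approximations, not measurements):

```python
# Bytes per parameter for common precisions (Int4 packs two params per byte)
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params, precision):
    """Lower bound: memory for the weights alone, in GiB.
    Excludes the KV cache and activations, so real usage is higher."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

# Qwen-7B in BF16: ~13 GiB of weights; the table's 16.99GB
# additionally covers the KV cache for 2048 generated tokens.
print(round(weight_memory_gib(7e9, "bf16"), 2))  # → 13.04
```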

Troubleshooting

If you see errors about unrecognized model types or missing modules, make sure you’re using transformers>=4.32.0. Update with:
pip install --upgrade transformers

If you run out of GPU memory, try these solutions:
  1. Use a smaller model (e.g., Qwen-1.8B or Qwen-7B)
  2. Use quantized models (Int4 or Int8)
  3. Reduce the maximum sequence length
  4. Enable gradient checkpointing for training

If generation is slow, try these steps to improve inference speed:
  1. Install flash-attention (see Installation)
  2. Use quantized models for faster generation
  3. Consider using vLLM for production deployments
  4. Use GPU instead of CPU

If you have trouble downloading from Hugging Face, download the checkpoint through ModelScope instead:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download to local directory
model_dir = snapshot_download('qwen/Qwen-7B-Chat')

# Load from local directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()

Next Steps

Installation

Complete installation guide with flash-attention and Docker setup

Model Selection

Choose the right model for your use case

Inference Guide

Learn advanced inference techniques

Fine-tuning

Customize Qwen for your specific tasks
