
Overview

ModelScope is an open-source platform for Model-as-a-Service (MaaS), providing flexible and cost-effective model services to AI developers. Qwen models are fully integrated with ModelScope, offering an alternative to Hugging Face for model hosting and inference.

Why ModelScope?

Network Optimization

Better download speeds in certain regions, especially Asia

Local Mirror

Alternative when Hugging Face is unavailable

MaaS Platform

Integrated model service infrastructure

Same API

Compatible interface with Transformers

Installation

Install ModelScope and required dependencies:
pip install modelscope
pip install "transformers>=4.32.0"
pip install "torch>=2.0.0"
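Version mismatches are a common source of loading errors, so it can help to confirm the installed packages meet the minimums above before downloading a multi-gigabyte checkpoint. A minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`) are illustrative, not part of ModelScope:

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Turn '4.32.0' (or '2.1.0+cu118') into a comparable (4, 32, 0) tuple."""
    parts = v.split("+")[0].split(".")[:3]
    nums = []
    for p in parts:
        digits = "".join(ch for ch in p if ch.isdigit())
        nums.append(int(digits) if digits else 0)
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)

def meets_minimum(installed: str, minimum: str) -> bool:
    return parse_version(installed) >= parse_version(minimum)

# Minimums taken from the install commands above
for pkg, minimum in [("modelscope", "0"), ("transformers", "4.32.0"), ("torch", "2.0.0")]:
    try:
        installed = version(pkg)
        print(f"{pkg} {installed}: {'OK' if meets_minimum(installed, minimum) else 'too old'}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```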

Direct ModelScope Usage

Use ModelScope’s API for loading and running Qwen models:
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    device_map="auto", 
    trust_remote_code=True, 
    fp16=True
).eval()

# Configure generation parameters
model.generation_config = GenerationConfig.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)

# First turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

# Second turn
response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history) 
print(response)

# Third turn
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
Note the model naming convention: ModelScope uses qwen/Qwen-7B-Chat while Hugging Face uses Qwen/Qwen-7B-Chat (capital Q).
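Since the only difference between the two ids is the casing of the organization segment, a tiny helper can translate between them. `to_modelscope_id` and `to_huggingface_id` are illustrative names, not part of either library:

```python
def to_modelscope_id(hf_id: str) -> str:
    """Map a Hugging Face Qwen repo id to its ModelScope equivalent
    by lowercasing the organization segment (illustrative helper)."""
    org, _, name = hf_id.partition("/")
    return f"{org.lower()}/{name}"

def to_huggingface_id(ms_id: str) -> str:
    """Reverse mapping: ModelScope uses 'qwen', Hugging Face uses 'Qwen'."""
    org, _, name = ms_id.partition("/")
    return f"{org.capitalize()}/{name}"

print(to_modelscope_id("Qwen/Qwen-7B-Chat"))   # qwen/Qwen-7B-Chat
print(to_huggingface_id("qwen/Qwen-7B-Chat"))  # Qwen/Qwen-7B-Chat
```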

Download and Use Locally

Download models from ModelScope and use them with Transformers:
Step 1: Download Model

Use snapshot_download to fetch the model:
from modelscope import snapshot_download

# Download to local directory
model_dir = snapshot_download('qwen/Qwen-7B-Chat')
print(f"Model downloaded to: {model_dir}")
Available models:
  • qwen/Qwen-1_8B-Chat
  • qwen/Qwen-7B-Chat
  • qwen/Qwen-14B-Chat
  • qwen/Qwen-72B-Chat
Step 2: Load with Transformers

Load the downloaded model using Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from local directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir, 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()
Step 3: Run Inference

Use the model as normal:
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Complete Example

Here’s a complete workflow that downloads a checkpoint from ModelScope and runs multi-turn inference with Transformers:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: Download model checkpoint to local directory
model_dir = snapshot_download('qwen/Qwen-14B-Chat')

# Step 2: Load model and tokenizer from local directory
# trust_remote_code is still required for custom model code
tokenizer = AutoTokenizer.from_pretrained(
    model_dir, 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()

# Step 3: Run inference
response, history = model.chat(
    tokenizer, 
    "请介绍一下中国的历史", 
    history=None
)
print(response)

response, history = model.chat(
    tokenizer, 
    "能详细说说唐朝吗?", 
    history=history
)
print(response)
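The two-turn pattern above generalizes to a loop: each call to model.chat returns an updated history, which is passed back into the next call. A sketch with the chat callable injected so the conversation plumbing is visible; run_conversation is an illustrative helper, and chat_fn follows model.chat's (tokenizer, query, history) shape:

```python
def run_conversation(chat_fn, tokenizer, prompts):
    """Thread the history returned by each turn into the next turn.

    chat_fn follows model.chat's calling convention:
        response, history = chat_fn(tokenizer, query, history=...)
    """
    history = None
    responses = []
    for prompt in prompts:
        response, history = chat_fn(tokenizer, prompt, history=history)
        responses.append(response)
    return responses, history

# With a real model this would be:
# responses, history = run_conversation(model.chat, tokenizer,
#                                       ["请介绍一下中国的历史", "能详细说说唐朝吗?"])
```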

Model Variants on ModelScope

All Qwen model variants are available on ModelScope:
Fine-tuned for conversation:
  • qwen/Qwen-1_8B-Chat
  • qwen/Qwen-7B-Chat
  • qwen/Qwen-14B-Chat
  • qwen/Qwen-72B-Chat
from modelscope import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

Precision Options

ModelScope supports the same precision options as Transformers:
from modelscope import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    fp16=True
).eval()
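Loading in fp16 halves weight memory relative to fp32, which is often what makes a 7B model fit on a single GPU. A rough back-of-the-envelope estimate for the weights alone (activations and KV cache add more); the ~7.7 billion parameter count for Qwen-7B is an approximate figure:

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

# Qwen-7B has roughly 7.7 billion parameters (approximate figure)
for dtype, nbytes in [("fp32", 4), ("fp16", 2)]:
    print(f"{dtype}: ~{weight_memory_gib(7.7e9, nbytes):.1f} GiB")
```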

Generation Configuration

Customize generation behavior with ModelScope’s GenerationConfig:
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Load and customize generation config
generation_config = GenerationConfig.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True
)

# Adjust parameters
generation_config.top_p = 0.8
generation_config.temperature = 0.7
generation_config.max_new_tokens = 512
generation_config.repetition_penalty = 1.05

# Apply to model
model.generation_config = generation_config

# Use in chat
response, history = model.chat(
    tokenizer, 
    "讲一个有趣的故事", 
    history=None
)
print(response)
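To make the top_p knob above concrete, here is a minimal sketch of what nucleus filtering does during sampling: keep the smallest set of highest-probability tokens whose cumulative mass reaches top_p, then renormalize. The real implementation lives inside generate(); this toy version over a plain dict is for intuition only:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest prefix of tokens (sorted by probability)
    whose cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Keeps "a" and "b" (0.5 + 0.25 reaches 0.75), renormalized to sum to 1
print(top_p_filter({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}, 0.75))
```

A lower top_p concentrates sampling on fewer, more likely tokens; temperature instead rescales the whole distribution before this cutoff is applied.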

Caching and Resume

ModelScope can resume interrupted downloads automatically:
from modelscope import snapshot_download

# Download with automatic resume on interruption
model_dir = snapshot_download(
    'qwen/Qwen-7B-Chat',
    cache_dir='/path/to/cache',  # Specify cache directory
    revision='v1.0.0'  # Pin to specific version
)

print(f"Model cached at: {model_dir}")

Comparing ModelScope and Hugging Face

Feature           | ModelScope           | Hugging Face
------------------|----------------------|---------------------
Model naming      | qwen/Qwen-7B-Chat    | Qwen/Qwen-7B-Chat
Download speed    | Faster in Asia       | Faster in US/EU
API compatibility | Same as Transformers | Native Transformers
Cache location    | ~/.cache/modelscope  | ~/.cache/huggingface
Resume download   | ✅ Yes               | ✅ Yes

Hybrid Approach

Download from ModelScope but load config from Hugging Face:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Download model from ModelScope
model_dir = snapshot_download('qwen/Qwen-7B-Chat')

# Load model and tokenizer locally
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
).eval()

# But load generation config from Hugging Face (if you prefer)
model.generation_config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",  # Note: Capital Q for HF
    trust_remote_code=True
)

response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Troubleshooting

If downloads are slow or failing, check your network connection, or point the ModelScope cache at faster local storage before downloading:
import os
os.environ['MODELSCOPE_CACHE'] = '/path/to/fast/storage'

from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen-7B-Chat')
If a model cannot be found, make sure you’re using the correct naming convention for the hub you’re downloading from:
  • ModelScope: qwen/Qwen-7B-Chat (lowercase q)
  • Hugging Face: Qwen/Qwen-7B-Chat (capital Q)
If loading fails with an error about executing custom code, include trust_remote_code=True; Qwen’s modeling code ships with the checkpoint rather than with the Transformers library:
tokenizer = AutoTokenizer.from_pretrained(
    "qwen/Qwen-7B-Chat", 
    trust_remote_code=True  # Required!
)

Next Steps

Batch Inference

Process multiple inputs efficiently

Streaming

Stream responses in real-time

Multi-GPU

Scale across multiple GPUs

Transformers Guide

Deep dive into Transformers usage
