
Overview

Qwen models and tokenizers are loaded through the standard Hugging Face Transformers interfaces, either from the Hugging Face Hub or from a local path. The sections below cover the model, tokenizer, and generation-config loading calls.

Load Model

Load a Qwen model for causal language modeling:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    resume_download=True
).eval()

Parameters

model_name_or_path (str, required)
  Model checkpoint name (e.g., "Qwen/Qwen-7B-Chat") or local path to a model directory.

device_map (str | dict, default: None)
  Device allocation strategy:
  • "auto": automatically distribute the model across available devices
  • "cpu": load the model on CPU only
  • "cuda": load the model on GPU
  • a dictionary mapping module names to specific devices

trust_remote_code (bool, default: False)
  Allow execution of custom modeling code from the model repository. Required for Qwen models.

resume_download (bool, default: False)
  Resume incomplete downloads from the Hugging Face Hub. Note: deprecated in recent versions of huggingface_hub, where downloads always resume.

torch_dtype (torch.dtype, default: None)
  Data type for model weights (e.g., torch.float16, torch.bfloat16).

low_cpu_mem_usage (bool, default: False)
  Reduce CPU memory usage during model loading (useful for large models).

quantization_config (GPTQConfig | BitsAndBytesConfig, default: None)
  Configuration for model quantization (4-bit, 8-bit).
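As a sketch of the dictionary form of device_map, the snippet below pins the embedding and the first half of the decoder blocks to GPU 0 and the rest to GPU 1. The module names follow the Qwen-7B layout ("transformer.h.<i>" for its 32 decoder blocks) as an assumption for illustration; verify them against your checkpoint with model.named_modules().

```python
# Hypothetical two-GPU split for Qwen-7B-Chat. Integer values are device
# indices; keys are module names in the checkpoint.
device_map = {
    "transformer.wte": 0,                               # token embeddings
    **{f"transformer.h.{i}": 0 for i in range(16)},     # blocks 0-15 on GPU 0
    **{f"transformer.h.{i}": 1 for i in range(16, 32)}, # blocks 16-31 on GPU 1
    "transformer.ln_f": 1,                              # final layer norm
    "lm_head": 1,                                       # output projection
}
```

Pass this dictionary as device_map=device_map in place of "auto" to control placement explicitly.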

Returns

model (AutoModelForCausalLM)
  Loaded Qwen model ready for inference or fine-tuning.

Load Tokenizer

Load the tokenizer associated with a Qwen model:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

Parameters

pretrained_model_name_or_path (str, required)
  Model checkpoint name or path. Should match the model being loaded.

trust_remote_code (bool, default: False)
  Allow execution of custom tokenizer code. Required for Qwen models.

resume_download (bool, default: False)
  Resume incomplete downloads from the Hugging Face Hub.

padding_side (str, default: "right")
  Side on which padding tokens are added:
  • "right": pad on the right (recommended for training)
  • "left": pad on the left (recommended for generation)

use_fast (bool, default: True)
  Use the fast tokenizer implementation if available.

model_max_length (int, default: None)
  Maximum sequence length for tokenization.
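The padding_side recommendation can be made concrete with a toy sketch (plain lists of token ids; pad id 0 is an assumption for illustration). With right padding, the final positions of shorter sequences are pad tokens, so batched generation would condition its next-token prediction on padding; left padding keeps every row's last position a real token.

```python
pad_id = 0
batch = [[11, 12, 13], [21, 22]]  # two toy sequences of unequal length
max_len = max(len(seq) for seq in batch)

# Left padding: pads go in front, real tokens end at the last position.
left_padded = [[pad_id] * (max_len - len(s)) + s for s in batch]
# Right padding: pads go at the end, burying the last real token.
right_padded = [s + [pad_id] * (max_len - len(s)) for s in batch]

# Every row's final position is a real token only under left padding.
assert all(row[-1] != pad_id for row in left_padded)
assert right_padded[1][-1] == pad_id
```

This is why generation code typically sets padding_side="left" before calling batched model.generate.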

Returns

tokenizer (AutoTokenizer)
  Loaded tokenizer with special tokens configured for Qwen.

Load Generation Config

Load generation configuration from a checkpoint:
from transformers.generation import GenerationConfig

config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

Parameters

pretrained_model_name (str, required)
  Model checkpoint name or path containing generation_config.json.

trust_remote_code (bool, default: False)
  Allow execution of custom configuration code.

resume_download (bool, default: False)
  Resume incomplete downloads.

Returns

config (GenerationConfig)
  Generation configuration with model-specific defaults.
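A GenerationConfig can also be built or adjusted locally rather than loaded from a checkpoint; fields are plain attributes. The sampling values below are illustrative only, not Qwen's shipped defaults.

```python
from transformers.generation import GenerationConfig

# Construct a config locally (no download); keyword names are standard
# transformers generation parameters.
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=True,
    top_p=0.8,
    temperature=0.7,
)
# Fields can be overridden attribute-style after loading or construction.
generation_config.repetition_penalty = 1.1
```

The same attribute-style overrides work on a config returned by GenerationConfig.from_pretrained.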

Complete Loading Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    resume_download=True
).eval()

# Load generation config
generation_config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

# Assign to model
model.generation_config = generation_config
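With all three objects loaded, inference follows the usual tokenize → generate → decode pattern. The helper below is a sketch; the function name and structure are ours, not part of the Qwen or Transformers API. It moves inputs to the model's device when the tokenizer output supports it (BatchEncoding does).

```python
def generate_reply(model, tokenizer, prompt, generation_config=None):
    """Tokenize a prompt, run generate, and decode the first sequence."""
    inputs = tokenizer(prompt, return_tensors="pt")
    if hasattr(inputs, "to"):  # BatchEncoding supports .to(device)
        inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, generation_config=generation_config)
    # Decode the first (and only) sequence in the batch.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage with the objects loaded above: generate_reply(model, tokenizer, "Hello!", generation_config). Note the decoded string includes the prompt; slice output_ids past the input length if you want only the completion.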

CPU-Only Loading

For environments without GPU:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True,
    resume_download=True
).eval()

Quantized Loading

Load a pre-quantized GPTQ checkpoint for reduced memory usage. Note the Int4 model ID: passing a GPTQConfig when loading the non-quantized "Qwen/Qwen-7B-Chat" would trigger on-the-fly quantization, which requires a calibration dataset. Here the config only overrides settings (e.g., disabling the exllama kernels) for an already-quantized model:
from transformers import GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=GPTQConfig(bits=4, disable_exllama=True)
).eval()
