
Overview

Qwen models and tokenizers are loaded through the standard Hugging Face Transformers interfaces, either from the Hugging Face Hub or from a local path. The sections below cover the model, tokenizer, and generation-config loading calls.

Load Model

Load a Qwen model for causal language modeling:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    resume_download=True
).eval()

Parameters

model_name_or_path (str, required)
  Model checkpoint name (e.g., "Qwen/Qwen-7B-Chat") or local path to a model directory.

device_map (str | dict, default: None)
  Device allocation strategy:
  • "auto": automatically distribute the model across available devices
  • "cpu": load the model on CPU only
  • "cuda": load the model on GPU
  • a dictionary mapping module names to specific devices

trust_remote_code (bool, default: False)
  Allow execution of custom modeling code from the model repository. Required for Qwen models.

resume_download (bool, default: False)
  Resume incomplete downloads from the Hugging Face Hub. Note: deprecated in recent versions of huggingface_hub, where downloads always resume.

torch_dtype (torch.dtype, default: None)
  Data type for model weights (e.g., torch.float16, torch.bfloat16).

low_cpu_mem_usage (bool, default: False)
  Reduce CPU memory usage during model loading (useful for large models).

quantization_config (GPTQConfig | BitsAndBytesConfig, default: None)
  Configuration for model quantization (4-bit, 8-bit).
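As a sketch of the dictionary form of device_map, the snippet below pins the embedding and the first half of the decoder blocks to GPU 0 and the rest to GPU 1. The module names follow the Qwen-7B layout ("transformer.h.<i>" for its 32 decoder blocks) as an assumption for illustration; verify them against your checkpoint with model.named_modules().

```python
# Hypothetical two-GPU split for Qwen-7B-Chat. Integer values are device
# indices; keys are module names in the checkpoint.
device_map = {
    "transformer.wte": 0,                               # token embeddings
    **{f"transformer.h.{i}": 0 for i in range(16)},     # blocks 0-15 on GPU 0
    **{f"transformer.h.{i}": 1 for i in range(16, 32)}, # blocks 16-31 on GPU 1
    "transformer.ln_f": 1,                              # final layer norm
    "lm_head": 1,                                       # output projection
}
```

Pass this dictionary as device_map=device_map in place of "auto" to control placement explicitly.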

Returns

model (AutoModelForCausalLM)
  Loaded Qwen model ready for inference or fine-tuning.

Load Tokenizer

Load the tokenizer associated with a Qwen model:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

Parameters

pretrained_model_name_or_path (str, required)
  Model checkpoint name or path. Should match the model being loaded.

trust_remote_code (bool, default: False)
  Allow execution of custom tokenizer code. Required for Qwen models.

resume_download (bool, default: False)
  Resume incomplete downloads from the Hugging Face Hub.

padding_side (str, default: "right")
  Side on which padding tokens are added:
  • "right": pad on the right (recommended for training)
  • "left": pad on the left (recommended for generation)

use_fast (bool, default: True)
  Use the fast tokenizer implementation if available.

model_max_length (int, default: None)
  Maximum sequence length for tokenization.
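The padding_side recommendation can be made concrete with a toy sketch (plain lists of token ids; pad id 0 is an assumption for illustration). With right padding, the final positions of shorter sequences are pad tokens, so batched generation would condition its next-token prediction on padding; left padding keeps every row's last position a real token.

```python
pad_id = 0
batch = [[11, 12, 13], [21, 22]]  # two toy sequences of unequal length
max_len = max(len(seq) for seq in batch)

# Left padding: pads go in front, real tokens end at the last position.
left_padded = [[pad_id] * (max_len - len(s)) + s for s in batch]
# Right padding: pads go at the end, burying the last real token.
right_padded = [s + [pad_id] * (max_len - len(s)) for s in batch]

# Every row's final position is a real token only under left padding.
assert all(row[-1] != pad_id for row in left_padded)
assert right_padded[1][-1] == pad_id
```

This is why generation code typically sets padding_side="left" before calling batched model.generate.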

Returns

tokenizer (AutoTokenizer)
  Loaded tokenizer with special tokens configured for Qwen.

Load Generation Config

Load generation configuration from a checkpoint:
from transformers.generation import GenerationConfig

config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

Parameters

pretrained_model_name (str, required)
  Model checkpoint name or path containing generation_config.json.

trust_remote_code (bool, default: False)
  Allow execution of custom configuration code.

resume_download (bool, default: False)
  Resume incomplete downloads.

Returns

config (GenerationConfig)
  Generation configuration with model-specific defaults.
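A GenerationConfig can also be built or adjusted locally rather than loaded from a checkpoint; fields are plain attributes. The sampling values below are illustrative only, not Qwen's shipped defaults.

```python
from transformers.generation import GenerationConfig

# Construct a config locally (no download); keyword names are standard
# transformers generation parameters.
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=True,
    top_p=0.8,
    temperature=0.7,
)
# Fields can be overridden attribute-style after loading or construction.
generation_config.repetition_penalty = 1.1
```

The same attribute-style overrides work on a config returned by GenerationConfig.from_pretrained.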

Complete Loading Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    resume_download=True
).eval()

# Load generation config
generation_config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    resume_download=True
)

# Assign to model
model.generation_config = generation_config
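With all three objects loaded, inference follows the usual tokenize → generate → decode pattern. The helper below is a sketch; the function name and structure are ours, not part of the Qwen or Transformers API. It moves inputs to the model's device when the tokenizer output supports it (BatchEncoding does).

```python
def generate_reply(model, tokenizer, prompt, generation_config=None):
    """Tokenize a prompt, run generate, and decode the first sequence."""
    inputs = tokenizer(prompt, return_tensors="pt")
    if hasattr(inputs, "to"):  # BatchEncoding supports .to(device)
        inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, generation_config=generation_config)
    # Decode the first (and only) sequence in the batch.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage with the objects loaded above: generate_reply(model, tokenizer, "Hello!", generation_config). Note the decoded string includes the prompt; slice output_ids past the input length if you want only the completion.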

CPU-Only Loading

For environments without GPU:
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True,
    resume_download=True
).eval()

Quantized Loading

Load a pre-quantized GPTQ checkpoint for reduced memory usage. Note the Int4 model ID: passing a GPTQConfig when loading the non-quantized "Qwen/Qwen-7B-Chat" would trigger on-the-fly quantization, which requires a calibration dataset. Here the config only overrides settings (e.g., disabling the exllama kernels) for an already-quantized model:
from transformers import GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=GPTQConfig(bits=4, disable_exllama=True)
).eval()
