
Base Pretrained Models

Qwen base models are pretrained language models that serve as the foundation for various downstream tasks. These models have been trained on massive amounts of multilingual text data and are ideal for further fine-tuning.

Model Architecture

Qwen models use a transformer-based decoder-only architecture with several optimizations:

Core Components

  • Architecture Type: Decoder-only Transformer (similar to LLaMA)
  • Positional Encoding: Rotary Position Embedding (RoPE)
  • Activation Function: SwiGLU (instead of ReLU)
  • Normalization: RMSNorm (instead of LayerNorm)
  • Attention Mechanism: Flash Attention 2 support
  • Embeddings: Untied input and output embeddings
  • Bias: No biases except for QKV in attention

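To make RoPE concrete, the sketch below rotates consecutive (even, odd) channel pairs of a vector by position-dependent angles. The function name and the small dimension are illustrative; the actual Qwen implementation applies this over full attention-head tensors.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each (even, odd) channel pair of `vec` by the angle
    theta_i = pos / base**(i/d), encoding the absolute position `pos`
    while keeping relative offsets recoverable from dot products."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Position 0 leaves the vector unchanged; any position preserves its norm.
print(rope_rotate([1.0, 0.0, 0.0, 1.0], pos=0))
# → [1.0, 0.0, 0.0, 1.0]
```

Because each pair is a pure rotation, vector norms are unchanged, and the dot product of two rotated vectors depends only on their relative position offset.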
Technical Specifications

Model Configuration
  • Parameters: 7 billion
  • Layers: 32
  • Hidden Dimension: 4096
  • Attention Heads: 32
  • Vocabulary Size: 151,851 tokens
  • Context Length: 2048 at pretraining; extensible to 8192 and beyond (up to 32K) with training-free methods
  • Training Tokens: 2.4 trillion
Training Details
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁶)
  • Batch Size: 2048 sequences (4M+ tokens per step)
  • Learning Rate: Peak 3×10⁻⁴, cosine schedule
  • Warm-up Steps: 2000
  • Weight Decay: 0.1
  • Gradient Clipping: 1.0
  • Precision: BFloat16 mixed precision
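
The schedule above can be sketched as a small function: linear warm-up to the 3×10⁻⁴ peak over 2000 steps, then cosine decay. The total step count and the 10%-of-peak decay floor are assumptions for illustration, not values stated on this page.

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=500_000, floor_ratio=0.1):
    """Linear warm-up to `peak`, then cosine decay toward floor_ratio * peak."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    floor = peak * floor_ratio
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at step 2000 it reaches the peak exactly, and at the final step it has decayed to the floor.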

Training Data

Qwen models are pretrained on a diverse multilingual dataset:

Data Sources

  • Web Documents: Publicly available web content
  • Code Files: Programming repositories and code samples
  • Mathematical Data: Including RFT data from gsm8k-ScRel
  • Total Volume: 2.2T to 3.0T tokens (model-dependent)

Language Coverage

Primary focus on Chinese and English, with support for multilingual content including Japanese, Korean, Arabic, Thai, Vietnamese, Indonesian, Polish, Russian, Dutch, Portuguese, Italian, German, Spanish, and French.

Data Processing

  • Quality Filtering: Ensemble of models to exclude low-quality content
  • Safety Filtering: NSFW content removal
  • Deduplication: Global fuzzy deduplication
  • Optimization: Multiple ablation experiments to optimize data mix
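
The page says only that deduplication is "global fuzzy deduplication". One standard way to implement fuzzy matching at scale is MinHash over character shingles; the sketch below is a generic illustration of that idea (shingle size, hash count, and the use of MD5 are all arbitrary choices here), not Qwen's actual pipeline.

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_hashes):
        best = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        sig.append(best)
    return sig

def estimated_jaccard(a, b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Near-duplicate documents share most shingles, so most signature slots agree and the estimated similarity is high; a threshold on that estimate decides which copies to drop.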

Benchmark Performance

Chinese Language Understanding

C-Eval (5-shot) - Comprehensive evaluation of Chinese knowledge and reasoning across 52 subjects:

| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA2-7B | 32.5 | - | - | - | - |
| ChatGLM2-6B | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 53.4 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 63.5 | 52.8 | 74.1 | 63.1 | 55.2 |
| Qwen-14B | 72.1 | - | - | - | - |
| Qwen-72B | 83.3 | - | - | - | - |

English Language Understanding

MMLU (5-shot) - Comprehensive English evaluation across 57 academic subjects:
| Model | Average | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| LLaMA2-7B | 46.8 | 36.4 | 51.2 | 42.9 | 52.2 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Qwen-7B | 58.2 | 47.6 | 65.9 | 51.5 | 64.7 |
| Qwen-14B | 66.3 | - | - | - | - |
| Qwen-72B | 77.4 | - | - | - | - |

Coding Capability

HumanEval (Pass@1) - Python coding benchmark:
| Model | Pass@1 |
|---|---|
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| Qwen-7B | 29.9 |
| Qwen-14B | 32.3 |
| Qwen-72B | 35.4 |

Mathematical Reasoning

GSM8K (8-shot) - Grade school math problems:
| Model | Accuracy |
|---|---|
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 16.7 |
| Baichuan-7B | 9.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| Qwen-7B | 51.7 |
| Qwen-14B | 61.3 |
| Qwen-72B | 78.9 |

MATH (4-shot) - Advanced mathematical problem solving:

| Model | Accuracy |
|---|---|
| LLaMA2-7B | 3.3 |
| InternLM-7B | 6.3 |
| ChatGLM2-6B | 6.5 |
| Qwen-7B | 11.6 |
| Qwen-14B | 24.8 |
| Qwen-72B | 35.2 |

Translation

WMT22 (5-shot BLEU) - Translation quality:
| Model | Average | zh→en | en→zh |
|---|---|---|---|
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |

Long Context Support

Qwen base models support extended context lengths through training-free methods:

Extension Techniques

  • Dynamic NTK Interpolation: Dynamically adjusts rotary position embeddings to support longer sequences without additional training. Configuration: set use_dynamic_ntk=true in config.json.
  • LogN Attention Scaling: Applies logarithmic scaling to attention scores for improved long-context performance. Configuration: set use_logn_attn=true in config.json.
  • Window (Local) Attention: Reduces memory usage by limiting attention to local windows for very long sequences. Configuration: enabled via inference parameters.
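
The first two techniques can be sketched numerically: dynamic NTK enlarges the RoPE base once the input exceeds the training length so rotation frequencies "stretch" to cover the longer range, while LogN scaling grows attention scores logarithmically with position. These follow the commonly cited formulations; the exact expressions in Qwen's modeling code may differ, and the head dimension below is an assumption.

```python
import math

def ntk_scaled_base(seq_len, train_len=2048, base=10000.0, dim=128):
    """NTK-aware RoPE base adjustment: stretch rotation frequencies so a
    longer sequence spans the same angular range as the training window."""
    if seq_len <= train_len:
        return base
    scale = seq_len / train_len
    return base * scale ** (dim / (dim - 2))

def logn_scale(pos, train_len=2048):
    """LogN attention scaling factor for a query at position `pos`:
    1.0 inside the training window, log(pos)/log(train_len) beyond it."""
    if pos <= train_len:
        return 1.0
    return math.log(pos) / math.log(train_len)
```

Within the 2048-token training window both functions are the identity, which is why the perplexity columns at 1024 and 2048 below are unchanged by these techniques.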

Perplexity on arXiv (Qwen-7B)

| Method | 1024 | 2048 | 4096 | 8192 | 16384 |
|---|---|---|---|---|---|
| Baseline | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |

With all techniques enabled, Qwen-7B can extend from 2048 to 16384 tokens with minimal perplexity degradation.

Tokenizer

Qwen uses a custom tokenizer optimized for multilingual efficiency:

Features

  • Library: Based on tiktoken (OpenAI’s tokenizer)
  • Vocabulary Size: 151,851 tokens
  • Number Encoding: Single-digit segmentation for better arithmetic
  • Efficiency: High compression rate for Chinese, English, and code
  • Multilingual: Native support for 100+ languages without vocabulary expansion
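
Single-digit number segmentation can be illustrated with a tiny pre-tokenization pass that breaks digit runs into individual characters, so "2048" never becomes one opaque token. This mimics the behavior described above; it is not the actual Qwen tokenizer code.

```python
import re

def split_digits(text):
    """Split text so every digit stands alone; non-digit runs stay intact."""
    return re.findall(r"\d|\D+", text)

print(split_digits("answer: 2048"))
# → ['answer: ', '2', '0', '4', '8']
```

Keeping each digit as its own token gives the model a consistent place-value view of numbers, which tends to help arithmetic compared with merging frequent digit sequences into single tokens.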

Compression Efficiency

Compared to XLM-R (baseline=1.0), Qwen achieves high compression rates:
  • Chinese: ~1.5× more efficient
  • English: ~1.3× more efficient
  • Code: ~1.4× more efficient
  • Other Languages: 1.2-1.5× more efficient (Thai, Hebrew, Arabic, Korean, Vietnamese, Japanese, etc.)

Usage Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Generate text
# Few-shot prompt (Chinese): "The capital of Mongolia is Ulaanbaatar /
# The capital of Iceland is Reykjavik / The capital of Ethiopia is"
inputs = tokenizer(
    '蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是',
    return_tensors='pt'
)
inputs = inputs.to(model.device)

pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Output: 蒙古国的首都是乌兰巴托(Ulaanbaatar)
#         冰岛的首都是雷克雅未克(Reykjavik)
#         埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...

Hardware Requirements

Inference Memory Usage (Generating 2048 tokens)

| Precision | GPU Memory | Speed (tokens/s) |
|---|---|---|
| BF16 | 4.23GB | 54.09 |
| Int8 | 3.48GB | 55.56 |
| Int4 | 2.91GB | 71.07 |

Profiling conducted on A100-SXM4-80G GPU with PyTorch 2.0.1, CUDA 11.8, and Flash Attention 2.

Model Downloads

Qwen-1.8B

🤗 Hugging Face | 🤖 ModelScope

Qwen-7B

🤗 Hugging Face | 🤖 ModelScope

Qwen-14B

🤗 Hugging Face | 🤖 ModelScope

Qwen-72B

🤗 Hugging Face | 🤖 ModelScope

Next Steps

Chat Models

Explore conversation-aligned models

Fine-tuning

Learn how to fine-tune base models

Quantization

Reduce memory usage with quantization

Model Selection

Choose the right model size
