This guide covers the hardware requirements for running Qwen models across different sizes, precision levels, and use cases.

Minimum Requirements

Software Prerequisites

  • Python: 3.8 and above
  • PyTorch: 1.12 and above (2.0+ recommended)
  • Transformers: 4.32.0 and above
  • CUDA: 11.4 and above (for GPU users)
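The version floors above can be checked programmatically. A minimal sketch using only the standard library; the helper names are ours, and the minimums mirror the list above:

```python
# Minimal sketch: check installed package versions against the minimums above.
# Uses only the standard library; build/pre-release suffixes are ignored.
from importlib.metadata import PackageNotFoundError, version

MINIMUMS = {"torch": (1, 12), "transformers": (4, 32, 0)}

def parse_version(text):
    """Turn '2.0.1' (or '2.0.1+cu118') into (2, 0, 1)."""
    parts = []
    for piece in text.split("."):
        num = ""
        for ch in piece:
            if not ch.isdigit():
                break
            num += ch
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

def meets_minimum(package, minimum):
    """True if `package` is installed and its version is at least `minimum`."""
    try:
        return parse_version(version(package)) >= minimum
    except PackageNotFoundError:
        return False
```

Tuple comparison handles mixed lengths naturally, e.g. `(1, 11, 0) < (1, 12)`.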

Optional Dependencies

Flash Attention (recommended for fp16/bf16):
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
Flash Attention 2 is now supported and provides significant speed improvements. It requires NVIDIA GPUs with Turing, Ampere, Ada, or Hopper architecture (e.g., H100, A100, RTX 3090, T4, RTX 2080).
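Before relying on the speedup, it can help to confirm the build succeeded. A small sketch (the helper name is ours) that checks whether the package is visible without importing it, since importing can itself fail on unsupported hardware:

```python
# Sketch: report whether the flash_attn package is installed, without
# importing it (a full import may fail at runtime on unsupported GPUs).
import importlib.util

def flash_attn_installed() -> bool:
    """True if a flash_attn distribution is visible to this interpreter."""
    return importlib.util.find_spec("flash_attn") is not None
```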

GPU Requirements by Model Size

Qwen-1.8B

| Precision | Inference Memory | Minimum GPU | Finetuning (Q-LoRA) | Generating 2048 Tokens (Int4) |
|---|---|---|---|---|
| BF16 | ~4.23GB | RTX 3060 Ti | 5.8GB | 2.9GB |
| Int8 | ~3.48GB | RTX 3060 | - | - |
| Int4 | ~2.91GB | GTX 1660 | - | 2.9GB |
Recommended GPU: RTX 3060 12GB or better

Qwen-7B

| Precision | Inference Memory | Minimum GPU | Finetuning (Q-LoRA) | Generating 2048 Tokens (Int4) |
|---|---|---|---|---|
| BF16 | ~16.99GB | RTX 3090 | 11.5GB | 8.2GB |
| Int8 | ~11.20GB | RTX 3080 | - | - |
| Int4 | ~8.21GB | RTX 3070 | - | 8.2GB |
Recommended GPU: RTX 3090 24GB, RTX 4090 24GB, or A100 40GB

Qwen-14B

| Precision | Inference Memory | Minimum GPU | Finetuning (Q-LoRA) | Generating 2048 Tokens (Int4) |
|---|---|---|---|---|
| BF16 | ~30.15GB | A100 40GB | 18.7GB | 13.0GB |
| Int8 | ~18.81GB | RTX 3090 | - | - |
| Int4 | ~13.01GB | RTX 3090 | - | 13.0GB |
Recommended GPU: A100 40GB or A100 80GB

Qwen-72B

| Precision | Inference Memory | Minimum GPU | Finetuning (Q-LoRA) | Generating 2048 Tokens (Int4) |
|---|---|---|---|---|
| BF16 | ~144.69GB | 2× A100 80GB | 61.4GB | 48.9GB |
| Int8 | ~81.27GB | 2× A100 80GB | - | - |
| Int4 | ~48.86GB | A100 80GB | - | 48.9GB |
Recommended GPU: 2× A100 80GB or 4× A100 40GB
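As a rule of thumb, the inference-memory figures in these tables track parameter count times bytes per parameter, plus runtime overhead. A back-of-the-envelope estimator (our own helper, not part of any Qwen tooling):

```python
# Rough weight-only memory estimate: parameters × bytes per parameter.
# Real usage (see the tables above) is higher, since activations and the
# KV cache add overhead on top of the weights.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gib(billions_of_params: float, precision: str) -> float:
    """Approximate weight-only memory in GiB."""
    return billions_of_params * 1e9 * BYTES_PER_PARAM[precision] / 2**30
```

For Qwen-7B in BF16 this gives roughly 13 GiB of weights, against the ~17GB total inference memory observed above.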

Inference Performance

Benchmarked on A100-SXM4-80G GPU with PyTorch 2.0.1, CUDA 11.8, Flash Attention 2:
| Model Size | Quantization | Speed (Tokens/s) | GPU Memory |
|---|---|---|---|
| 1.8B | BF16 | 54.09 | 4.23GB |
| 1.8B | Int8 | 55.56 | 3.48GB |
| 1.8B | Int4 | 71.07 | 2.91GB |
| 7B | BF16 | 40.93 | 16.99GB |
| 7B | Int8 | 37.47 | 11.20GB |
| 7B | Int4 | 50.09 | 8.21GB |
| 14B | BF16 | 32.22 | 30.15GB |
| 14B | Int8 | 29.28 | 18.81GB |
| 14B | Int4 | 38.72 | 13.01GB |
| 72B | BF16 | 8.48 | 144.69GB (2×A100) |
| 72B | Int8 | 9.05 | 81.27GB (2×A100) |
| 72B | Int4 | 11.32 | 48.86GB |
| 72B + vLLM | BF16 | 17.60 | 2×A100 |
Inference speed is averaged over encoded and generated tokens.
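The averaging can be made concrete: speed is the total tokens handled (prompt plus generated) divided by wall-clock time. A small illustration (the function name is ours; the token counts in the example are illustrative, not the benchmark's exact settings):

```python
def avg_tokens_per_second(n_prompt_tokens, n_generated_tokens, elapsed_s):
    """Speed averaged over encoded and generated tokens, as in the table above."""
    return (n_prompt_tokens + n_generated_tokens) / elapsed_s

# Example: a 2048-token prompt plus 8192 generated tokens in 250 s
# works out to about 41 tokens/s.
```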

Finetuning Memory Requirements

Profiling was done on a single A100-SXM4-80G with CUDA 11.8, PyTorch 2.0, and Flash Attention 2, using batch size 1 and gradient accumulation 8. Table entries are GPU memory / time per iteration at the given sequence length.

Qwen-1.8B

| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|---|---|
| LoRA | 6.7G / 1.0s | 7.4G / 1.0s | 8.4G / 1.1s | 11.0G / 1.7s | 16.2G / 3.3s | 21.8G / 6.8s |
| LoRA (emb) | 13.7G / 1.0s | 14.0G / 1.0s | 14.0G / 1.1s | 15.1G / 1.8s | 19.7G / 3.4s | 27.7G / 7.0s |
| Q-LoRA | 5.8G / 1.4s | 6.0G / 1.4s | 6.6G / 1.4s | 7.8G / 2.0s | 10.2G / 3.4s | 15.8G / 6.5s |
| Full-parameter | 43.5G / 2.1s | 43.5G / 2.2s | 43.5G / 2.2s | 43.5G / 2.3s | 47.1G / 2.8s | 48.3G / 5.6s |

Qwen-7B

| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|---|---|
| LoRA | 20.1G / 1.2s | 20.4G / 1.5s | 21.5G / 2.8s | 23.8G / 5.2s | 29.7G / 10.1s | 36.6G / 21.3s |
| LoRA (emb) | 33.7G / 1.4s | 34.1G / 1.6s | 35.2G / 2.9s | 35.1G / 5.3s | 39.2G / 10.3s | 48.5G / 21.7s |
| Q-LoRA | 11.5G / 3.0s | 11.5G / 3.0s | 12.3G / 3.5s | 13.9G / 7.0s | 16.9G / 11.6s | 23.5G / 22.3s |
| Full-parameter (2× A100) | 37.7G / 2.7s | 37.7G / 2.8s | 37.7G / 3.0s | - | - | - |
| LoRA (multinode: 2 servers, 2× A100 each) | 23.0G / 2.6s | 23.0G / 2.7s | 23.0G / 2.8s | 25.1G / 5.0s | 27.1G / 9.6s | - |

Qwen-14B

| Method | 256 tokens | 512 tokens | 1024 tokens | 2048 tokens |
|---|---|---|---|---|
| LoRA | 34.0G / 1.6s | 34.0G / 1.7s | 35.2G / 3.4s | 35.1G / 6.2s |
| LoRA (emb) | 56.8G / 1.7s | 56.8G / 1.8s | 56.8G / 3.4s | 57.0G / 6.6s |
| Q-LoRA | 18.6G / 5.4s | 18.6G / 5.5s | 18.6G / 5.9s | 20.1G / 10.5s |
| Full-parameter (2× A100) | 72.5G / 4.2s | 72.5G / 4.3s | 72.5G / 4.5s | - |

Qwen-72B

| Method | GPUs | 256 tokens | 512 tokens | 1024 tokens |
|---|---|---|---|---|
| LoRA + DeepSpeed ZeRO 3 | 4× A100 | 61.1G / 4.5s | 61.1G / 4.6s | 62.9G / 5.4s |
| Q-LoRA (Int4) | 1× A100 | 50.4G / 12.4s | 50.4G / 12.8s | 51.5G / 13.9s |
“LoRA (emb)” refers to training with embedding and output layers as trainable parameters.
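For orientation, a Q-LoRA setup along the lines profiled above can be configured with the `peft` library roughly as follows. The target module names (`c_attn`, `c_proj`, `w1`, `w2`) correspond to Qwen's attention and MLP projections; treat the exact hyperparameters as illustrative, not as the official recipe:

```python
# Illustrative Q-LoRA configuration sketch (not the official finetuning script).
# Assumes peft is installed and an Int4-quantized base model is used.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                      # LoRA rank
    lora_alpha=16,             # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "w1", "w2"],
    task_type="CAUSAL_LM",
)
```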

KV Cache Quantization Impact

Memory usage with and without KV cache quantization, across different configurations:

Batch Size Scaling (Qwen-7B BF16, 1024 tokens)

| KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|---|---|---|---|---|---|---|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |

Sequence Length Scaling (Qwen-7B BF16, batch size=1)

| KV Cache | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|---|
| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| Yes | 15.0GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
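In the Qwen (v1) remote code, KV cache quantization is switched on at load time. A loading sketch, assuming the `use_cache_quantization` and `use_cache_kernel` flags exposed by the model's remote code; Flash Attention is disabled explicitly because the two features are incompatible:

```python
# Sketch: enable KV cache quantization at load time (Qwen v1 remote code).
# use_flash_attn=False because KV cache quantization and Flash Attention
# cannot be used together.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False,
).eval()
```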

Multi-GPU Configurations

Pipeline Parallelism

For models that don’t fit on a single GPU, use automatic device mapping:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
Native pipeline parallelism (via device_map="auto") has lower efficiency than dedicated serving stacks. For production, consider using vLLM with FastChat.
| Model | Recommended Configuration |
|---|---|
| Qwen-7B | Single RTX 3090/4090 (BF16) or RTX 3070 (Int4) |
| Qwen-14B | Single A100 40GB (BF16) or RTX 3090 (Int4) |
| Qwen-72B | 2× A100 80GB (BF16) or single A100 80GB (Int4) |
| Qwen-72B + vLLM | 2× A100 80GB for optimal throughput |

CPU-Only Deployment

Qwen can run on CPU, but with significantly lower performance. For efficient CPU deployment, use qwen.cpp:
  • Pure C++ implementation
  • Optimized for CPU inference
  • Supports quantization

Direct CPU Inference

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True
).eval()
Direct CPU inference is extremely slow and not recommended for production use.

Cloud and API Deployment

DashScope API

The simplest deployment option through Alibaba Cloud:
  • qwen-turbo: Faster responses
  • qwen-plus: Better performance
  • No hardware management required
  • Scalable inference
See the DashScope documentation for details.

Specialized Hardware

  • Ascend 910: supported for inference
  • Hygon DCU: supported for inference
  • x86 with OpenVINO: supported on Core™/Xeon® Scalable Processors and Arc™ GPU

Docker Deployments

Pre-built Docker images are available to simplify environment setup. See the main documentation for Docker usage.

Optimization Recommendations

Use Int4 quantization when:
  • Memory is the primary constraint
  • Inference speed is important
  • A slight quality degradation is acceptable
Use Int8 quantization when:
  • You need a balance between memory and quality
  • Memory is moderately constrained
Use BF16/FP16 when:
  • Maximum quality is required
  • Sufficient GPU memory is available
  • You are training or fine-tuning
Flash Attention 2 provides:
  • 40% speedup for batch inference
  • Lower memory consumption
  • Better scaling with sequence length
It requires a compatible GPU architecture (Ampere, Ada, or Hopper).
Enable KV cache quantization for:
  • Longer sequences (4096+ tokens)
  • Larger batch sizes
  • Memory-constrained scenarios
KV cache quantization is not compatible with Flash Attention; Flash Attention is automatically disabled when both are enabled.
For multi-node training:
  • DeepSpeed ZeRO 3 requires high inter-node communication bandwidth
  • ZeRO 2 is recommended for multinode LoRA fine-tuning
  • Test network throughput before scaling to multiple nodes
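Following the guidance above, the lowest-memory path is typically a prepackaged Int4 checkpoint. A loading sketch, assuming the GPTQ-based `-Int4` repositories on the Hugging Face Hub (these additionally require `auto-gptq` and `optimum` to be installed):

```python
# Sketch: load a GPTQ Int4 checkpoint for low-memory inference.
# Assumes auto-gptq and optimum are installed alongside transformers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
).eval()
```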

Model Specifications Summary

| Model | Release Date | Max Length | Pretrained Tokens | Min GPU Memory (Finetuning, Q-LoRA) | Min GPU Memory (Int4 Inference) |
|---|---|---|---|---|---|
| Qwen-1.8B | 2023.11.30 | 32K | 2.2T | 5.8GB | 2.9GB |
| Qwen-7B | 2023.08.03 | 32K | 2.4T | 11.5GB | 8.2GB |
| Qwen-14B | 2023.09.25 | 8K | 3.0T | 18.7GB | 13.0GB |
| Qwen-72B | 2023.11.30 | 32K | 3.0T | 61.4GB | 48.9GB |

Troubleshooting

For common hardware-related issues, see the Troubleshooting and FAQ pages.
