## Installation & Environment
### Flash attention installation fails
- Only NVIDIA GPUs with the Turing, Ampere, Ada, or Hopper architectures are supported
- Examples: H100, A100, RTX 3090, T4, RTX 2080
- Not supported on older architectures (Pascal, Maxwell, etc.)
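The architecture list above maps to CUDA compute capability: Turing is sm_75, so any GPU at or above capability (7, 5) qualifies. A minimal sketch of that gate (the function name is ours; in practice the major/minor pair would come from `torch.cuda.get_device_capability()`):

```python
# Compute-capability gate for Flash Attention support.
# Turing = sm_75, Ampere = sm_80/86, Ada = sm_89, Hopper = sm_90;
# Pascal (sm_60/61) and older fall below the threshold.
def flash_attn_capable(major: int, minor: int) -> bool:
    return (major, minor) >= (7, 5)

print(flash_attn_capable(7, 5))  # T4 / RTX 2080 (Turing): True
print(flash_attn_capable(8, 0))  # A100 (Ampere): True
print(flash_attn_capable(6, 1))  # GTX 10-series (Pascal): False
```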
### Which version of transformers should I use?
Use `transformers>=4.32.0`. This version includes all necessary features for Qwen models; older versions may cause compatibility issues.
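To confirm your environment meets that floor, compare `transformers.__version__` against it. A small self-contained checker (the helper name is ours; it assumes dotted numeric versions, optionally with a suffix like `.dev0`):

```python
# Check an installed version string against a minimum (here 4.32.0).
# In practice, pass transformers.__version__ as the first argument.
def meets_min(version: str, minimum=(4, 32, 0)) -> bool:
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts) >= minimum

print(meets_min("4.32.0"))  # True
print(meets_min("4.31.2"))  # False
```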
### I downloaded the code and checkpoints but can't load the model locally
- **Update to the latest code.**
- **Verify all checkpoint files are downloaded:**
  - Check that all sharded checkpoint files (`.safetensors` or `.bin`) are present
  - Verify that file sizes match the expected sizes
- **Ensure git-lfs is installed:** run `git lfs install` and re-pull the repository.
- **Check that `trust_remote_code=True` is set** when loading the model and tokenizer.
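The "all shards present" check can be automated against the index file that sharded Hugging Face checkpoints ship with (`model.safetensors.index.json`, whose `weight_map` names every shard file). A sketch, with a function name of our own:

```python
import json
from pathlib import Path

def missing_shards(checkpoint_dir: str,
                   index_name: str = "model.safetensors.index.json") -> list:
    """Return shard filenames listed in the index but absent on disk."""
    ckpt = Path(checkpoint_dir)
    index = json.loads((ckpt / index_name).read_text())
    expected = set(index["weight_map"].values())  # every shard the index references
    return sorted(f for f in expected if not (ckpt / f).exists())
```

If this returns a non-empty list, those are the files git-lfs failed to fetch.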
### qwen.tiktoken not found
`qwen.tiktoken` is the tokenizer merges file; the model cannot work without it.

**Problem:** If you cloned the repository without git-lfs, this file won’t download properly.

**Solution:** Install git-lfs and re-fetch the large files (for example, `git lfs install` followed by `git lfs pull` inside the repository), or download `qwen.tiktoken` manually from the model hub.

### transformers_stream_generator/tiktoken/accelerate not found
Install the missing packages with `pip install -r requirements.txt`. The requirements.txt file is available at:
https://github.com/QwenLM/Qwen/blob/main/requirements.txt

## Demo & Inference
### Is there a demo? CLI or Web UI?

Yes. The repository provides both: `web_demo.py` launches a Web UI and `cli_demo.py` runs an interactive command-line demo.
### Can I use CPU only?

Yes, but CPU-only inference is much slower than GPU inference, so it is practical mainly for smaller models or non-interactive workloads.
### Does Qwen support streaming?
Yes. Use the `chat_stream()` function; see `modeling_qwen.py` for the full implementation.
### Gibberish output when using chat_stream()

Streaming decodes the output incrementally, and a single token can be a subword fragment or part of a multi-byte UTF-8 character, so decoding tokens in isolation can produce garbled text. Decode from the accumulated output and print only the newly added suffix of each chunk.
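Assuming `chat_stream()` yields the decoded response-so-far (as the repository's demo scripts do), the consumer should print only the new suffix of each chunk. A self-contained sketch with a stand-in stream:

```python
# fake_stream stands in for model.chat_stream(tokenizer, query, history=...).
def fake_stream():
    for partial in ["Hel", "Hello", "Hello, wor", "Hello, world!"]:
        yield partial

def stream_deltas(stream):
    """Yield only the newly generated text from a cumulative stream."""
    seen = 0
    for partial in stream:
        yield partial[seen:]
        seen = len(partial)

print("".join(stream_deltas(fake_stream())))  # Hello, world!
```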
### Generation is not related to the instruction

Check that you loaded a Chat model (e.g. Qwen-7B-Chat) rather than a base model; base models are trained for completion, not instruction following. Also make sure the checkpoint's own generation config and prompt format are being used.
### Is quantization supported?

Yes. Official quantized Chat models are provided:
- Qwen-*B-Chat-Int4
- Qwen-*B-Chat-Int8

These quantized models offer:
- Reduced memory usage
- Faster inference
- Minimal performance degradation
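The memory saving is easy to estimate from bits per parameter (weights only; activations and the KV cache come on top). Illustrative back-of-envelope arithmetic, not measured figures:

```python
def weight_gib(n_params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 2**30

print(round(weight_gib(7, 16), 1))  # BF16 7B weights: ~13.0 GiB
print(round(weight_gib(7, 4), 1))   # Int4 7B weights: ~3.3 GiB
```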
### Slow performance when processing long sequences

Update to the latest code, which includes:
- Flash Attention 2 support
- Improved attention mechanisms
- Better memory management
### Unsatisfactory performance on long sequences
Check `use_dynamic_ntk` and `use_logn_attn` in `config.json`: both are `true` by default. If they’re `false`, enable them for better long-context performance.

What they do:
- `use_dynamic_ntk`: NTK-aware interpolation for position embeddings
- `use_logn_attn`: LogN attention scaling
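A small helper to flip both flags on in a local checkpoint's `config.json` (the function name is ours; the flag names are the two above):

```python
import json
from pathlib import Path

def enable_long_context(config_path: str) -> dict:
    """Set the long-context flags to true in a checkpoint's config.json."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    cfg["use_dynamic_ntk"] = True
    cfg["use_logn_attn"] = True
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```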
## Finetuning
### Can Qwen support SFT or RLHF?

SFT is supported out of the box: the repository provides fine-tuning scripts with three modes:
- Full-parameter fine-tuning - Update all parameters
- LoRA - Low-rank adaptation, efficient training
- Q-LoRA - Quantized LoRA, even more memory-efficient
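The reason LoRA and Q-LoRA are memory-efficient: instead of updating a full d_out x d_in weight, LoRA trains two low-rank factors totalling r(d_in + d_out) parameters. Illustrative arithmetic with assumed dimensions, not Qwen-specific numbers:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted weight matrix:
    B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_in + d_out)

full = 4096 * 4096                                # full update of one 4096x4096 matrix
lora = lora_trainable_params(4096, 4096, rank=8)  # low-rank factors only
print(full, lora, round(lora / full * 100, 2))    # 16777216 65536 0.39
```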
## Tokenizer
### bos_id/eos_id/pad_id not found
The Qwen tokenizer does not define separate `bos`/`eos`/`pad` tokens; training uses only `<|endoftext|>` as the separator and padding token. For most use cases, set these IDs to the ID of `<|endoftext|>`.

## Docker
### Docker image download is very slow
## Performance & Optimization
### How can I speed up inference?
- **Use quantized models (Int4/Int8)**
  - Faster than BF16
  - Lower memory usage
  - Minimal quality loss
- **Enable Flash Attention**
  - Requires a compatible GPU
  - Significant speedup for longer sequences
- **Use vLLM for deployment**
  - Optimized inference engine
  - Better batching
  - Higher throughput
- **Batch inference**
  - Process multiple requests together
  - 40% speedup with Flash Attention enabled
- **KV cache quantization**
  - Reduces memory for longer sequences
  - Allows larger batch sizes
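The KV-cache point can be quantified: the cache holds keys and values for every layer and head at each position, so its size is 2 x layers x heads x head_dim x seq_len x bytes per element. The layer/head numbers below are illustrative assumptions, not official Qwen figures:

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    # Factor 2 covers keys and values.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

bf16 = kv_cache_bytes(32, 32, 128, 8192, bytes_per_elem=2)
int8 = kv_cache_bytes(32, 32, 128, 8192, bytes_per_elem=1)
print(bf16 // 2**20, int8 // 2**20)  # 4096 2048 (MiB for one sequence)
```

Halving the bytes per element halves the cache, which is what frees room for larger batches.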
### Out of memory during training/inference

During inference:
- Use quantized models (Int4/Int8)
- Enable KV cache quantization
- Reduce batch size
- Use gradient checkpointing
- Switch to a smaller model variant

During training:
- Use Q-LoRA instead of LoRA or full fine-tuning
- Reduce batch size and increase gradient accumulation
- Use DeepSpeed ZeRO optimization
- Train on multiple GPUs
- Reduce sequence length
- Enable gradient checkpointing
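"Reduce batch size and increase gradient accumulation" works because the effective batch the optimizer sees is their product. Generic training-loop arithmetic (not a Qwen-specific API):

```python
def effective_batch(per_device_batch: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_device_batch * accum_steps * n_gpus

# Halving the per-device batch while doubling accumulation keeps it constant:
print(effective_batch(8, 2), effective_batch(4, 4))  # 16 16
```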
## Model Selection
### Which model size should I choose?

**Qwen-1.8B**:
- Edge devices
- Low-resource scenarios
- Fast inference needed
- Simple tasks

**Qwen-7B**:
- General use cases
- Good balance of quality and speed
- Single GPU deployment (RTX 3090/4090)
- Most popular choice

**Qwen-14B**:
- Better performance needed
- More complex tasks
- A100 40GB available

**Qwen-72B**:
- Best quality
- Complex reasoning tasks
- Research applications
- Multiple A100 GPUs available
### Base model vs Chat model?

Use the **base model** (e.g. Qwen-7B) for:
- Completion tasks
- Further pretraining
- Custom fine-tuning from scratch
- Research on base capabilities

Use the **Chat model** (e.g. Qwen-7B-Chat) for:
- Conversational AI
- Instruction following
- Q&A systems
- Chat applications
- Tool usage
- Most practical applications
## Common Errors
### trust_remote_code error
Pass `trust_remote_code=True` when loading Qwen models; the model and tokenizer code ships with the checkpoint repository rather than inside the transformers library, so loading fails without it.

### Pydantic version conflicts with DeepSpeed
There is a known incompatibility between `pydantic>=2.0` and some DeepSpeed versions.

**Solution:** Pin pydantic to a 1.x release (for example, `pip install "pydantic<2.0"`), or upgrade DeepSpeed to a version that supports pydantic 2.

### ValueError: Tokenizer class QWenTokenizer does not exist
This error can appear with `peft>=0.8.0`, which attempts to load the tokenizer from the adapter directory.

Solutions:
- Downgrade peft to a version below 0.8.0
- Or move tokenizer files: temporarily move the tokenizer files elsewhere when loading with peft 0.8.0+
## Still Need Help?
If your question isn’t answered here:

- Check the troubleshooting guide: Troubleshooting
- Search existing issues: GitHub Issues
- Open a new issue: Provide details about your environment, code, and error messages
- Join the community:
  - Discord
  - WeChat (see main README)