Qwen's tokenizer is a BPE tokenizer built on the tiktoken package. It includes both regular BPE tokens and special control tokens.
Quick Start
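A minimal sketch of loading and using the tokenizer, assuming the transformers library is installed and the Qwen/Qwen-7B checkpoint on the Hugging Face Hub is reachable (trust_remote_code=True is needed because the tokenizer ships as custom code in the model repo):

```python
from transformers import AutoTokenizer

# The tokenizer is implemented as custom code in the model repository,
# hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

ids = tokenizer.encode("Hello, Qwen!")
print(ids)                    # a list of token ids
print(tokenizer.decode(ids))  # round-trips back to the original text
```

Because encoding is byte-level, the encode/decode round trip is lossless for any input string.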
Vocabulary Overview
Qwen’s tokenizer contains 151,851 tokens in total:
- 151,643 regular tokens - BPE tokens learned from UTF-8 byte sequences
- 208 special tokens - control tokens for specific model functions
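These counts are internally consistent, which is easy to sanity-check (the breakdown of the 208 special tokens into 3 used tokens and 205 extras is detailed later in this document):

```python
regular = 151_643
used_special = 3      # <|endoftext|>, <|im_start|>, <|im_end|>
extra_special = 205   # <|extra_0|> through <|extra_204|>

special = used_special + extra_special
total = regular + special

print(special)  # 208
print(total)    # 151851
```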
Design Philosophy
The tokenizer is optimized for:
- Efficient Chinese, English, and code encoding
- Multilingual support without vocabulary expansion
- Number segmentation by single digits
- No unknown tokens - falls back to single bytes if needed
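A toy illustration of the digit-splitting behavior, using a simplified pre-tokenization regex (this is a stand-in for illustration, not Qwen's actual pattern):

```python
import re

# Simplified stand-in for the pre-tokenization pattern:
# letters group into words, but each digit stays separate.
PAT = re.compile(r"[A-Za-z]+|\d|\S")

print(PAT.findall("price 12345"))  # ['price', '1', '2', '3', '4', '5']
```

Splitting numbers into single digits keeps numeric tokenization uniform regardless of how often a particular number appeared in training data.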
Regular Tokens
Regular tokens are BPE tokens learned from byte sequences encoded using UTF-8.

UTF-8 Byte Encoding
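Operating on bytes means every string has an encoding path: any character, however rare, decomposes into UTF-8 bytes, each covered by the byte-level fallback. A quick illustration in plain Python:

```python
text = "中"               # a single Chinese character
data = text.encode("utf-8")
print(list(data))         # [228, 184, 173]: U+4E2D is three bytes in UTF-8
```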
Unlike tokenizers that operate on Unicode codepoints, Qwen operates directly on UTF-8 bytes.

Handling Decoding Errors
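The error-handling options mirror Python's own bytes.decode; a sketch of the two common choices on a byte string ending in a truncated UTF-8 sequence:

```python
bad = b"Hello \xe4\xb8"  # ends with an incomplete multi-byte sequence

# "replace" substitutes a replacement character for the bad bytes.
print(bad.decode("utf-8", errors="replace"))
# "ignore" silently drops them.
print(bad.decode("utf-8", errors="ignore"))   # 'Hello '
```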
You can control how UTF-8 decoding errors are handled (for example, replacing invalid sequences versus ignoring them).

Retrieving Vocabulary
The complete regular token vocabulary can be retrieved from the tokenizer (e.g., via get_vocab()).

Special Tokens
Special tokens signal specific functions to the model and don't typically appear in input text.

Used Special Tokens
Qwen-7B (base model):
- <|endoftext|> - end-of-document marker

Qwen-7B-Chat:
- <|endoftext|> - end-of-document marker
- <|im_start|> - start of a ChatML turn
- <|im_end|> - end of a ChatML turn
These tokens have fixed meanings and should not be repurposed.
Extra Special Tokens
Qwen reserves extra special tokens for custom use: <|extra_0|> through <|extra_204|> (205 tokens), available for fine-tuning and custom applications.
Retrieving Special Tokens
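Since the special tokens follow a fixed naming pattern, the full list of surface forms can be reconstructed by hand (a sketch based on the counts above; actual id lookup would go through the tokenizer API):

```python
used = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]
extras = [f"<|extra_{i}|>" for i in range(205)]  # <|extra_0|> ... <|extra_204|>

specials = used + extras
print(len(specials))  # 208
```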
BOS, EOS, PAD, UNK Tokens
Qwen does not define these tokens natively. For frameworks that require them (e.g., during fine-tuning), a common approach is to assign existing tokens such as <|endoftext|> or the extra special tokens to these roles.

Injection Attack Prevention
The Problem
What happens when special token surface forms appear in input text? There are two possibilities: (1) treat them as ordinary text and encode them with regular tokens, or (2) parse them as the corresponding special tokens.

Default Behavior
By default, Qwen now parses surface forms as special tokens (the second behavior above).
Safe Tokenization
To prevent injection attacks, use allowed_special=set():
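A toy reimplementation of the allowed_special filtering logic (not Qwen's actual code) makes the behavior concrete: a surface form becomes a special token only when explicitly allowed; otherwise it stays plain text.

```python
import re

# Real ids for Qwen's used special tokens (regular tokens occupy 0..151642).
SPECIALS = {"<|endoftext|>": 151643, "<|im_start|>": 151644, "<|im_end|>": 151645}
_PAT = re.compile("(" + "|".join(re.escape(s) for s in SPECIALS) + ")")

def split_specials(text, allowed=frozenset()):
    """Parse surface forms as special tokens only if listed in `allowed`."""
    out = []
    for piece in _PAT.split(text):
        if piece in SPECIALS and piece in allowed:
            out.append(("special", SPECIALS[piece]))
        elif piece:  # skip empty strings produced by re.split
            out.append(("text", piece))
    return out

# Safe: allowed_special=set() keeps the surface form as plain text.
print(split_specials("hi<|endoftext|>", allowed=set()))
# [('text', 'hi'), ('text', '<|endoftext|>')]

# Unsafe default-like behavior: the surface form becomes the control token.
print(split_specials("hi<|endoftext|>", allowed={"<|endoftext|>"}))
# [('text', 'hi'), ('special', 151643)]
```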
Fine-Grained Control
Allow specific special tokens by passing a set of their surface forms to allowed_special.

Backward Compatibility
The unsafe default is equivalent to allowing every special token's surface form to be parsed as a special token.

Vocabulary Expansion
Process
You cannot directly add words to a BPE vocabulary; you must learn new merges:
1. Prepare a vocabulary file qwen_extra_vocab.txt with one token<TAB>frequency pair per line (frequencies are needed for the BPE merge computation).
2. Learn the new merges against the base vocabulary:
- Base file: qwen.tiktoken
- Start index: 151,851 (the default; alternatively, override inactive control tokens)
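A sketch of producing the vocabulary file in the required token<TAB>frequency format (the tokens and frequencies here are placeholders, not real corpus statistics):

```python
# Hypothetical domain terms with their corpus frequencies.
entries = [("tiktokenizer", 1200), ("千问模型", 800)]

with open("qwen_extra_vocab.txt", "w", encoding="utf-8") as f:
    for token, freq in entries:
        f.write(f"{token}\t{freq}\n")
```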
Requires tokenizer code from after 2023-10-08; older versions must manually append the content of qwen_extra.tiktoken to qwen.tiktoken.

Caveats
Unicode Boundary Issues
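A plain-Python illustration of the failure mode, assuming a token boundary falls inside a multi-byte character:

```python
data = "中".encode("utf-8")      # b'\xe4\xb8\xad': one codepoint, three bytes
head, tail = data[:2], data[2:]  # an invalid merge could split here

# Each half alone is not valid UTF-8; decoding yields a replacement character.
print(head.decode("utf-8", errors="replace"))
print((head + tail).decode("utf-8"))  # '中': the whole sequence is fine
```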
Because Qwen operates on UTF-8 bytes rather than Unicode codepoints, merges learned from limited training data may not respect codepoint boundaries. The problem: invalid merges across codepoint boundaries may occur, so a token can end in the middle of a multi-byte character.

Context-Dependent Tokenization
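A toy BPE with invented merges (not Qwen's actual merge table) shows how the same substring can tokenize differently depending on its context:

```python
# Merge table: pair -> priority (lower priority merges first).
MERGES = {("a", "b"): 0, ("b", "c"): 1}

def apply_bpe(chars, merges=MERGES):
    """Greedy BPE: repeatedly apply the highest-priority adjacent merge."""
    tokens = list(chars)
    while True:
        best = None
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return tokens
        i, _ = best
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

print(apply_bpe("abc"))  # ['ab', 'c']: the substring "bc" is split up
print(apply_bpe("xbc"))  # ['x', 'bc']: here "bc" survives as one token
```

The substring "bc" exists as a token, yet whether it appears in the output depends entirely on what precedes it.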
BPE is distribution-based, not semantic: the same substring may be split into different tokens depending on its context, and this applies even to strings that exist as single tokens in the vocabulary.
Multilingual Efficiency
While optimized for Chinese, English, and code, Qwen also achieves high compression rates for:
- Thai (th)
- Hebrew (he)
- Arabic (ar)
- Korean (ko)
- Vietnamese (vi)
- Japanese (ja)
- Turkish (tr)
- Indonesian (id)
- Polish (pl)
- Russian (ru)
- Dutch (nl)
- Portuguese (pt)
- Italian (it)
- German (de)
- Spanish (es)
- French (fr)
Best Practices
When to use allowed_special=set()
Use injection prevention when:
- Processing untrusted user input
- Building chatbots or interactive systems
- Handling code or structured text that might contain special token patterns
When to expand vocabulary
Consider vocabulary expansion when:
- Fine-tuning on domain-specific text with unique terminology
- Working with languages that would benefit from better tokenization
- You have sufficient training data for the new tokens (1000+ examples)

Avoid vocabulary expansion when:
- You're only doing inference
- You have limited training data
- Your use case works well with existing tokenization
Handling incomplete UTF-8 sequences
During generation, incomplete UTF-8 sequences may appear:
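When streaming generated tokens, the bytes of a single codepoint may arrive across several decoding steps. Python's incremental UTF-8 decoder buffers incomplete sequences instead of failing; a sketch of the pattern, independent of the tokenizer API:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# The bytes of "中" arriving in two chunks, as they might across decode steps.
print(repr(decoder.decode(b"\xe4")))      # '': incomplete, buffered internally
print(repr(decoder.decode(b"\xb8\xad")))  # '中': the sequence completes
```

Buffering until a codepoint completes avoids emitting spurious replacement characters mid-stream.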