Qwen uses BPE (Byte Pair Encoding) tokenization on UTF-8 bytes using the tiktoken package. The tokenizer includes both regular BPE tokens and special control tokens.

Quick Start

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)

Vocabulary Overview

Qwen’s tokenizer contains 151,851 tokens total:
  • 151,643 regular tokens - BPE tokens learned from UTF-8 byte sequences
  • 208 special tokens - Control tokens for specific model functions

Design Philosophy

The tokenizer is optimized for:
  1. Efficient Chinese, English, and code encoding
  2. Multilingual support without vocabulary expansion
  3. Number segmentation by single digits
  4. No unknown tokens - falls back to single bytes if needed
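The digit-segmentation rule can be illustrated with a toy pre-tokenizer. This is a simplified sketch for illustration only, not Qwen's actual pre-tokenization regex:

```python
import re

# Toy pre-tokenization: match single digits before anything else, so a
# number is always split into individual digits rather than merged into
# multi-digit tokens. (Simplified; not Qwen's real pattern.)
PRETOKEN = re.compile(r"\d|[^\d\s]+|\s+")

def pre_tokenize(text):
    return PRETOKEN.findall(text)

print(pre_tokenize("price: 12345"))
# ['price:', ' ', '1', '2', '3', '4', '5']
```

Because `\d` appears first in the alternation, BPE merges can never combine digits into a single multi-digit token.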

Regular Tokens

Regular tokens are BPE tokens learned from byte sequences encoded using UTF-8.

UTF-8 Byte Encoding

Unlike tokenizers that operate on Unicode codepoints, Qwen operates directly on UTF-8 bytes:
>>> tokenizer.decode([51461])
' �'

>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']

>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'
During streaming generation, this may produce replacement characters (�) when a decoding boundary falls inside a multi-byte character. The text resolves once the remaining bytes arrive:
>>> tokenizer.decode([51461, 117])
' 根'

>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']

>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
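For streaming output, a safer pattern than decoding each chunk independently is Python's incremental decoder, which buffers incomplete UTF-8 sequences instead of emitting replacement characters. This is a generic stdlib sketch, independent of the Qwen tokenizer:

```python
import codecs

# Incremental decoding buffers a trailing partial multi-byte sequence
# and emits it only once the sequence is complete.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# b'\xe6\xa0' is the first two bytes of '根' (b'\xe6\xa0\xb9')
print(repr(decoder.decode(b"\xe6\xa0")))  # '' — bytes buffered, nothing emitted
print(repr(decoder.decode(b"\xb9")))      # '根' — sequence now complete
```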

Handling Decoding Errors

You can control UTF-8 decoding error handling:
# Change for a single decode call
tokenizer.decode([51461], errors="ignore")

# Change permanently
tokenizer = AutoTokenizer.from_pretrained(
    'Qwen/Qwen-7B',
    trust_remote_code=True,
    errors="ignore"
)
For more error handling options, see the Python documentation.
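The effect of the different error handlers can be compared on the same incomplete byte sequence from the example above, using plain stdlib decoding (no tokenizer required):

```python
# The same incomplete UTF-8 prefix under different error handlers.
chunk = b" \xe6\xa0"  # first two bytes of '根'

print(chunk.decode("utf-8", errors="replace"))           # ' �'
print(chunk.decode("utf-8", errors="ignore"))            # ' '
print(chunk.decode("utf-8", errors="backslashreplace"))  # ' \xe6\xa0'
```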

Retrieving Vocabulary

Access the complete regular token vocabulary:
vocab = tokenizer.get_vocab()
We do not support or recommend adding regular tokens to the vocabulary directly. See Vocabulary Expansion for the correct approach.

Special Tokens

Special tokens signal specific functions to the model and do not typically appear in ordinary input text.

Used Special Tokens

Qwen-7B (Base Model):
  • <|endoftext|> - End of document marker
Qwen-7B-Chat:
  • <|endoftext|> - End of document marker
  • <|im_start|> - Start of ChatML turn
  • <|im_end|> - End of ChatML turn
These tokens have fixed meanings and should not be repurposed.

Extra Special Tokens

Qwen reserves extra special tokens for custom use:
  • <|extra_0|> through <|extra_204|> (205 tokens)
  • Available for fine-tuning and custom applications

Retrieving Special Tokens

special_tokens = tokenizer.special_tokens

BOS, EOS, PAD, UNK Tokens

The concepts of bos, eos, unk, pad, mask, sep are not applicable to Qwen pretrained models.
For frameworks that require these tokens (e.g., during fine-tuning):
tokenizer = AutoTokenizer.from_pretrained(
    'Qwen/Qwen-7B',
    trust_remote_code=True,
    pad_token='<|endoftext|>'
)
Setting bos, eos, or unk without fine-tuning may introduce unknown behavior. Do not use <|endoftext|> as eos unless you understand the implications.

Injection Attack Prevention

The Problem

What happens when special token surface forms appear in text?
text = 'print("<|endoftext|>")'
Should the special token surface form be treated as plain text or as a control token?

Safe (treated as plain text):
ids: [1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']

Unsafe (injection vulnerability — parsed as a control token):
ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']

Default Behavior

By default, Qwen parses special token surface forms in input text as special tokens (the second, unsafe behavior above).
This matches common community practice but is potentially unsafe when handling untrusted input.

Safe Tokenization

To prevent injection attacks, use allowed_special=set():
>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], ...}

Fine-Grained Control

Allow specific special tokens:
>>> tokenizer(
...     'print("<|extra_0|>")<|endoftext|>',
...     allowed_special={'<|endoftext|>'}
... )
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], ...}
Raise errors on specific tokens:
>>> tokenizer(
...     'print("<|extra_0|>")<|endoftext|>',
...     allowed_special={'<|endoftext|>'},
...     disallowed_special=('<|extra_0|>',)
... )
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
For more information, see the tiktoken documentation.

Backward Compatibility

The unsafe default is equivalent to:
tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
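The gating logic behind allowed_special and disallowed_special can be sketched with a toy encoder. This is a hypothetical simplification: bpe_encode stands in for real byte-level BPE, the id values are illustrative, and disallowed_special is taken as a plain collection rather than also supporting "all" as real tiktoken does:

```python
import re

# Toy special-token table (ids match Qwen's for illustration only).
SPECIAL_TOKENS = {"<|endoftext|>": 151643, "<|extra_0|>": 151646}

def bpe_encode(text):
    # Stand-in for ordinary byte-level BPE: one token id per UTF-8 byte.
    return list(text.encode("utf-8"))

def encode(text, allowed_special="all", disallowed_special=()):
    allowed = set(SPECIAL_TOKENS) if allowed_special == "all" else set(allowed_special)
    # Raise on any surface form the caller explicitly disallowed.
    for tok in disallowed_special:
        if tok in text:
            raise ValueError(
                f"Encountered text corresponding to disallowed special token {tok!r}"
            )
    if not allowed:
        # Surface forms go through ordinary BPE: no injection possible.
        return bpe_encode(text)
    # Split the text on allowed special tokens, mapping each occurrence
    # to its control-token id and BPE-encoding everything else.
    pattern = "|".join(re.escape(t) for t in sorted(allowed))
    ids = []
    for piece in re.split(f"({pattern})", text):
        if piece in allowed:
            ids.append(SPECIAL_TOKENS[piece])
        elif piece:
            ids.extend(bpe_encode(piece))
    return ids

print(encode("<|endoftext|>"))                       # [151643]
print(encode("<|endoftext|>", allowed_special=set()))  # plain bytes, no 151643
```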

Vocabulary Expansion

Read carefully and understand the caveats before expanding vocabulary. Use at your own risk.

Process

You cannot directly add words to a BPE vocabulary; new merges must be learned.

1. Prepare a vocabulary file qwen_extra_vocab.txt:
我是一只猫	20
你是一只猫	10
他是一只猫	5
一只	200
一只猫	100
夸张的 比喻手法	20
Format: token<TAB>frequency (frequencies are needed for the BPE merge computation).

2. Get the base vocabulary:
  • Base file: qwen.tiktoken
  • Start index: 151,851 (default, or override inactive control tokens)
3. Learn new merges:
python add_merges.py qwen.tiktoken qwen_extra.tiktoken qwen_extra_vocab.txt
The add_merges.py script is distributed with Qwen. Example output:
WARNING - 夸张的 比喻手法 would be pre-tokenized to ['夸张的', ' 比喻手法'], cannot add
WARNING - word 一只 is already a token b'\xe4\xb8\x80\xe5\x8f\xaa', skipping
INFO - number of existing merges: 151643
INFO - number of words for expanding: 4
DEBUG - (b'\xe4\xb8\x80\xe5\x8f\xaa', b'\xe7\x8c\xab') (一只猫) selected, freq 100
DEBUG - (b'\xe5\x8f\xaa', b'\xe7\x8c\xab') (只猫) selected, freq 35
DEBUG - (b'\xe6\x98\xaf\xe4\xb8\x80', b'\xe5\x8f\xaa\xe7\x8c\xab') (是一只猫) selected, freq 35
DEBUG - (b'\xe6\x88\x91', b'\xe6\x98\xaf\xe4\xb8\x80\xe5\x8f\xaa\xe7\x8c\xab') (我是一只猫) selected, freq 20
INFO - number of newly learned merges: 6
4. Use expanded vocabulary:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True,
    extra_vocab_file="qwen_extra.tiktoken"
)

>>> len(tokenizer)
151857

>>> tokenizer("我是一只猫")
{'input_ids': [151854], 'token_type_ids': [0], 'attention_mask': [1]}
This requires tokenizer code released after 2023-10-08. With older versions, manually append the content of qwen_extra.tiktoken to qwen.tiktoken.
5. Fine-tune the model to learn the new tokens.
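The merge-learning step can be illustrated with a minimal BPE learner over byte sequences. This is a toy sketch in the spirit of add_merges.py, not the actual script: it starts from single bytes and repeatedly merges the most frequent adjacent pair, weighted by word frequency:

```python
from collections import Counter

def learn_merges(word_freqs, num_merges):
    # Represent each word as a tuple of single-byte tokens.
    words = {tuple(bytes([b]) for b in w.encode("utf-8")): f
             for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across all words, weighted by frequency.
        pairs = Counter()
        for seq, freq in words.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the new merge to every word.
        merged = {}
        for seq, freq in words.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

print(learn_merges({"aab": 10, "ab": 5}, 2))
# [(b'a', b'b'), (b'a', b'ab')]
```

Note how the pair (b'a', b'b') wins first (frequency 10 + 5 = 15), mirroring how frequencies in qwen_extra_vocab.txt determine merge priority.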

Caveats

Unicode Boundary Issues

Because Qwen operates on UTF-8 bytes (not Unicode codepoints), merges learned from limited training data may not respect Unicode character boundaries.

Problem: invalid merges that cross codepoint boundaries may be learned:
String: "一只" = b'\xe4\xb8\x80\xe5\x8f\xaa'
一 = b'\xe4\xb8\x80'
只 = b'\xe5\x8f\xaa'

# Invalid merge could happen:
b'\x80\xe5'  # crosses codepoint boundary!
Solution: Include all Unicode codepoints from your words with higher frequencies:
一	1000
只	1000
猫	1000
我是一只猫	20
你是一只猫	10
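The boundary problem is easy to see directly on the bytes, without any tokenizer:

```python
# '一只' as UTF-8 bytes: each character is three bytes.
word = "一只"
data = word.encode("utf-8")
print(data)  # b'\xe4\xb8\x80\xe5\x8f\xaa'

# A byte pair such as (b'\x80', b'\xe5') straddles the boundary between
# the two characters; the merged bytes decode to nothing meaningful.
crossing = data[2:4]
print(crossing)                                    # b'\x80\xe5'
print(crossing.decode("utf-8", errors="replace"))  # '��'
```

Listing each codepoint with a high frequency, as above, biases the learner toward whole-character merges first.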

Context-Dependent Tokenization

BPE is distribution-based, not semantic. Tokens may split differently in different contexts:
>>> tokenizer.tokenize("Panda")
[b'P', b'anda']

>>> tokenizer.tokenize(" Panda")
[b' Panda']

>>> tokenizer.tokenize("Pandas")
[b'P', b'andas']

>>> tokenizer.tokenize(" Pandas")
[b' Pand', b'as']
This reflects frequency in training data and is normal behavior.

Existing Token Example

# "一只" is already a token, but "只猫" is learned as a new token.
# Because the existing merge "是一" has higher priority, "是一只猫"
# is built up as:

是|一|只|猫 → 是一|只|猫 → 是一|只猫 → 是一只猫
This is expected BPE behavior based on frequency distribution.

Multilingual Efficiency

While optimized for Chinese, English, and code, Qwen also achieves high compression rates for many other languages, including:
  • Thai (th)
  • Hebrew (he)
  • Arabic (ar)
  • Korean (ko)
  • Vietnamese (vi)
  • Japanese (ja)
  • Turkish (tr)
  • Indonesian (id)
  • Polish (pl)
  • Russian (ru)
  • Dutch (nl)
  • Portuguese (pt)
  • Italian (it)
  • German (de)
  • Spanish (es)
  • French (fr)
This provides strong scalability and efficiency for multilingual applications without vocabulary expansion.

Best Practices

Use injection prevention when:
  • Processing untrusted user input
  • Building chatbots or interactive systems
  • Handling code or structured text that might contain special token patterns
Consider vocabulary expansion when:
  • Fine-tuning on domain-specific text with unique terminology
  • Working with languages that would benefit from better tokenization
  • You have sufficient training data for the new tokens (1000+ examples)
Do NOT expand vocabulary if:
  • You’re only doing inference
  • You have limited training data
  • Your use case works well with existing tokenization
During generation, incomplete UTF-8 sequences may appear:
# Set error handling for cleaner output
tokenizer = AutoTokenizer.from_pretrained(
    'Qwen/Qwen-7B',
    trust_remote_code=True,
    errors="ignore"  # or "replace" (default)
)
