Overview
The nanochat tokenizer is a Byte-Pair Encoding (BPE) tokenizer trained with a GPT-4 style splitting pattern.

get_tokenizer
Returns the trained tokenizer instance.

Returns: Tokenizer instance with encode() and decode() methods

Tokenizer Methods
encode(text)

Encodes text to token IDs.

Parameters:
- text: Input text to tokenize

Returns: List of token IDs
decode(token_ids)

Decodes token IDs back to text.

Parameters:
- token_ids: Token IDs to decode

Returns: Decoded text string
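The two methods are inverses: decode(encode(text)) returns the original text. A minimal sketch of the documented interface, using a stand-in byte-level tokenizer rather than nanochat's trained BPE (the `ByteTokenizer` class here is illustrative, not nanochat's actual implementation):

```python
class ByteTokenizer:
    """Stand-in with the documented encode()/decode() interface.

    Each UTF-8 byte is its own token (ids 0-255); nanochat's real
    tokenizer instead maps learned BPE merges into a 32,768-entry
    vocabulary, so it produces far fewer tokens per string.
    """

    def encode(self, text: str) -> list[int]:
        # Text -> list of token IDs.
        return list(text.encode("utf-8"))

    def decode(self, token_ids: list[int]) -> str:
        # List of token IDs -> text.
        return bytes(token_ids).decode("utf-8")

tok = ByteTokenizer()
ids = tok.encode("hello tokenizer")
assert tok.decode(ids) == "hello tokenizer"  # round-trip holds
```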
get_token_bytes

Returns a tensor containing the byte length of each token's UTF-8 encoding.

Parameters:
- device: Device to load the token bytes tensor on ("cpu", "cuda", etc.)

Returns: Tensor of shape (vocab_size,) containing byte counts for each token

Tokenizer Configuration
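These per-token byte counts are what byte-normalized metrics (e.g. bits per byte) are built on. A sketch of how they could be computed from a vocabulary, using a plain Python list in place of the tensor nanochat returns (the tiny `vocab` mapping here is hypothetical):

```python
# Hypothetical tiny vocabulary: token id -> token string.
vocab = {0: "a", 1: "hello", 2: " world", 3: "é"}

# Byte length of each token's UTF-8 encoding, indexed by token id.
# nanochat returns this as a tensor of shape (vocab_size,) on the
# requested device; here it is a plain Python list.
token_bytes = [len(vocab[i].encode("utf-8")) for i in range(len(vocab))]
assert token_bytes == [1, 5, 6, 2]  # "é" is 2 bytes in UTF-8
```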
Vocabulary Size
The default vocabulary size is 32,768 (2^15) tokens.

Special Tokens
The tokenizer includes several special tokens:

- End of text / padding token (same as GPT-2)
- User message marker for chat format
- Assistant message marker for chat format
- End of message marker
- Python code execution tool marker
- End of Python code block
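In chat format, these special tokens bracket each message so the model can tell roles apart. A sketch of rendering a conversation; the literal token strings (`<|user_start|>`, `<|assistant_start|>`, `<|message_end|>`) are placeholders for illustration, not necessarily nanochat's exact token names:

```python
def render_chat(messages):
    """Wrap each (role, content) message in assumed special-token markers."""
    out = []
    for role, content in messages:
        out.append(f"<|{role}_start|>")  # user/assistant message marker
        out.append(content)
        out.append("<|message_end|>")    # end-of-message marker
    return "".join(out)

text = render_chat([("user", "Hi"), ("assistant", "Hello!")])
# "<|user_start|>Hi<|message_end|><|assistant_start|>Hello!<|message_end|>"
```

Because these markers must never be produced by ordinary BPE merges, they are registered as atomic special tokens rather than learned from data.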
Splitting Pattern
The tokenizer uses a GPT-4 style splitting pattern that:

- Splits on whitespace boundaries
- Keeps punctuation separate
- Handles contractions properly
- Keeps digit runs together, in groups of at most three
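The effect of these splitting rules can be seen with a simplified, ASCII-only approximation of the pattern (the real GPT-4 pattern uses Unicode categories and possessive quantifiers, which need the third-party `regex` module; this `re`-compatible version is illustrative only):

```python
import re

# Simplified GPT-4-style split: contractions, words, digit groups of
# up to three, punctuation runs, and whitespace.
PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # contractions stay attached ('s, 'll, ...)
    r"| ?[A-Za-z]+"            # words, with at most one leading space
    r"| ?\d{1,3}"              # digits in groups of at most three
    r"| ?[^\sA-Za-z\d]+"       # punctuation runs
    r"|\s+(?!\S)|\s+"          # remaining whitespace
)

chunks = PATTERN.findall("Hello world, it's 2024!")
# ['Hello', ' world', ',', ' it', "'s", ' 202', '4', '!']
```

Pre-splitting like this keeps BPE merges from crossing word or punctuation boundaries, which makes the learned vocabulary much more reusable across contexts.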
Training the Tokenizer
To train a new tokenizer from scratch, run the tokenizer training script. It:

- Downloads ~2B characters of pretraining data
- Trains a BPE tokenizer with vocab size 32,768
- Uses the GPT-4 splitting pattern
- Saves the tokenizer to ~/.cache/nanochat/tokenizer.json
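The core of BPE training is a merge loop: repeatedly find the most frequent adjacent token pair and replace it with a new token until the target vocabulary size is reached. A self-contained toy sketch of that loop (the function name and structure are illustrative; nanochat's trainer additionally applies the splitting pattern and runs at far larger scale):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: merge the most frequent pair num_merges times."""
    ids = list(text.encode("utf-8"))  # start from raw UTF-8 bytes
    merges = {}                       # (id, id) -> new token id
    next_id = 256                     # ids 0-255 are reserved for bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = next_id
        out, i = [], 0
        while i < len(ids):  # rewrite the sequence with the merged token
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", num_merges=2)
```

Starting from 256 byte tokens, reaching a 32,768-token vocabulary means learning 32,512 such merges (minus any slots reserved for special tokens).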
Evaluation
Evaluate the tokenizer's compression ratio. The evaluation reports:

- Compression ratio: How many bytes each token represents on average
- Vocabulary statistics: Token usage distribution
- Special token IDs: Mapping of special tokens to IDs
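Compression ratio is simply the UTF-8 byte length of the text divided by the number of tokens it encodes to. A sketch of the computation (the whitespace split below is a stand-in for a real tokenization, used only to produce some token count):

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    # Average number of UTF-8 bytes represented by each token.
    return len(text.encode("utf-8")) / len(token_ids)

# Stand-in tokenization: pretend each whitespace-separated word is a token.
text = "the tokenizer compresses text"
fake_ids = list(range(len(text.split())))   # 4 "tokens"
ratio = compression_ratio(text, fake_ids)   # 29 bytes / 4 tokens = 7.25
```

A higher ratio means each token carries more text, so the same context window covers more characters.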
Usage Example
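A typical session loads the tokenizer once and then encodes and decodes with it. To keep this example self-contained and runnable, a stand-in `get_tokenizer()` with the documented interface is defined inline; in real code the function comes from nanochat and returns the trained BPE tokenizer:

```python
# In nanochat, get_tokenizer() loads the trained BPE tokenizer from
# ~/.cache/nanochat/tokenizer.json. This stand-in with the same
# interface keeps the example self-contained (it tokenizes per byte).
def get_tokenizer():
    class StubTokenizer:
        def encode(self, text):
            return list(text.encode("utf-8"))
        def decode(self, token_ids):
            return bytes(token_ids).decode("utf-8")
    return StubTokenizer()

tokenizer = get_tokenizer()
ids = tokenizer.encode("The quick brown fox")
print(len(ids))               # token count (19 here; fewer with real BPE)
print(tokenizer.decode(ids))  # "The quick brown fox"
```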
See Also
- Tokenization Guide - Complete tokenizer training guide
- Base Training - Using the tokenizer for pretraining