Overview
Nanochat uses a custom BPE (Byte Pair Encoding) tokenizer trained on your data. The tokenizer is implemented as a RustBPETokenizer, which combines rustbpe for training with tiktoken for fast inference.
Quick Start
- Vocabulary size: 32,768 (2^15)
- Training data: 2 billion characters
- Document cap: 10,000 characters per document
Configuration
- Vocabulary size: the size of the tokenizer vocabulary. Common values are powers of 2 (16384, 32768, 65536).
- Maximum training characters: the maximum number of characters to train on. Defaults to 2 billion.
- Document cap (--doc-cap): maximum characters per document. Documents longer than this are truncated to prevent outliers from dominating training.
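The documented settings map naturally onto a command-line interface. The sketch below shows one plausible argparse setup with the defaults listed above; only `--doc-cap` is named in this document, so `--vocab-size` and `--max-chars` are hypothetical flag names.

```python
import argparse

# Sketch of a CLI for the documented defaults. Only --doc-cap appears in the
# docs; the other two flag names are assumptions for illustration.
parser = argparse.ArgumentParser(description="Train the BPE tokenizer")
parser.add_argument("--vocab-size", type=int, default=32768,
                    help="tokenizer vocabulary size (2^15 by default)")
parser.add_argument("--max-chars", type=int, default=2_000_000_000,
                    help="maximum characters of training data")
parser.add_argument("--doc-cap", type=int, default=10_000,
                    help="truncate each document to this many characters")

args = parser.parse_args([])  # use defaults
```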
Training Details
The tokenizer training process:
- Data loading: streams documents from parquet files using parquets_iter_batched(split="train")
- Document capping: truncates each document to --doc-cap characters
- BPE training: trains using a GPT-4-style splitting pattern
- Token bytes mapping: Creates a mapping from token IDs to byte counts for bits-per-byte evaluation
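The data-loading and capping steps above amount to a bounded streaming loop. Here is a minimal sketch, with an in-memory list of batches standing in for the real parquet iterator (`capped_docs` is a hypothetical helper, not nanochat's actual code):

```python
def capped_docs(batches, max_chars=2_000_000_000, doc_cap=10_000):
    """Yield documents truncated to doc_cap characters, stopping once
    the total character budget (max_chars) is exhausted."""
    total = 0
    for batch in batches:          # e.g. parquets_iter_batched(split="train")
        for doc in batch:
            doc = doc[:doc_cap]    # cap long documents
            yield doc
            total += len(doc)
            if total >= max_chars:
                return

# Toy usage: a tiny "stream" stands in for the parquet reader.
batches = [["a" * 15_000, "bb"], ["ccc"]]
docs = list(capped_docs(batches, max_chars=10_001, doc_cap=10_000))
# The first document is truncated to 10,000 chars; the budget then runs out.
```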
Splitting Pattern
The tokenizer uses a GPT-4-style regex pattern with one modification: \p{N}{1,2} instead of \p{N}{1,3}, to avoid wasting tokens on numbers at smaller vocabulary sizes.
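The effect of the modification is that digit runs are pre-split into chunks of at most two digits before any BPE merging. The real pattern uses Unicode categories (\p{L}, \p{N}), which Python's standard re module does not support, so the snippet below is a simplified ASCII-only approximation for illustration:

```python
import re

# Simplified, ASCII-only approximation of a GPT-4-style split pattern.
# [0-9]{1,2} stands in for \p{N}{1,2}: numbers are chunked into at most
# two digits per piece before BPE merging.
PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # common English contractions
    r"| ?[A-Za-z]+"           # words, with an optional leading space
    r"|[0-9]{1,2}"            # digit runs of length 1-2 (the modification)
    r"| ?[^\sA-Za-z0-9]+"     # punctuation runs
    r"|\s+",                  # whitespace
)

print(PATTERN.findall("In 2024 we trained 32768 merges"))
```

Note how "32768" splits into "32", "76", "8" rather than being kept as a single pre-token.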
Special Tokens
The tokenizer includes these special tokens:
- <|bos|> - beginning of sequence (document delimiter)
- <|user_start|>, <|user_end|> - user messages (for chat)
- <|assistant_start|>, <|assistant_end|> - assistant messages
- <|python_start|>, <|python_end|> - Python tool invocations
- <|output_start|>, <|output_end|> - tool outputs
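The start/end pairs delimit turns in a rendered conversation. The following is a sketch of how a chat might be laid out with these delimiters; the exact rendering nanochat uses may differ, and `render_chat` is a hypothetical helper:

```python
def render_chat(turns):
    """Sketch: lay out a conversation using the special-token delimiters.
    Each turn is a (role, text) pair; roles match the token names above."""
    out = ["<|bos|>"]  # document delimiter at the start of the sequence
    for role, text in turns:
        out.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(out)

print(render_chat([("user", "Hi"), ("assistant", "Hello!")]))
```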
Output
The training script saves its artifacts to $NANOCHAT_BASE_DIR/tokenizer/:
- tokenizer.pkl - pickled tiktoken encoding object
- token_bytes.pt - tensor mapping token IDs to byte counts (for the bits-per-byte metric)
Token Bytes Metric
The tokenizer computes a token_bytes tensor that maps each token ID to the UTF-8 byte count of its string. This enables measuring model performance in bits per byte rather than bits per token, making the metric invariant to vocabulary size.
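The conversion can be sketched as follows, using plain Python lists in place of the torch tensor: per-token cross-entropy losses (in nats) are converted to bits and divided by the total UTF-8 bytes those tokens decode to, rather than by the token count. This is a minimal illustration, not nanochat's actual evaluation code:

```python
import math

# Toy vocabulary; token_bytes maps each token id to the UTF-8 byte
# length of its string (a multi-byte char like "é" counts as 2 bytes).
vocab = ["a", "é", "hello"]
token_bytes = [len(s.encode("utf-8")) for s in vocab]

def bits_per_byte(losses_nats, token_ids):
    """Convert per-token cross-entropy losses (nats) into bits per byte.
    Dividing by total bytes instead of token count makes the metric
    comparable across tokenizers with different vocabulary sizes."""
    total_bits = sum(losses_nats) / math.log(2)           # nats -> bits
    total_bytes = sum(token_bytes[t] for t in token_ids)  # bytes decoded
    return total_bits / total_bytes
```

For example, three tokens each with a loss of ln(2) nats (1 bit) over 1 + 2 + 5 = 8 bytes give 3/8 bits per byte.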
Example Usage
Train a smaller tokenizer for testing by reducing the training character budget and the vocabulary size.

Performance
Training typically takes:
- 2B characters, 32K vocab: ~2-5 minutes
- 10B characters, 64K vocab: ~10-20 minutes