Overview

Nanochat uses a custom BPE (Byte Pair Encoding) tokenizer trained on your data. The tokenizer is implemented as RustBPETokenizer, which combines rustbpe for training with tiktoken for fast inference.

Quick Start

python -m scripts.tok_train
This trains a tokenizer with default settings:
  • Vocabulary size: 32,768 (2^15)
  • Training data: 2 billion characters
  • Document cap: 10,000 characters per document

Configuration

--vocab-size
int
default:"32768"
Vocabulary size for the tokenizer. Common values are powers of 2 (16384, 32768, 65536).
--max-chars
int
default:"2000000000"
Maximum number of characters to train on. Default is 2 billion characters.
--doc-cap
int
default:"10000"
Maximum characters per document. Documents longer than this are truncated to prevent outliers from dominating training.

Training Details

The tokenizer training process:
  1. Data loading: Streams documents from parquet files using parquets_iter_batched(split="train")
  2. Document capping: Truncates each document to --doc-cap characters
  3. BPE training: Trains using GPT-4 style splitting pattern
  4. Token bytes mapping: Creates a mapping from token IDs to byte counts for bits-per-byte evaluation
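Steps 1 and 2 above can be sketched as a small generator (a minimal stand-in for illustration; the real script streams batches from parquet via parquets_iter_batched and feeds the capped text to rustbpe):

```python
def iter_capped_docs(batches, doc_cap, max_chars):
    """Yield documents truncated to doc_cap chars, stopping once the
    total character budget (max_chars) is reached."""
    total = 0
    for batch in batches:          # step 1: stream batches of documents
        for doc in batch:
            doc = doc[:doc_cap]    # step 2: cap each document's length
            yield doc
            total += len(doc)
            if total >= max_chars:
                return

# Tiny demo: cap at 3 chars/doc, stop after 5 total chars.
docs = list(iter_capped_docs([["aaaa", "bbbb"], ["cccc"]],
                             doc_cap=3, max_chars=5))
print(docs)  # ['aaa', 'bbb']
```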

Splitting Pattern

The tokenizer uses a GPT-4-style regex pattern with one modification:
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
Key difference: Uses \p{N}{1,2} instead of \p{N}{1,3} to avoid wasting tokens on numbers for smaller vocabularies.
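The effect of the digit rule can be shown with a simplified ASCII-only sketch using Python's stdlib re (the actual pattern requires the third-party regex module for \p{...} classes and possessive quantifiers). Chunking digit runs into groups of at most two means long numbers cost more tokens, but a small vocabulary is not filled with three-digit number tokens:

```python
import re

# \d{1,2} (nanochat's choice) vs \d{1,3} (GPT-4 style) on a digit run:
print(re.findall(r"\d{1,2}", "20250114"))  # ['20', '25', '01', '14']
print(re.findall(r"\d{1,3}", "20250114"))  # ['202', '501', '14']
```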

Special Tokens

The tokenizer includes these special tokens:
  • <|bos|> - Beginning of sequence (document delimiter)
  • <|user_start|>, <|user_end|> - User messages (for chat)
  • <|assistant_start|>, <|assistant_end|> - Assistant messages
  • <|python_start|>, <|python_end|> - Python tool invocations
  • <|output_start|>, <|output_end|> - Tool outputs
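A hypothetical rendering of a single chat turn using these tokens (the exact formatting nanochat uses between tokens may differ; this only shows the delimiter roles listed above):

```python
def render_turn(user_msg, assistant_msg):
    # <|bos|> delimits the document; the start/end pairs wrap each role.
    return (
        "<|bos|>"
        f"<|user_start|>{user_msg}<|user_end|>"
        f"<|assistant_start|>{assistant_msg}<|assistant_end|>"
    )

turn = render_turn("What is 2+2?", "4")
print(turn)
```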

Output

The training script saves:
  1. tokenizer.pkl - Pickled tiktoken encoding object
  2. token_bytes.pt - Tensor mapping token IDs to byte counts (for bits-per-byte metric)
Location: $NANOCHAT_BASE_DIR/tokenizer/

Token Bytes Metric

The tokenizer computes a token_bytes tensor that maps each token ID to its UTF-8 byte count. This enables measuring model performance in bits per byte rather than bits per token, making the metric invariant to vocabulary size:
token_bytes = [len(token_str.encode("utf-8")) for token_str in vocab]
# Special tokens get byte count = 0 (not counted in evaluation)
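A runnable sketch of the metric on a toy vocabulary (the names and losses here are illustrative, not nanochat's actual API):

```python
import math

# Toy vocabulary; nanochat builds this from the trained BPE merges
# plus its special-token list.
vocab = ["hello", " world", "é", "<|bos|>"]
special_tokens = {"<|bos|>"}

# UTF-8 byte count per token; special tokens count as 0 bytes.
token_bytes = [0 if t in special_tokens else len(t.encode("utf-8"))
               for t in vocab]
print(token_bytes)  # [5, 6, 2, 0] -- note "é" is 2 bytes in UTF-8

# Bits per byte from per-token cross-entropy losses (in nats):
losses_nats = [2.3, 1.7, 4.0]             # one loss per non-special token
bits = sum(l / math.log(2) for l in losses_nats)
bpb = bits / sum(token_bytes)             # total bits / total bytes
```

Because the denominator counts bytes rather than tokens, the same text scored by tokenizers with different vocabulary sizes yields comparable numbers.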

Example Usage

Train a smaller tokenizer for testing:
python -m scripts.tok_train \
  --vocab-size=16384 \
  --max-chars=500000000 \
  --doc-cap=5000
Train a larger tokenizer:
python -m scripts.tok_train \
  --vocab-size=65536 \
  --max-chars=10000000000

Performance

Training typically takes:
  • 2B characters, 32K vocab: ~2-5 minutes
  • 10B characters, 64K vocab: ~10-20 minutes
The exact time depends on CPU speed and data loading.
