Overview

Nanochat uses a custom BPE (Byte Pair Encoding) tokenizer trained on your data. The tokenizer is implemented as RustBPETokenizer, which combines rustbpe for training with tiktoken for fast inference.

Quick Start

python -m scripts.tok_train
This trains a tokenizer with default settings:
  • Vocabulary size: 32,768 (2^15)
  • Training data: 2 billion characters
  • Document cap: 10,000 characters per document

Configuration

--vocab-size
int
default:"32768"
Vocabulary size for the tokenizer. Common values are powers of 2 (16384, 32768, 65536).
--max-chars
int
default:"2000000000"
Maximum number of characters to train on. Default is 2 billion characters.
--doc-cap
int
default:"10000"
Maximum characters per document. Documents longer than this are truncated to prevent outliers from dominating training.

Training Details

The tokenizer training process:
  1. Data loading: Streams documents from parquet files using parquets_iter_batched(split="train")
  2. Document capping: Truncates each document to --doc-cap characters
  3. BPE training: Trains using GPT-4 style splitting pattern
  4. Token bytes mapping: Creates a mapping from token IDs to byte counts for bits-per-byte evaluation
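Steps 1 and 2 above can be sketched as a small generator (a minimal stand-in for illustration; the real script streams batches from parquet via parquets_iter_batched and feeds the capped text to rustbpe):

```python
def iter_capped_docs(batches, doc_cap, max_chars):
    """Yield documents truncated to doc_cap chars, stopping once the
    total character budget (max_chars) is reached."""
    total = 0
    for batch in batches:          # step 1: stream batches of documents
        for doc in batch:
            doc = doc[:doc_cap]    # step 2: cap each document's length
            yield doc
            total += len(doc)
            if total >= max_chars:
                return

# Tiny demo: cap at 3 chars/doc, stop after 5 total chars.
docs = list(iter_capped_docs([["aaaa", "bbbb"], ["cccc"]],
                             doc_cap=3, max_chars=5))
print(docs)  # ['aaa', 'bbb']
```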

Splitting Pattern

The tokenizer uses a GPT-4-style regex pattern with one modification:
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
Key difference: Uses \p{N}{1,2} instead of \p{N}{1,3} to avoid wasting tokens on numbers for smaller vocabularies.
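The effect of the digit rule can be shown with a simplified ASCII-only sketch using Python's stdlib re (the actual pattern requires the third-party regex module for \p{...} classes and possessive quantifiers). Chunking digit runs into groups of at most two means long numbers cost more tokens, but a small vocabulary is not filled with three-digit number tokens:

```python
import re

# \d{1,2} (nanochat's choice) vs \d{1,3} (GPT-4 style) on a digit run:
print(re.findall(r"\d{1,2}", "20250114"))  # ['20', '25', '01', '14']
print(re.findall(r"\d{1,3}", "20250114"))  # ['202', '501', '14']
```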

Special Tokens

The tokenizer includes these special tokens:
  • <|bos|> - Beginning of sequence (document delimiter)
  • <|user_start|>, <|user_end|> - User messages (for chat)
  • <|assistant_start|>, <|assistant_end|> - Assistant messages
  • <|python_start|>, <|python_end|> - Python tool invocations
  • <|output_start|>, <|output_end|> - Tool outputs
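A hypothetical rendering of a single chat turn using these tokens (the exact formatting nanochat uses between tokens may differ; this only shows the delimiter roles listed above):

```python
def render_turn(user_msg, assistant_msg):
    # <|bos|> delimits the document; the start/end pairs wrap each role.
    return (
        "<|bos|>"
        f"<|user_start|>{user_msg}<|user_end|>"
        f"<|assistant_start|>{assistant_msg}<|assistant_end|>"
    )

turn = render_turn("What is 2+2?", "4")
print(turn)
```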

Output

The training script saves:
  1. tokenizer.pkl - Pickled tiktoken encoding object
  2. token_bytes.pt - Tensor mapping token IDs to byte counts (for bits-per-byte metric)
Location: $NANOCHAT_BASE_DIR/tokenizer/

Token Bytes Metric

The tokenizer computes a token_bytes tensor that maps each token ID to its UTF-8 byte count. This enables measuring model performance in bits per byte rather than bits per token, making the metric invariant to vocabulary size:
token_bytes = [len(token_str.encode("utf-8")) for token_str in vocab]
# Special tokens get byte count = 0 (not counted in evaluation)
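A runnable sketch of the metric on a toy vocabulary (the names and losses here are illustrative, not nanochat's actual API):

```python
import math

# Toy vocabulary; nanochat builds this from the trained BPE merges
# plus its special-token list.
vocab = ["hello", " world", "é", "<|bos|>"]
special_tokens = {"<|bos|>"}

# UTF-8 byte count per token; special tokens count as 0 bytes.
token_bytes = [0 if t in special_tokens else len(t.encode("utf-8"))
               for t in vocab]
print(token_bytes)  # [5, 6, 2, 0] -- note "é" is 2 bytes in UTF-8

# Bits per byte from per-token cross-entropy losses (in nats):
losses_nats = [2.3, 1.7, 4.0]             # one loss per non-special token
bits = sum(l / math.log(2) for l in losses_nats)
bpb = bits / sum(token_bytes)             # total bits / total bytes
```

Because the denominator counts bytes rather than tokens, the same text scored by tokenizers with different vocabulary sizes yields comparable numbers.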

Example Usage

Train a smaller tokenizer for testing:
python -m scripts.tok_train \
  --vocab-size=16384 \
  --max-chars=500000000 \
  --doc-cap=5000
Train a larger tokenizer:
python -m scripts.tok_train \
  --vocab-size=65536 \
  --max-chars=10000000000

Performance

Training typically takes:
  • 2B characters, 32K vocab: ~2-5 minutes
  • 10B characters, 64K vocab: ~10-20 minutes
The exact time depends on CPU speed and data loading.
