
Overview

The nanochat tokenizer is a Byte-Pair Encoding (BPE) tokenizer trained with a GPT-4 style splitting pattern.

get_tokenizer

Returns the trained tokenizer instance.
from nanochat.tokenizer import get_tokenizer

tokenizer = get_tokenizer()
Returns:
  • tokenizer (object): Tokenizer instance with encode() and decode() methods

Tokenizer Methods

encode(text)

Encodes text to token IDs:
text = "Hello, world!"
token_ids = tokenizer.encode(text)
# Returns: list of integers
Parameters:
  • text (str, required): Input text to tokenize

Returns:
  • token_ids (list[int]): List of token IDs

decode(token_ids)

Decodes token IDs back to text:
token_ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(token_ids)
# Returns the original text: "Hello, world!"
Parameters:
  • token_ids (list[int], required): Token IDs to decode

Returns:
  • text (str): Decoded text string

get_token_bytes

Returns a tensor containing the byte length of each token’s UTF-8 encoding.
from nanochat.tokenizer import get_token_bytes

token_bytes = get_token_bytes(device="cpu")
# token_bytes is a tensor of shape (vocab_size,)
# Each element is the number of bytes for that token

# Calculate average
avg_bytes = token_bytes.float().mean().item()
print(f"Average bytes per token: {avg_bytes:.3f}")
This metric is used for bits-per-byte evaluation. A well-trained tokenizer typically achieves ~3-4 bytes per token.
Parameters:
  • device (str, default "cpu"): Device to load the token bytes tensor on ("cpu", "cuda", etc.)

Returns:
  • token_bytes (torch.Tensor): Tensor of shape (vocab_size,) containing byte counts for each token
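The byte counts feed directly into the bits-per-byte metric: cross-entropy loss is measured per token (in nats), converted to bits, then divided by the average UTF-8 bytes per token, so models with different tokenizers become comparable. A minimal sketch, using hypothetical loss and byte-count values:

```python
import math

# Hypothetical values for illustration; in practice loss_nats comes from
# model evaluation and avg_bytes_per_token from get_token_bytes(...).float().mean()
loss_nats = 1.20           # mean cross-entropy per token, in nats
avg_bytes_per_token = 4.0  # average UTF-8 bytes per token

bits_per_token = loss_nats / math.log(2)    # convert nats to bits
bpb = bits_per_token / avg_bytes_per_token  # normalize by bytes per token
print(f"bits per byte: {bpb:.4f}")
```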

Tokenizer Configuration

Vocabulary Size

The default vocabulary size is 32,768 (2^15) tokens.

Special Tokens

The tokenizer includes several special tokens:
  • <|endoftext|> (ID 0): End of text / padding token (same role as in GPT-2)
  • <|user|>: User message marker for the chat format
  • <|assistant|>: Assistant message marker for the chat format
  • <|end|>: End of message marker
  • <|python|>: Python code execution tool marker
  • <|/python|>: End of Python code block
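To illustrate how these markers delimit a conversation, here is a sketch of rendering chat turns as plain text. The token names match the list above, but the exact formatting nanochat applies around them is an assumption for illustration only:

```python
def render_turn(role: str, content: str) -> str:
    # Wrap one message in its role marker and the end-of-message token.
    # Hypothetical formatting; nanochat's actual chat template may differ.
    return f"<|{role}|>{content}<|end|>"

convo = render_turn("user", "Hi!") + render_turn("assistant", "Hello!")
print(convo)
```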

Splitting Pattern

The tokenizer uses a GPT-4 style splitting pattern that:
  • Splits on whitespace boundaries
  • Keeps punctuation separate
  • Handles contractions properly
  • Preserves digits in groups
This pattern improves tokenization quality compared to the original GPT-2 pattern.
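The real GPT-4 pattern relies on Unicode character classes and possessive quantifiers (available via the third-party `regex` package, not the standard library). A simplified, ASCII-only approximation using `re` demonstrates the behaviors listed above; this is a sketch of the idea, not the actual pattern nanochat ships:

```python
import re

# Simplified ASCII-only approximation of a GPT-4 style split pattern.
PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # common contractions ('s, 'll, 've, ...)
    r"| ?[A-Za-z]+"           # words, with an optional leading space
    r"| ?\d{1,3}"             # digits in groups of at most three
    r"| ?[^\sA-Za-z\d]+"      # punctuation runs kept separate
    r"|\s+"                   # remaining whitespace
)

chunks = PATTERN.findall("Hello, world! I'm 12345")
print(chunks)
```

Note how the contraction `'m` stays together, punctuation is split off, and the digit run is broken into groups of at most three.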

Training the Tokenizer

To train a new tokenizer from scratch:
python -m scripts.tok_train
The training process:
  1. Downloads ~2B characters of pretraining data
  2. Trains a BPE tokenizer with vocab size 32,768
  3. Uses the GPT-4 splitting pattern
  4. Saves the tokenizer to ~/.cache/nanochat/tokenizer.json
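The core of step 2 is the classic BPE loop: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token until the vocabulary is full. A toy illustration of that loop (not the actual scripts.tok_train implementation, which is optimized and operates on pre-split text chunks):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: byte-level start, greedy pair merging."""
    seq = list(text.encode("utf-8"))  # initial vocab: the 256 byte values
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(pair)
        new_id = 256 + len(merges) - 1    # next free token id
        out, i = [], 0
        while i < len(seq):               # replace every occurrence
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, seq = train_bpe("aaabdaaabac", 3)
```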

Evaluation

Evaluate tokenizer compression ratio:
python -m scripts.tok_eval
This reports:
  • Compression ratio: How many bytes each token represents on average
  • Vocabulary statistics: Token usage distribution
  • Special token IDs: Mapping of special tokens to IDs
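The compression ratio itself is simple to compute for any tokenizer: UTF-8 bytes of the input divided by the number of tokens produced. A sketch of the measurement (the whitespace "tokenizer" here is a stand-in for illustration, not nanochat's):

```python
def compression_ratio(encode, text: str) -> float:
    # Bytes of UTF-8 text per token produced by the encoder.
    return len(text.encode("utf-8")) / len(encode(text))

# Stand-in encoder for illustration: one token per whitespace-separated word.
ratio = compression_ratio(str.split, "The quick brown fox")
print(f"{ratio:.2f} bytes/token")
```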

Usage Example

from nanochat.tokenizer import get_tokenizer, get_token_bytes

# Load tokenizer
tok = get_tokenizer()

# Encode text
text = "The quick brown fox jumps over the lazy dog."
token_ids = tok.encode(text)
print(f"Tokens: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

# Decode back
recovered = tok.decode(token_ids)
assert recovered == text

# Check compression
token_bytes = get_token_bytes(device="cpu")
avg_bytes = token_bytes.float().mean().item()
print(f"Bytes per token: {avg_bytes:.3f}")
