
Overview

The nanochat tokenizer is a Byte-Pair Encoding (BPE) tokenizer trained with a GPT-4 style splitting pattern.

get_tokenizer

Returns the trained tokenizer instance.
from nanochat.tokenizer import get_tokenizer

tokenizer = get_tokenizer()
Returns:
  • tokenizer (object): Tokenizer instance with encode() and decode() methods

Tokenizer Methods

encode(text)

Encodes text to token IDs:
text = "Hello, world!"
token_ids = tokenizer.encode(text)
# Returns: list of integers
Parameters:
  • text (str, required): Input text to tokenize

Returns:
  • token_ids (list[int]): List of token IDs

decode(token_ids)

Decodes token IDs back to text:
token_ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(token_ids)
# Returns the original text: "Hello, world!"
Parameters:
  • token_ids (list[int], required): Token IDs to decode

Returns:
  • text (str): Decoded text string

get_token_bytes

Returns a tensor containing the byte length of each token’s UTF-8 encoding.
from nanochat.tokenizer import get_token_bytes

token_bytes = get_token_bytes(device="cpu")
# token_bytes is a tensor of shape (vocab_size,)
# Each element is the number of bytes for that token

# Calculate average
avg_bytes = token_bytes.float().mean().item()
print(f"Average bytes per token: {avg_bytes:.3f}")
This metric is used for bits-per-byte evaluation. A well-trained tokenizer typically achieves ~3-4 bytes per token.
Parameters:
  • device (str, default "cpu"): Device to load the token bytes tensor on ("cpu", "cuda", etc.)

Returns:
  • token_bytes (torch.Tensor): Tensor of shape (vocab_size,) containing byte counts for each token
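The byte counts feed directly into the bits-per-byte metric: cross-entropy loss is measured per token (in nats), converted to bits, then divided by the average UTF-8 bytes per token, so models with different tokenizers become comparable. A minimal sketch, using hypothetical loss and byte-count values:

```python
import math

# Hypothetical values for illustration; in practice loss_nats comes from
# model evaluation and avg_bytes_per_token from get_token_bytes(...).float().mean()
loss_nats = 1.20           # mean cross-entropy per token, in nats
avg_bytes_per_token = 4.0  # average UTF-8 bytes per token

bits_per_token = loss_nats / math.log(2)    # convert nats to bits
bpb = bits_per_token / avg_bytes_per_token  # normalize by bytes per token
print(f"bits per byte: {bpb:.4f}")
```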

Tokenizer Configuration

Vocabulary Size

The default vocabulary size is 32,768 (2^15) tokens.

Special Tokens

The tokenizer includes several special tokens:
  • <|endoftext|> (ID 0): End of text / padding token (same role as in GPT-2)
  • <|user|>: User message marker for the chat format
  • <|assistant|>: Assistant message marker for the chat format
  • <|end|>: End of message marker
  • <|python|>: Python code execution tool marker
  • <|/python|>: End of Python code block
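To illustrate how these markers delimit a conversation, here is a sketch of rendering chat turns as plain text. The token names match the list above, but the exact formatting nanochat applies around them is an assumption for illustration only:

```python
def render_turn(role: str, content: str) -> str:
    # Wrap one message in its role marker and the end-of-message token.
    # Hypothetical formatting; nanochat's actual chat template may differ.
    return f"<|{role}|>{content}<|end|>"

convo = render_turn("user", "Hi!") + render_turn("assistant", "Hello!")
print(convo)
```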

Splitting Pattern

The tokenizer uses a GPT-4 style splitting pattern that:
  • Splits on whitespace boundaries
  • Keeps punctuation separate
  • Handles contractions properly
  • Preserves digits in groups
This pattern improves tokenization quality compared to the original GPT-2 pattern.
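The real GPT-4 pattern relies on Unicode character classes and possessive quantifiers (available via the third-party `regex` package, not the standard library). A simplified, ASCII-only approximation using `re` demonstrates the behaviors listed above; this is a sketch of the idea, not the actual pattern nanochat ships:

```python
import re

# Simplified ASCII-only approximation of a GPT-4 style split pattern.
PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # common contractions ('s, 'll, 've, ...)
    r"| ?[A-Za-z]+"           # words, with an optional leading space
    r"| ?\d{1,3}"             # digits in groups of at most three
    r"| ?[^\sA-Za-z\d]+"      # punctuation runs kept separate
    r"|\s+"                   # remaining whitespace
)

chunks = PATTERN.findall("Hello, world! I'm 12345")
print(chunks)
```

Note how the contraction `'m` stays together, punctuation is split off, and the digit run is broken into groups of at most three.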

Training the Tokenizer

To train a new tokenizer from scratch:
python -m scripts.tok_train
The training process:
  1. Downloads ~2B characters of pretraining data
  2. Trains a BPE tokenizer with vocab size 32,768
  3. Uses the GPT-4 splitting pattern
  4. Saves the tokenizer to ~/.cache/nanochat/tokenizer.json
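The core of step 2 is the classic BPE loop: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token until the vocabulary is full. A toy illustration of that loop (not the actual scripts.tok_train implementation, which is optimized and operates on pre-split text chunks):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: byte-level start, greedy pair merging."""
    seq = list(text.encode("utf-8"))  # initial vocab: the 256 byte values
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(pair)
        new_id = 256 + len(merges) - 1    # next free token id
        out, i = [], 0
        while i < len(seq):               # replace every occurrence
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, seq = train_bpe("aaabdaaabac", 3)
```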

Evaluation

Evaluate tokenizer compression ratio:
python -m scripts.tok_eval
This reports:
  • Compression ratio: How many bytes each token represents on average
  • Vocabulary statistics: Token usage distribution
  • Special token IDs: Mapping of special tokens to IDs
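The compression ratio itself is simple to compute for any tokenizer: UTF-8 bytes of the input divided by the number of tokens produced. A sketch of the measurement (the whitespace "tokenizer" here is a stand-in for illustration, not nanochat's):

```python
def compression_ratio(encode, text: str) -> float:
    # Bytes of UTF-8 text per token produced by the encoder.
    return len(text.encode("utf-8")) / len(encode(text))

# Stand-in encoder for illustration: one token per whitespace-separated word.
ratio = compression_ratio(str.split, "The quick brown fox")
print(f"{ratio:.2f} bytes/token")
```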

Usage Example

from nanochat.tokenizer import get_tokenizer, get_token_bytes

# Load tokenizer
tok = get_tokenizer()

# Encode text
text = "The quick brown fox jumps over the lazy dog."
token_ids = tok.encode(text)
print(f"Tokens: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

# Decode back
recovered = tok.decode(token_ids)
assert recovered == text

# Check compression
token_bytes = get_token_bytes(device="cpu")
avg_bytes = token_bytes.float().mean().item()
print(f"Bytes per token: {avg_bytes:.3f}")
