Baseline tokenizer
The default tokenizer is a SentencePiece BPE model with a 1024-token vocabulary trained on FineWeb. The model file is selected via the TOKENIZER_PATH environment variable in Hyperparameters, and the VOCAB_SIZE environment variable must match the tokenizer's vocabulary size exactly.
Why BPB enables tokenizer-agnostic comparison
Token-level cross-entropy loss is not comparable across tokenizers with different vocabulary sizes:
- A byte-level tokenizer (vocab=256) has ~4 tokens per English word. Each token must carry 1 byte of information.
- A subword tokenizer (vocab=32 000) has ~1.3 tokens per English word. Each token must carry ~2.5 bytes of information.
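To make the comparison concrete, here is a small sketch of how bits-per-byte normalises token-level loss by byte count. The loss numbers below are illustrative, not measured:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) into bits per byte."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Illustrative numbers for the same 1000 bytes of text:
# Byte-level tokenizer: 1000 tokens at 1.0 nats/token.
byte_bpb = bits_per_byte(1000 * 1.0, 1000)    # ~1.443 BPB
# Subword tokenizer: 400 tokens at 2.2 nats/token.
subword_bpb = bits_per_byte(400 * 2.2, 1000)  # ~1.270 BPB
```

Note that the subword model wins on BPB even though its per-token loss is more than double: BPB charges each tokenizer for the bytes it actually covers, not the tokens it emits.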
build_sentencepiece_luts()
To count bytes at validation time without decoding every token, the script precomputes three lookup tables over the full vocabulary:
Lookup table descriptions
base_bytes_lut — byte count per token ID
For each token ID, stores the number of UTF-8 bytes in the token's surface form, excluding any leading ▁ space. Byte tokens are counted as 1. Control, unknown, and unused tokens are left as 0.
has_leading_space_lut — whether token has a ▁ prefix
SentencePiece encodes word-initial spaces as a ▁ prefix on the following token. This table records which token IDs carry such a prefix. The space is counted as 1 extra byte, but only when the preceding token is not a boundary token.
is_boundary_token_lut — control/unknown/unused tokens
All token IDs are initialised as boundary tokens (True). Normal tokens are set to False during the loop. This is used to suppress the leading-space correction when the preceding context token is a special/control token.
Leading space correction detail
SentencePiece absorbs the space before a word into the token that begins the word (e.g., " hello" becomes ▁hello). When counting bytes:
- base_bytes_np[token_id] counts len("hello".encode("utf-8")) = 5
- has_leading_space_lut[token_id] is True, so one extra byte is added, unless the previous token is a boundary token (e.g., start-of-sequence)
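A minimal pure-Python sketch of the three tables and the byte-counting rule, using a hand-made toy vocabulary in place of a real SentencePiece model. The list names mirror the lookup tables above, but the construction here is illustrative, not the repo's implementation:

```python
# Toy vocabulary standing in for sp.id_to_piece() over a real model.
pieces = ["<s>", "</s>", "▁hello", "▁world", "ing", "<unk>"]
control_or_unused = {"<s>", "</s>", "<unk>"}  # assumed special tokens

vocab = len(pieces)
base_bytes_lut = [0] * vocab
has_leading_space_lut = [False] * vocab
is_boundary_token_lut = [True] * vocab  # all IDs start as boundary tokens

for tid, piece in enumerate(pieces):
    if piece in control_or_unused:
        continue  # stays a boundary token with 0 bytes
    is_boundary_token_lut[tid] = False
    has_leading_space_lut[tid] = piece.startswith("▁")
    surface = piece.lstrip("▁")  # byte count excludes the ▁ marker itself
    base_bytes_lut[tid] = len(surface.encode("utf-8"))

def count_bytes(token_ids):
    """Total UTF-8 bytes for a token sequence, with the space correction."""
    total = 0
    prev_is_boundary = True  # sequence start behaves like a boundary
    for tid in token_ids:
        total += base_bytes_lut[tid]
        if has_leading_space_lut[tid] and not prev_is_boundary:
            total += 1  # the ▁ stands for one real space byte
        prev_is_boundary = is_boundary_token_lut[tid]
    return total

# [<s>, ▁hello, ▁world] → 5 + (5 + 1 space) = 11 bytes, i.e. "hello world"
count_bytes([0, 2, 3])
```

The start-of-sequence token suppresses the space correction for ▁hello (there is no space before the first word), while ▁world gets its extra byte because it follows a normal token.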
Vocabulary size tradeoffs
Small vocabulary
More tokens per byte → a higher tokens-per-byte ratio → each token spans fewer bytes and must compress less, so the per-token BPB penalty is smaller. The embedding table is also cheaper: fewer rows mean more parameter budget for the model body.
Large vocabulary
Fewer tokens per byte → each token must carry more semantic signal. The embedding table consumes a larger fraction of the 16 MB budget, leaving less room for transformer weight matrices.
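Back-of-the-envelope arithmetic makes the tradeoff visible. The d_model and fp16 dtype below are assumptions for illustration, not values from this repo:

```python
d_model, bytes_per_param = 512, 2  # assumed: fp16 embedding, d_model=512
budget = 16 * 2**20                # the 16 MB budget mentioned above

costs = {}
for vocab in (256, 1024, 32_000):
    emb_bytes = vocab * d_model * bytes_per_param
    costs[vocab] = emb_bytes
    print(f"vocab={vocab:>6}: embedding {emb_bytes / 2**20:.2f} MiB "
          f"({100 * emb_bytes / budget:.1f}% of budget)")
```

Under these assumptions the 1024-token table costs 1 MiB (about 6% of the budget), while a 32 000-token table alone would exceed the entire 16 MB budget.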
Bringing a custom tokenizer
You can retokenize FineWeb with any SentencePiece vocabulary using the data export pipeline described in data/README.md. Once you have retokenized shards:
The training script verifies that VOCAB_SIZE matches sp.vocab_size() at startup and will raise an error if they differ.
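The check itself can be sketched in a few lines. Variable names here are illustrative, and sp_vocab_size stands in for calling sp.vocab_size() on the loaded SentencePiece model:

```python
import os

sp_vocab_size = 1024  # stand-in for sp.vocab_size() on the loaded model
vocab_size = int(os.environ.get("VOCAB_SIZE", sp_vocab_size))

# Fail fast: a mismatch silently corrupts every token ID past the smaller vocab.
if vocab_size != sp_vocab_size:
    raise ValueError(
        f"VOCAB_SIZE={vocab_size} does not match tokenizer "
        f"vocab size {sp_vocab_size}"
    )
```

Failing at startup is deliberate: a mismatched VOCAB_SIZE would otherwise only surface later as out-of-range embedding lookups or a quietly wasted slice of the embedding table.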
