Overview
The challenge uses a preprocessed snapshot of FineWeb, a large-scale English web text dataset curated by Hugging Face. The preprocessing pipeline tokenizes a fixed, shuffled subset of FineWeb documents and exports the result as binary shard files ready for direct streaming by the training script. The default published dataset is hosted at willdepueoai/parameter-golf on Hugging Face as a dataset repository. It includes tokenizer files and all shard files under a datasets/ subdirectory.
Default variant
sp1024 — SentencePiece BPE with vocabulary size 1024
Shard size
100,000,000 tokens per shard (~200 MB each)
Default download
80 training shards (8B tokens); maximum available determined by the manifest
Validation split
Fixed first-50k-document set, always downloaded in full
Shard File Format
Each .bin shard file follows a fixed binary layout:
| Index | Field | Value |
|---|---|---|
| 0 | Magic number | 20240520 |
| 1 | Version | 1 |
| 2 | num_tokens | Token count in this shard |
| 3–255 | Reserved | (unused) |
The header occupies 256 int32 slots; the token data follows as uint16. The file size is therefore exactly 256 * 4 + num_tokens * 2 bytes. The load_data_shard() function validates both the header magic/version and the file size before reading.
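The layout above can be sketched as a reader/writer pair. This is a minimal illustration of the documented format, not the repo's actual implementation; write_shard is a hypothetical helper, and the real load_data_shard() may differ in details such as error messages and memory mapping.

```python
import numpy as np

HEADER_INTS = 256           # header is 256 int32 slots
MAGIC, VERSION = 20240520, 1

def write_shard(path, tokens):
    """Write tokens (uint16) preceded by the 256-int32 header."""
    header = np.zeros(HEADER_INTS, dtype=np.int32)
    header[0], header[1], header[2] = MAGIC, VERSION, len(tokens)
    with open(path, "wb") as f:
        f.write(header.tobytes())
        f.write(np.asarray(tokens, dtype=np.uint16).tobytes())

def load_data_shard(path):
    """Validate magic/version and the implied file size, then return tokens."""
    with open(path, "rb") as f:
        header = np.frombuffer(f.read(HEADER_INTS * 4), dtype=np.int32)
        assert header[0] == MAGIC, "bad magic number"
        assert header[1] == VERSION, "unsupported version"
        num_tokens = int(header[2])
        tokens = np.frombuffer(f.read(), dtype=np.uint16)
    # matching token count implies size == 256*4 + num_tokens*2
    assert tokens.size == num_tokens, "file size mismatch"
    return tokens
```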
Validation Split
Validation shards follow the naming pattern fineweb_val_*.bin. They represent the fixed first-50,000-document set from the frozen shuffled FineWeb export and are always downloaded in full by cached_challenge_fineweb.py.
During training, load_validation_tokens() concatenates all val shards into a single token tensor and truncates it to a multiple of TRAIN_SEQ_LEN. Evaluation uses this fixed corpus to compute both val_loss (cross-entropy in nats) and val_bpb (bits per byte, tokenizer-agnostic).
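Under the standard bits-per-byte definition, val_bpb can be derived from the summed cross-entropy in nats and the UTF-8 byte length of the evaluated text; this makes scores comparable across tokenizers with different vocabularies. The sketch below assumes this standard definition — the repo's exact computation may organize the terms differently.

```python
import math

def bits_per_byte(total_nats, total_bytes):
    """Convert summed cross-entropy (nats) over the val set into bits per byte.

    total_nats:  sum of per-token cross-entropy in nats over all evaluated tokens
    total_bytes: UTF-8 byte length of the text those tokens decode to
    """
    return total_nats / (math.log(2) * total_bytes)
```

Because the denominator is measured in bytes of the original text, a tokenizer with a larger vocabulary (fewer tokens per byte) does not automatically get a better score — only genuinely lower total cross-entropy does.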
Training Shards
Training shards follow the naming pattern fineweb_train_*.bin. Training on the first N shards means training on the prefix of the same frozen shuffled export, keeping data order aligned with the baseline for each tokenizer family.
The manifest file (data/manifest.json) records how many shards exist for each variant. The downloader enforces the shard count limit — requesting more shards than are published raises an error.
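The manifest-driven limit check might look like the following sketch. The manifest's JSON field names here (a per-variant entry with a num_train_shards count) are assumptions for illustration — consult data/manifest.json for the actual schema.

```python
import json

def resolve_shard_count(manifest_json, variant, requested):
    """Return the shard count to download, raising if more shards are
    requested than the manifest publishes for this variant.

    Field names ("num_train_shards") are hypothetical.
    """
    manifest = json.loads(manifest_json)
    available = manifest[variant]["num_train_shards"]
    if requested > available:
        raise ValueError(
            f"requested {requested} shards but only {available} are published")
    return requested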
Downloading Published Data
Use data/cached_challenge_fineweb.py to download tokenized shards. The script is manifest-driven: it reads manifest.json from the HF repo to discover available shards and tokenizer artifacts for the requested variant.
Download commands
The default download produces these local files:
- data/datasets/fineweb10B_sp1024/fineweb_train_*.bin
- data/datasets/fineweb10B_sp1024/fineweb_val_*.bin
- data/tokenizers/fineweb_1024_bpe.model
Using a custom dataset repository
You can point the downloader at your own Hugging Face dataset repo.
Also downloading the source documents
Pass --with-docs to also download docs_selected.jsonl and its sidecar manifest, which are needed for retokenizing with a custom tokenizer.
Tokenizer Specs
The default tokenizer is a SentencePiece BPE model with vocabulary size 1024. Tokenizer metadata is tracked in data/tokenizer_specs.json. The dataset_suffix field (sp1024) maps to the dataset directory name (fineweb10B_sp1024) and the --variant argument accepted by cached_challenge_fineweb.py.
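A spec entry might look roughly like the following. Only dataset_suffix, the vocabulary size, and the model filename are documented above; the surrounding key names and nesting are assumptions for illustration, so check data/tokenizer_specs.json for the real schema.

```json
{
  "sp1024": {
    "model_file": "fineweb_1024_bpe.model",
    "vocab_size": 1024,
    "dataset_suffix": "sp1024"
  }
}
```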
Rebuilding Tokenizers
To retrain a tokenizer or re-export shards from the exact same selected documents, run data/download_hf_docs_and_tokenize.py against the published docs cache.
docs_selected.source_manifest.json includes a docs_sha256 checksum so you can verify you are rebuilding from exactly the same document list and order as the baseline export.
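Verifying that checksum before rebuilding could look like this sketch. The docs_sha256 field name comes from the text above; the assumption here is that the sidecar manifest is JSON with that key at the top level.

```python
import hashlib
import json

def verify_docs_checksum(docs_path, manifest_path):
    """Compare the SHA-256 of the docs file against the sidecar manifest's
    docs_sha256 field, streaming in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with open(docs_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    with open(manifest_path) as f:
        expected = json.load(f)["docs_sha256"]
    if h.hexdigest() != expected:
        raise ValueError("docs file does not match baseline checksum")
    return True
```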
Submissions that change the tokenizer are examined more carefully during review. If you retokenize, you must demonstrate that val_bpb is calculated correctly, since tokenizer bugs can unjustly improve the score.
CPU-Heavy Export Knobs
For large-scale shard exports, environment variables can be set to tune tokenization throughput.
Data Loading During Training
The training script streams tokens from shards using two classes.
TokenStream
Reads shards sequentially and wraps around forever. Shards are sorted by filename and cycled in order. The stream has no randomness or worker threads — it provides deterministic, simple sequential access.
take(n) returns the next n tokens, advancing across shard boundaries as needed and wrapping back to shard 0 after the last shard.
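The behavior described above can be sketched with in-memory shards standing in for .bin files; the real class reads shard files from disk, but the sequential-with-wraparound semantics of take(n) are the same.

```python
import numpy as np

class TokenStream:
    """Deterministic sequential reader over shards (sketch: shards are
    plain arrays here; the real class reads sorted .bin files)."""

    def __init__(self, shards):
        self.shards = [np.asarray(s, dtype=np.uint16) for s in shards]
        self.shard_idx = 0   # which shard we are in
        self.pos = 0         # offset within the current shard

    def take(self, n):
        """Return the next n tokens, crossing shard boundaries as needed
        and wrapping back to shard 0 after the last shard."""
        out = np.empty(n, dtype=np.uint16)
        filled = 0
        while filled < n:
            shard = self.shards[self.shard_idx]
            chunk = shard[self.pos:self.pos + n - filled]
            out[filled:filled + len(chunk)] = chunk
            filled += len(chunk)
            self.pos += len(chunk)
            if self.pos == len(shard):   # advance, wrapping at the end
                self.shard_idx = (self.shard_idx + 1) % len(self.shards)
                self.pos = 0
        return out
```

Because there is no shuffling and no worker threads, two runs over the same shard list always yield identical token sequences.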
DistributedTokenLoader
Wraps TokenStream for multi-GPU training. Each call to next_batch() takes a contiguous chunk from the stream sized for all ranks, then slices out one disjoint local_tokens + 1 span per rank. The extra token enables (x, y) pair construction by shifting. Gradient accumulation is handled by calling next_batch() once per micro-step inside the accumulation loop; each call consumes a fresh local_tokens + 1 span from the stream.
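The per-rank slicing can be sketched as follows. This is an illustrative reimplementation, not the repo's code; it assumes only that the stream exposes take(n) and that each rank advances the stream identically so the spans stay aligned.

```python
import numpy as np

class DistributedTokenLoader:
    """Sketch of per-rank batch slicing over a shared sequential stream."""

    def __init__(self, stream, rank, world_size, local_tokens):
        self.stream = stream          # must expose take(n)
        self.rank = rank
        self.world_size = world_size
        self.local_tokens = local_tokens

    def next_batch(self):
        """Take one contiguous chunk sized for all ranks, slice out this
        rank's disjoint local_tokens + 1 span, and shift to form (x, y)."""
        span = self.local_tokens + 1
        chunk = self.stream.take(span * self.world_size)
        local = chunk[self.rank * span:(self.rank + 1) * span]
        x, y = local[:-1], local[1:]   # y is x shifted by one token
        return x, y
```

Since every rank takes the same global chunk and slices by its own rank index, the spans are disjoint and the stream position stays in lockstep across ranks without any communication.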