## Tokenizer variants
The baseline uses `sp1024` — a 1024-token SentencePiece BPE vocabulary trained on FineWeb documents. The `--variant` flag selects which tokenizer family to download. Submissions that change the tokenizer will be examined carefully during review, since bugs may affect `val_bpb` scores.
## Downloading data
The download script is `data/cached_challenge_fineweb.py`. Pass `--variant` and, optionally, `--train-shards` to control how much training data to fetch.
### Download for local smoke testing (1 shard)
For quick local experiments, download a single training shard:
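A minimal invocation might look like this (the `sp1024` variant name is the baseline from above; check `--help` on the script for the full flag list):

```shell
# Fetch the validation split plus a single 100M-token training shard.
python data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
```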
### Download the standard 8B token set (default)
The default download fetches the full validation split plus 80 training shards (8B tokens):
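With no `--train-shards` argument, the downloader uses its default of 80 shards (the `sp1024` variant value is the baseline assumed here):

```shell
# Fetch the full validation split plus 80 training shards (8B tokens).
python data/cached_challenge_fineweb.py --variant sp1024
```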
Each shard contains 100,000,000 tokens. 80 shards = 8B tokens; 100 shards = 10B tokens.
### Directory layout after download
After running the downloader, your local data directory contains the full validation split plus however many training shards you requested.

### Validation split
The validation set is always the fixed first-50,000-document slice of FineWeb, stored in `fineweb_val_*` shards. It is always downloaded in full regardless of `--train-shards`. All `val_bpb` scores — local and leaderboard — are computed on this same split.
### Using a custom dataset repository
If you have exported your own dataset to Hugging Face (for example, a 50B-token export with a custom tokenizer), override the repository and path with environment variables. The defaults point at `willdepueoai/parameter-golf`, rooted under the `datasets/` subdirectory. The downloader is manifest-driven, so it fetches only the prefix of shards you request from a larger export.
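A sketch of such an override. The variable names below are hypothetical — consult `data/cached_challenge_fineweb.py` for the ones it actually reads:

```shell
# Hypothetical variable names; check the downloader source for the real ones.
export MATCHED_FINEWEB_DATASET_REPO=your-user/your-50b-export
export MATCHED_FINEWEB_DATASET_PATH=datasets

# Then run the downloader as usual, e.g.:
# python data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
```

Because the downloader is manifest-driven, pointing it at a 50B-token export and requesting 80 shards fetches only the first 8B tokens of that export.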
## Rebuilding tokenizers
To retokenize from scratch, you first need the source documents. Pass `--with-docs` to the downloader to also fetch `docs_selected.jsonl` and its sidecar manifest:
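For example (the `sp1024` variant value is assumed; any variant works with `--with-docs`):

```shell
# Also fetch docs_selected.jsonl and its sidecar manifest.
python data/cached_challenge_fineweb.py --variant sp1024 --with-docs
```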
`docs_selected.source_manifest.json` includes a `docs_sha256` field so you can verify you are rebuilding from the exact same document list and order as the baseline export.
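A sketch of that verification. The file names and the `docs_sha256` field are from the docs above; the stand-in files are fabricated here only so the snippet is self-contained — in practice both come from the downloader:

```shell
# Stand-in pair, purely to demonstrate the check.
printf '{"text":"example"}\n' > docs_selected.jsonl
hash=$(sha256sum docs_selected.jsonl | awk '{print $1}')
printf '{"docs_sha256":"%s"}\n' "$hash" > docs_selected.source_manifest.json

# The actual check: recompute the hash and compare against the manifest.
expected=$(python3 -c "import json; print(json.load(open('docs_selected.source_manifest.json'))['docs_sha256'])")
actual=$(sha256sum docs_selected.jsonl | awk '{print $1}')
if [ "$expected" = "$actual" ]; then echo "docs match baseline export"; else echo "MISMATCH"; fi
```

A mismatch means your document list or ordering differs from the baseline export, and retokenized shards would not be comparable.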
## Performance knobs
For CPU-heavy export jobs, the following environment variables control parallelism and batch sizes during shard tokenization:

| Variable | Description |
|---|---|
| `MATCHED_FINEWEB_SP_BATCH_SIZE` | Batch size for SentencePiece encoding (default: 2048) |
| `MATCHED_FINEWEB_TOKENIZER_THREADS` | Thread count for tokenizer encoding (default: 16) |
| `MATCHED_FINEWEB_TIKTOKEN_THREADS` | Thread count for tiktoken encoding (default: 16) |
| `MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE` | Batch size for GPT-2 decode in the blobstore path (default: 512) |
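For instance, on a machine with more cores and memory than the defaults assume, you might raise the knobs before launching an export (the values here are illustrative, not recommendations):

```shell
# Illustrative settings for a larger export box; defaults are in the table above.
export MATCHED_FINEWEB_SP_BATCH_SIZE=4096
export MATCHED_FINEWEB_TOKENIZER_THREADS=32
export MATCHED_FINEWEB_TIKTOKEN_THREADS=32
export MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=1024
```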
## Next steps
- **Local training**: Run your first training job on Apple Silicon with the downloaded data.
- **Remote GPU training**: Scale up to cloud H100s for full leaderboard runs.
