Parameter Golf uses a cached, retokenized version of the FineWeb dataset. The downloader fetches pre-tokenized binary shards and tokenizer models from a Hugging Face repository, so you don’t need to process raw text yourself.

Tokenizer variants

The baseline uses sp1024, a 1024-token SentencePiece BPE vocabulary trained on FineWeb documents. The --variant flag selects which tokenizer family to download. Submissions that change the tokenizer are examined carefully during review, since tokenizer bugs can distort val_bpb scores.

Downloading data

The download script is data/cached_challenge_fineweb.py. Pass --variant and optionally --train-shards to control how much training data to fetch.
1. Download for local smoke testing (1 shard)

For quick local experiments, download a single training shard:
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
2. Download the standard 8B token set (default)

The default download fetches the full validation split plus 80 training shards (8B tokens):
python3 data/cached_challenge_fineweb.py --variant sp1024
3. Download the full 10B token set

For the maximum training data available, fetch 100 shards:
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
Each shard contains 100,000,000 tokens. 80 shards = 8B tokens; 100 shards = 10B tokens.
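The shard arithmetic above can be sketched directly. The tokens-per-shard constant comes from this page; the byte estimate assumes token IDs are stored as 2-byte integers (plausible for a 1024-entry vocabulary, but an assumption, not a documented fact):

```python
# Each shard holds 100M tokens (stated on this page).
TOKENS_PER_SHARD = 100_000_000
BYTES_PER_TOKEN = 2  # assumption: uint16 storage for a 1024-token vocab

def total_tokens(num_shards: int) -> int:
    """Total training tokens for a given --train-shards value."""
    return num_shards * TOKENS_PER_SHARD

def approx_download_bytes(num_shards: int) -> int:
    """Rough on-disk size of the training shards under the uint16 assumption."""
    return total_tokens(num_shards) * BYTES_PER_TOKEN
```

Under these assumptions, the default 80-shard download is 8B tokens, roughly 16 GB of training shards.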

Directory layout after download

After running the downloader, your local layout looks like this:
data/
  datasets/
    fineweb10B_sp1024/
      fineweb_train_*.bin   # Training shards
      fineweb_val_*.bin     # Validation shards (fixed set)
  tokenizers/
    fineweb_1024_bpe.model  # SentencePiece model
  manifest.json
  docs_selected.jsonl
  docs_selected.source_manifest.json
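A quick way to sanity-check a download is to count shards against the layout above. This is a minimal sketch using only the documented naming scheme (fineweb_train_*.bin / fineweb_val_*.bin), not code from the repo:

```python
from pathlib import Path

def count_shards(dataset_dir: str) -> dict:
    """Count train/val shard files in a dataset directory,
    e.g. data/datasets/fineweb10B_sp1024/."""
    root = Path(dataset_dir)
    return {
        "train": len(list(root.glob("fineweb_train_*.bin"))),
        "val": len(list(root.glob("fineweb_val_*.bin"))),
    }
```

After an 80-shard download you would expect `count_shards(...)["train"] == 80`.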

Validation split

The validation set is the fixed first-50,000-document slice of FineWeb, stored in the fineweb_val_* shards. It is downloaded in full regardless of --train-shards. All val_bpb scores, local and leaderboard, are computed on this same split.
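For reference, bits per byte (bpb) is a byte-normalized version of cross-entropy, which makes scores comparable across tokenizers. The conversion below is the standard definition, not code from the Parameter Golf repo; `bytes_per_token` is the average number of UTF-8 bytes each token covers on the validation split:

```python
import math

def to_bpb(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token
```

A model at ln(2) nats per token on a tokenizer averaging 1 byte per token scores exactly 1.0 bpb.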

Using a custom dataset repository

If you have exported your own dataset to Hugging Face (for example, a 50B token export with a custom tokenizer), override the repo and path with environment variables:
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
The default published repo is willdepueoai/parameter-golf, rooted under the datasets/ subdirectory. The downloader is manifest-driven, so it fetches only the prefix of shards you request from a larger export.
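Conceptually, a manifest-driven prefix fetch looks like the sketch below: take the first N training shards from an ordered manifest, and always take every validation shard. The manifest schema here is illustrative only; the actual manifest.json format may differ:

```python
def select_shards(manifest: dict, train_shards: int) -> list:
    """Pick the first `train_shards` training shards plus the full
    validation set from an ordered shard manifest (hypothetical schema)."""
    train = manifest["train_shards"][:train_shards]
    val = manifest["val_shards"]  # validation is always fetched in full
    return train + val
```

This is why requesting 100 shards from a larger (e.g. 50B token) export simply downloads a prefix of it.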

Rebuilding tokenizers

To retokenize from scratch, you first need the source documents. Pass --with-docs to the downloader to also fetch docs_selected.jsonl and its sidecar manifest:
python3 data/cached_challenge_fineweb.py --variant sp1024 --with-docs
Then run the standalone retokenizer against the downloaded docs:
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json
The sidecar docs_selected.source_manifest.json includes a docs_sha256 field so you can verify you are rebuilding from the exact same document list and order as the baseline export.
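Verification of that hash can be done with the standard library. The file and field names come from this page; the rest of the manifest schema is assumed:

```python
import hashlib
import json

def verify_docs(docs_path: str, manifest_path: str) -> bool:
    """Check docs_selected.jsonl against the docs_sha256 field in its
    sidecar manifest, hashing in 1 MiB chunks to bound memory use."""
    h = hashlib.sha256()
    with open(docs_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    with open(manifest_path) as f:
        expected = json.load(f)["docs_sha256"]
    return h.hexdigest() == expected
```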

Performance knobs

For CPU-heavy export jobs, the following environment variables control parallelism and batch sizes during shard tokenization:
Variable                                  Description
MATCHED_FINEWEB_SP_BATCH_SIZE             Batch size for SentencePiece encoding (default: 2048)
MATCHED_FINEWEB_TOKENIZER_THREADS         Thread count for tokenizer encoding (default: 16)
MATCHED_FINEWEB_TIKTOKEN_THREADS          Thread count for tiktoken encoding (default: 16)
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE    Batch size for GPT-2 decode in blobstore path (default: 512)
Example with custom parallelism:
MATCHED_FINEWEB_SP_BATCH_SIZE=2048 \
MATCHED_FINEWEB_TOKENIZER_THREADS=16 \
MATCHED_FINEWEB_TIKTOKEN_THREADS=16 \
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512 \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
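A sketch of how such knobs are typically read, using the variable names and defaults documented above (the actual downloader's parsing code may differ):

```python
import os

def knob(name: str, default: int) -> int:
    """Read an integer tuning knob from the environment, falling back
    to its documented default."""
    return int(os.environ.get(name, default))

sp_batch_size = knob("MATCHED_FINEWEB_SP_BATCH_SIZE", 2048)
tokenizer_threads = knob("MATCHED_FINEWEB_TOKENIZER_THREADS", 16)
```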

Next steps

Local training

Run your first training job on Apple Silicon with the downloaded data.

Remote GPU training

Scale up to cloud H100s for full leaderboard runs.