If you have a Mac with Apple Silicon, the MLX training script lets you iterate locally before scaling to cloud GPUs. This is a good way to validate your setup and experiment with small runs cheaply.
If you don’t have a Mac with Apple Silicon, you can ask Codex to refactor train_gpt_mlx.py to remove the MLX dependency. It may still be slow, so jumping straight to remote GPU training is also a good option.

Prerequisites

  • Mac with Apple Silicon (M1 or later)
  • Python 3.10+
  • Git

Setup

1. Clone the repository

git clone https://github.com/openai/parameter-golf.git
cd parameter-golf

2. Create a virtual environment and install dependencies

Create a fresh Python environment and install the packages needed for the MLX path and dataset download:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
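Before moving on, you can confirm the install succeeded. A minimal sketch that checks each package is importable (note that huggingface-hub installs under the module name huggingface_hub):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of names whose top-level module cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages from the install step above:
required = ["mlx", "numpy", "sentencepiece", "huggingface_hub", "datasets", "tqdm"]
print("Missing:", missing_packages(required) or "none")
```

Using importlib.util.find_spec avoids actually importing heavy packages just to check they exist.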

3. Download FineWeb training data

Download the cached FineWeb export using the 1024-token SentencePiece vocabulary. For a quick smoke test, start with 10 shards:
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
This populates ./data/datasets/fineweb10B_sp1024/ and ./data/tokenizers/. See Data setup for full download options.
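To sanity-check the download, you can list what landed in the dataset directory. A small sketch (the directory path comes from the step above; the shard file naming is whatever the download script produced):

```python
from pathlib import Path

def list_shards(data_dir):
    """Return file names in the dataset directory, sorted; [] if absent."""
    root = Path(data_dir)
    if not root.exists():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())

# Directory populated by the download step above:
for name in list_shards("data/datasets/fineweb10B_sp1024"):
    print(name)
```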

4. Run your first training job

Launch a small 200-iteration smoke run:
RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py

Understanding the output

Setting VAL_LOSS_EVERY=0 skips periodic validation during training; the script prints val_loss and val_bpb once after training completes. Validation always runs on the full fineweb_val_* split, the fixed set of the first 50,000 documents. This is the same set used for leaderboard scoring, so local val_bpb numbers are directly comparable.
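val_bpb normalizes loss by raw text bytes, which makes scores comparable across tokenizers. A minimal sketch of the usual conversion, assuming val_loss is mean cross-entropy in nats per token (the exact byte accounting inside the script is an assumption, not taken from the source):

```python
import math

def loss_to_bpb(loss_nats, tokens, text_bytes):
    """Convert mean cross-entropy (nats per token) to bits per byte."""
    bits_per_token = loss_nats / math.log(2)   # nats -> bits
    return bits_per_token * tokens / text_bytes

# Illustrative numbers only: 2.8 nats/token over 50M tokens of 200M-byte text.
print(round(loss_to_bpb(2.8, 50_000_000, 200_000_000), 3))
```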

Key environment variables

| Variable | Default | Description |
| --- | --- | --- |
| RUN_ID | random UUID | Name for this run's log directory |
| ITERATIONS | 20000 | Number of training steps |
| TRAIN_BATCH_TOKENS | 524288 | Tokens per training step |
| VAL_LOSS_EVERY | 0 (MLX default) | Validate every N steps; 0 = end only. Note: train_gpt.py defaults to 1000. |
| VAL_BATCH_SIZE | 524288 | Tokens per validation pass |
| MAX_WALLCLOCK_SECONDS | 600 | Hard stop after this many seconds |
| DATA_PATH | ./data/datasets/fineweb10B_sp1024 | Path to dataset shards |
| TOKENIZER_PATH | ./data/tokenizers/fineweb_1024_bpe.model | Path to tokenizer model |
For faster iteration on Apple Silicon, reduce TRAIN_BATCH_TOKENS and VAL_BATCH_SIZE to 8192 as shown in the smoke command above. This makes each step much faster at the cost of noisier gradient estimates.
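The variables above follow the usual environment-with-defaults pattern. A sketch of how such a script might read them (the parsing details are an assumption, not the actual train_gpt_mlx.py source):

```python
import uuid

def read_config(env):
    """Parse the table's variables from an environment mapping (sketch)."""
    return {
        "run_id": env.get("RUN_ID") or str(uuid.uuid4()),
        "iterations": int(env.get("ITERATIONS", "20000")),
        "train_batch_tokens": int(env.get("TRAIN_BATCH_TOKENS", "524288")),
        "val_loss_every": int(env.get("VAL_LOSS_EVERY", "0")),
        "val_batch_size": int(env.get("VAL_BATCH_SIZE", "524288")),
        "max_wallclock_seconds": int(env.get("MAX_WALLCLOCK_SECONDS", "600")),
        "data_path": env.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024"),
    }

# The smoke-run overrides from earlier, passed as a plain dict:
cfg = read_config({"ITERATIONS": "200", "TRAIN_BATCH_TOKENS": "8192"})
print(cfg["iterations"], cfg["train_batch_tokens"])
```

In the real script you would pass os.environ instead of a literal dict.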

Next steps

Remote GPU training

Scale up to cloud H100s via Runpod for full leaderboard runs.

Data setup

Download the full 10B token dataset or configure custom tokenizer variants.