If you don’t have a Mac with Apple Silicon, you can ask Codex to refactor
train_gpt_mlx.py to remove the MLX dependency. It may still be slow, so jumping straight to remote GPU training is also a good option.

## Prerequisites
- Mac with Apple Silicon (M1 or later)
- Python 3.10+
- Git
## Setup

### Create a virtual environment and install dependencies

Create a fresh Python environment and install the packages needed for the MLX path and dataset download:
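A minimal sketch of this step, assuming a standard `venv` workflow; the package list here is an assumption, so prefer the repository’s requirements file if one exists:

```shell
# Create and activate an isolated environment (macOS / Apple Silicon)
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies for the MLX training path and dataset download.
# Package names are assumptions; check the repo's requirements for the real list.
pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface_hub
```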
### Download FineWeb training data

Download the cached FineWeb export using the 1024-token SentencePiece vocabulary. For a quick smoke test, start with 10 shards. This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`. See Data setup for full download options.

## Understanding the output
Setting `VAL_LOSS_EVERY=0` skips periodic validation during training. The script prints `val_loss` and `val_bpb` once at the very end, after training completes.
Validation always runs on the full `fineweb_val_*` split — the fixed first-50,000-document set. This is the same set used for leaderboard scoring, so local `val_bpb` numbers are directly comparable.
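Assuming `val_bpb` means bits per byte (the usual tokenizer-independent metric), the conversion from mean cross-entropy in nats per token is: divide by ln 2 to get bits per token, then scale by the ratio of token count to byte count of the evaluated text. A short sketch — the function name and sample numbers below are illustrative, not taken from the actual script:

```python
import math

def nats_per_token_to_bpb(val_loss: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte."""
    bits_per_token = val_loss / math.log(2)          # nats -> bits
    return bits_per_token * total_tokens / total_bytes

# Illustrative numbers only: a 2.8-nat loss over 50M tokens spanning 230MB of text
bpb = nats_per_token_to_bpb(2.8, 50_000_000, 230_000_000)
```

Because the validation split is fixed, the token/byte ratio is constant across runs, which is what makes `val_bpb` comparable between tokenizer variants.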
## Key environment variables
| Variable | Default | Description |
|---|---|---|
| `RUN_ID` | random UUID | Name for this run’s log directory |
| `ITERATIONS` | 20000 | Number of training steps |
| `TRAIN_BATCH_TOKENS` | 524288 | Tokens per training step |
| `VAL_LOSS_EVERY` | 0 (MLX default) | Validate every N steps; 0 = end only. Note: `train_gpt.py` defaults to 1000. |
| `VAL_BATCH_SIZE` | 524288 | Tokens per validation pass |
| `MAX_WALLCLOCK_SECONDS` | 600 | Hard stop after this many seconds |
| `DATA_PATH` | `./data/datasets/fineweb10B_sp1024` | Path to dataset shards |
| `TOKENIZER_PATH` | `./data/tokenizers/fineweb_1024_bpe.model` | Path to tokenizer model |
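Putting the variables together, a run can be launched by setting them inline before the script. The script name `train_gpt_mlx.py` comes from this guide; the specific values below are illustrative, not recommendations:

```shell
# Short smoke-test run: no periodic validation, 10-minute hard stop
RUN_ID=smoke-test \
ITERATIONS=500 \
TRAIN_BATCH_TOKENS=524288 \
VAL_LOSS_EVERY=0 \
MAX_WALLCLOCK_SECONDS=600 \
python train_gpt_mlx.py
```

Unset variables fall back to the defaults in the table above.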
## Next steps

- **Remote GPU training**: Scale up to cloud H100s via Runpod for full leaderboard runs.
- **Data setup**: Download the full 10B token dataset or configure custom tokenizer variants.
