Once you’ve validated your setup locally, or you want more compute, switch to a remote CUDA machine. Leaderboard submissions must run in under 10 minutes on 8xH100s (SXM variant). This guide walks through launching a pod on Runpod, which OpenAI is partnering with to make setup as easy as possible.
An 8xH100 pod costs around $20/hour. Test your changes on cheaper GPU SKUs first and only switch to 8xH100s for final leaderboard submissions.

Prerequisites

  • A Runpod account with billing set up
  • An SSH key configured in the Runpod Settings tab

Setup

Step 1: Create a Runpod account and add an SSH key

Sign up at console.runpod.io. Go to Settings and add your SSH public key so you can connect to pods from your terminal.
If you’re new to SSH keys, ask Codex to walk you through generating and adding one.
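If you'd rather do it by hand, generating and printing a key takes two commands (the filename here is just a suggestion; any path works):

```shell
# Generate an Ed25519 key pair; you'll be prompted for an optional passphrase
ssh-keygen -t ed25519 -C "runpod" -f ~/.ssh/runpod_ed25519

# Print the public half — paste this into the Runpod Settings tab
cat ~/.ssh/runpod_ed25519.pub
```

The private key (`runpod_ed25519`, no extension) stays on your machine; only the `.pub` file goes into Runpod.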
Step 2: Launch a 1xH100 pod using the official template

Deploy using the official Parameter Golf template ("Launch template on Runpod" in the console):
  • Select a 1xH100 GPU for initial experiments
  • Enable SSH terminal access
  • Leave all other settings at their defaults
  • Click Deploy
All Python dependencies are pre-installed in the image — you don’t need to run pip install.
Step 3: Clone the repository on your remote machine

SSH into your pod once it’s running. You’ll land in /workspace/.
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
Step 4: Download FineWeb training data

Download the cached 1024-token FineWeb export. This defaults to the full validation split plus 80 training shards (8B tokens):
python3 data/cached_challenge_fineweb.py --variant sp1024
To download a smaller subset while iterating, pass --train-shards N:
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
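To confirm the download landed where the training command expects it, a small helper that counts files is enough. This is a sketch: the glob pattern is an assumption about the shard filenames, so adjust it to whatever the script actually wrote under the dataset directory:

```python
from pathlib import Path

def count_shards(data_dir: str, pattern: str = "*train*") -> int:
    """Count files matching `pattern` under `data_dir` (0 if the dir is missing)."""
    p = Path(data_dir)
    return len(list(p.glob(pattern))) if p.is_dir() else 0

# e.g. count_shards("./data/datasets/fineweb10B_sp1024")
```

With `--train-shards 1` you should see far fewer training files than the full 80-shard download.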
Step 5: Launch your first training run

Run a single-GPU training job. The --nproc_per_node=1 flag matches the single available GPU:
RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
For the baseline config, the final val_bpb should land around 1.2, with a compressed model size under 16MB.
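As a sanity check on the metric, bits per byte is the per-token cross-entropy renormalized from nats to bits and from tokens to bytes. A minimal sketch, assuming the loss is reported in nats per token and you know the average bytes per token of the validation text (the script's exact normalization may differ):

```python
import math

def loss_to_bpb(val_loss_nats: float, bytes_per_token: float) -> float:
    """Convert per-token cross-entropy (nats) to bits per byte."""
    return val_loss_nats / (math.log(2) * bytes_per_token)
```

Intuitively, a lower val_bpb means the model compresses the validation text into fewer bits per raw byte.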

Scaling to 8xH100s

To run a full leaderboard-eligible submission, set --nproc_per_node to match the number of GPUs on your pod:
RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
Only use 8xH100 pods for final leaderboard submissions. At ~$20/hour, iterating on this SKU is expensive. Use 1xH100 or cheaper GPUs for all development.

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| RUN_ID | random UUID | Name for this run's log directory |
| DATA_PATH | ./data/datasets/fineweb10B_sp1024 | Path to dataset shards |
| TOKENIZER_PATH | ./data/tokenizers/fineweb_1024_bpe.model | Path to tokenizer model |
| VOCAB_SIZE | 1024 | Vocabulary size; must match the tokenizer |
| MAX_WALLCLOCK_SECONDS | 600 | Hard stop in seconds; set to 0 to disable |
| VAL_LOSS_EVERY | 1000 | Print validation loss every N steps |
| TRAIN_LOG_EVERY | 200 | Print training loss every N steps |
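To see how these defaults interact, here is a hypothetical sketch of how a script like train_gpt.py might resolve them; the actual parsing in the repository may differ:

```python
import os
import uuid

# Unset variables fall back to the defaults listed above.
run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
vocab_size = int(os.environ.get("VOCAB_SIZE", "1024"))
max_wallclock = float(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))

# A value of 0 disables the wallclock cap entirely.
wallclock_enabled = max_wallclock > 0
```

Because these are plain environment variables, you can prefix any torchrun command with overrides, exactly as the training commands above do.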

Understanding the log output

During training, the script prints train_loss every TRAIN_LOG_EVERY steps. At the end, it prints:
  • val_loss — cross-entropy loss on the validation set
  • val_bpb — bits per byte, the leaderboard metric
  • final_int8_zlib_roundtrip lines — compressed model size in bytes
By default, train_gpt.py enforces a 10-minute wallclock cap (MAX_WALLCLOCK_SECONDS=600). To run longer, set MAX_WALLCLOCK_SECONDS=0 to disable the cap, or pass an explicit number of seconds.
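If you want to pull these metrics out of a saved log programmatically, a small helper like the following works. The `metric:value` / `metric=value` layout it assumes is an illustration, not the confirmed log format; adjust the regex to match what train_gpt.py actually prints:

```python
import re
from typing import Optional

def extract_metric(log_text: str, name: str) -> Optional[float]:
    """Return the last reported value for `name` in a training log, or None."""
    matches = re.findall(
        rf"{re.escape(name)}[:=]\s*([0-9]*\.?[0-9]+)", log_text
    )
    return float(matches[-1]) if matches else None
```

This is handy for comparing val_bpb across runs without scrolling through full logs.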

Next steps

Data setup

Download the full 10B token dataset or customize tokenizer variants.

Submission requirements

Learn what files and logs are required for a valid leaderboard submission.