Bits-per-byte (BPB)
Cross-entropy loss is measured in nats (natural log units). BPB converts that into a tokenizer-agnostic compression rate measured per byte of source text rather than per token:

bpb = val_loss / log(2) × (tokens / bytes)

- val_loss: token-level cross-entropy loss in nats
- log(2): converts nats to bits
- tokens / bytes: the tokenizer's compression ratio for the validation set
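Under the definitions above, the conversion is a one-liner (the function name is illustrative, not from the script):

```python
import math

def bits_per_byte(val_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean token-level cross-entropy (in nats) to bits per byte."""
    # nats/token -> bits/token, then rescale by the tokens-per-byte ratio
    return val_loss_nats / math.log(2) * (num_tokens / num_bytes)
```

For example, a loss of log(2) nats per token on text where tokens and bytes are equal in number gives exactly 1.0 BPB.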
Why BPB instead of validation loss
A model with a 1024-token vocabulary produces far more tokens per document than a model with a 50,000-token vocabulary. Comparing raw token-level losses would unfairly favour small vocabularies, since each of their tokens carries less information and is therefore easier to predict. BPB corrects for this: a small-vocabulary tokenizer produces fewer bits per token (each token covers fewer bytes) but more tokens per byte, while a large-vocabulary tokenizer produces more bits per token (each token covers more bytes) but fewer tokens per byte. The tokens / bytes factor in the formula cancels that out, making scores comparable across any tokenizer.
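A worked example (with hypothetical numbers) makes the cancellation concrete: two tokenizers compress the same 400-byte document, one into 200 tokens at 2 bits/token, the other into 100 tokens at 4 bits/token. The raw losses differ by a factor of two, but BPB agrees:

```python
import math

def bpb(loss_nats, tokens, num_bytes):
    return loss_nats / math.log(2) * (tokens / num_bytes)

# Tokenizer A: small vocab, 200 tokens over a 400-byte document, 2 bits/token
a = bpb(2 * math.log(2), 200, 400)   # -> 1.0
# Tokenizer B: large vocab, 100 tokens over the same document, 4 bits/token
b = bpb(4 * math.log(2), 100, 400)   # -> 1.0
```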
BPB is the standard metric for lossless data-compression benchmarks. A BPB of 1.0 would mean the model compresses text to the same size as raw ASCII. State-of-the-art LLMs typically reach ~0.9 BPB on web text.
The eval_val() function
The scoring logic lives in eval_val() in train_gpt.py. It accumulates three distributed counters across all ranks: the summed cross-entropy loss, the token count, and the byte count.
Per-token byte lengths come from lookup tables built by build_sentencepiece_luts(): tokens with a ▁ prefix (the SentencePiece word-boundary marker) are counted as one extra byte, but only when the preceding token is not a boundary token (control/unknown/unused). This correctly handles the space that SentencePiece absorbs into the following token.
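The byte-counting rule can be sketched as follows. This is an illustrative stand-in for the lookup tables that build_sentencepiece_luts() precomputes; count_bytes and its signature are hypothetical:

```python
def count_bytes(token_ids, pieces, boundary_ids):
    """Count the source-text bytes covered by a token sequence.

    pieces maps token id -> piece string, where '▁' is the SentencePiece
    word-boundary marker. boundary_ids are control/unknown/unused ids.
    """
    total = 0
    prev_is_boundary = True
    for tid in token_ids:
        if tid in boundary_ids:
            prev_is_boundary = True  # boundary tokens cover no bytes
            continue
        piece = pieces[tid]
        if piece.startswith("▁"):
            total += len(piece[1:].encode("utf-8"))
            if not prev_is_boundary:
                total += 1  # the space SentencePiece absorbed into this token
        else:
            total += len(piece.encode("utf-8"))
        prev_is_boundary = False
    return total
```

For "hello world!" tokenized as ["▁hello", "▁world", "!"], the first ▁ costs nothing (start of text) while the second adds one byte for the real space, recovering the original 12 bytes.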
After all batches are processed, the counters are reduced across ranks and BPB is computed from the totals as total_loss / (log(2) × total_bytes).
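The final reduction can be sketched as below, assuming the three counters are (loss_sum, token_count, byte_count) per rank; in the distributed script the sums would come from a collective such as torch.distributed.all_reduce, simulated here with plain Python sums:

```python
import math

def reduce_and_score(per_rank):
    """per_rank: list of (loss_sum_nats, token_count, byte_count) tuples,
    one per rank. Returns (val_loss, val_bpb)."""
    loss_sum = sum(r[0] for r in per_rank)   # stand-in for all_reduce(SUM)
    tokens = sum(r[1] for r in per_rank)
    num_bytes = sum(r[2] for r in per_rank)
    val_loss = loss_sum / tokens             # mean cross-entropy in nats
    val_bpb = loss_sum / (math.log(2) * num_bytes)
    return val_loss, val_bpb
```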
The 16 MB artifact limit
The submission artifact is the sum of two quantities:
- 15,815,847 bytes for the compressed model
- 47,642 bytes for the training script
- 15,863,489 bytes total, safely under the limit
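A quick way to check a submission against the limit; artifact_size is a hypothetical helper, and whether "16 MB" means 16,000,000 or 16,777,216 bytes is not stated here (the total above is under either):

```python
import os

def artifact_size(script_path, model_path):
    """Total artifact bytes: UTF-8 script length plus compressed model size."""
    with open(script_path, "r", encoding="utf-8") as f:
        script_bytes = len(f.read().encode("utf-8"))
    return script_bytes + os.path.getsize(model_path)
```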
What counts toward the artifact
Counted
The train_gpt.py script (UTF-8 encoded byte length) plus the final_model.int8.ptz compressed model file.
Not counted
External downloads, network calls, or training dataset access during evaluation. These are also not allowed.
The final scoring metric
Leaderboard scores come from the post-quantization roundtrip evaluation, logged as val_bpb:
- Quantized to int8 per-row with zlib compression
- Saved to final_model.int8.ptz
- Decompressed and dequantized back to bf16/fp32
- Evaluated on the full FineWeb validation split
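The quantization roundtrip can be sketched as a minimal numpy version; the actual script presumably operates on torch tensors, and the exact rounding/scaling scheme is an assumption:

```python
import zlib
import numpy as np

def quantize_rows(w):
    """Symmetric per-row int8 quantization of a 2-D float matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def roundtrip(w):
    """Quantize, zlib-compress, decompress, and dequantize, as in scoring."""
    q, scale = quantize_rows(w)
    blob = zlib.compress(q.tobytes())
    q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
    return q2.astype(np.float32) * scale
```

Per-row scaling keeps the quantization error of each row proportional to that row's largest weight, which is why the dequantized model scores close to the bf16 original.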
Time limits
Training
Maximum 10 minutes wall-clock on 8×H100 SXM. The script enforces this with MAX_WALLCLOCK_SECONDS = 600.
Evaluation
Also under 10 minutes on 8×H100 SXM. This limit is in addition to the training time limit.
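The training-side guard can be sketched as a wall-clock check around the step loop; within_budget is illustrative, and only MAX_WALLCLOCK_SECONDS appears in the source:

```python
import time

MAX_WALLCLOCK_SECONDS = 600  # 10-minute budget on 8xH100 SXM

def within_budget(start_time, now=None):
    """True while elapsed wall-clock time is under the budget."""
    if now is None:
        now = time.time()
    return (now - start_time) < MAX_WALLCLOCK_SECONDS

# usage sketch:
#   start = time.time()
#   while within_budget(start):
#       train_step()  # hypothetical
```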
Submission verification
OpenAI does not automatically verify every submission but will verify top leaderboard entries. Non-reproducible results can be disqualified.
SOTA record requirements
Beat SOTA by ≥ 0.005 nats
New records must exceed the existing SOTA by at least 0.005 nats. Because of inter-run variance, multiple run logs are required to substantiate this.
Statistical significance: p < 0.01
Provide enough run logs to demonstrate at p < 0.01 significance that the 0.005-nat improvement is real. This requirement is waived for pure systems optimisations that do not change the ML.
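One way to check this from a set of run logs (lower scores are better): a one-sided test that the mean improvement over SOTA exceeds the margin. The function is hypothetical, and the normal approximation stands in for a proper t-test, which is reasonable only with enough runs:

```python
import math

def improvement_significant(new_runs, sota_score, margin=0.005, alpha=0.01):
    """One-sided z-test that mean(new_runs) beats sota_score by >= margin.

    new_runs: per-run final scores (lower is better); needs n >= 2.
    """
    n = len(new_runs)
    mean = sum(new_runs) / n
    var = sum((x - mean) ** 2 for x in new_runs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    # improvement = sota_score - mean; test H0: improvement <= margin
    z = ((sota_score - mean) - margin) / se
    p = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # one-sided normal p-value
    return p < alpha
```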
Reproducible in under 10 minutes
The submitted train_gpt.py must successfully compile and run on 8×H100 SXM in under 10 minutes.