Train GPT-2 (124M) on OpenWebText with distributed data parallel
This guide shows you how to reproduce GPT-2 (124M parameters) training results using the OpenWebText dataset. The training achieves a validation loss of ~2.85 in about 4 days on an 8x A100 40GB node.
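For the single 8x A100 node case, a `torchrun` launch along these lines should work (the `config/train_gpt2.py` argument assumes the standard nanoGPT repository layout; adjust to your setup):

```shell
# Launch 8 DDP workers on one node, one per GPU
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```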
GPT-2 (124M) evaluated directly on OpenWebText gets a validation loss of ~3.11, but finetuning brings it down to ~2.85. This indicates a domain gap between OpenWebText and the original (closed) WebText dataset.
For training across multiple nodes with Infiniband interconnect:
```shell
# Run on the first (master) node with IP 123.456.123.456:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=123.456.123.456 --master_port=1234 train.py

# Run on the worker node:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr=123.456.123.456 --master_port=1234 train.py
```
If you don’t have Infiniband, prepend NCCL_IB_DISABLE=1 to the commands above. Training will work but will be significantly slower.
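For example, the master-node command from above becomes (apply the same change on the worker node):

```shell
# Fall back from Infiniband to regular TCP/IP networking for NCCL
NCCL_IB_DISABLE=1 torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=123.456.123.456 --master_port=1234 train.py
```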
The “poor man’s data loader” from train.py:114-131 uses memory-mapped files:
```python
def get_batch(split):
    # Recreate the memmap every batch to avoid a memory leak
    data = np.memmap(os.path.join(data_dir, f'{split}.bin'), dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # Pin memory so the host-to-GPU copy can run asynchronously
        x = x.pin_memory().to(device, non_blocking=True)
        y = y.pin_memory().to(device, non_blocking=True)
    return x, y
```
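The same memmap sampling pattern can be exercised without torch or the real dataset. Below is a minimal NumPy-only sketch under stated assumptions: `block_size`, `batch_size`, and the tiny `train.bin` written to a temp directory are stand-ins for train.py's globals and the actual token file, and `get_batch_np` is a hypothetical name for this illustration:

```python
import os
import tempfile

import numpy as np

# Stand-ins for train.py's globals (assumed values, kept small for the demo)
block_size = 8
batch_size = 4

# Write a small token file in the same uint16 format as train.bin / val.bin
data_dir = tempfile.mkdtemp()
tokens = np.arange(1000, dtype=np.uint16)
tokens.tofile(os.path.join(data_dir, 'train.bin'))

def get_batch_np(split):
    # Re-open the memmap on every call, mirroring train.py's approach of
    # not holding one memmap object across batches
    data = np.memmap(os.path.join(data_dir, f'{split}.bin'),
                     dtype=np.uint16, mode='r')
    # Sample random starting offsets, leaving room for a full block
    ix = np.random.randint(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size].astype(np.int64) for i in ix])
    # Targets are the inputs shifted one token to the right
    y = np.stack([data[i + 1:i + 1 + block_size].astype(np.int64) for i in ix])
    return x, y

x, y = get_batch_np('train')
```

Because the demo tokens are just `0..999` in order, each target row equals its input row shifted by one, which makes the next-token relationship easy to verify by eye.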