The bench.py script provides a simplified benchmarking tool to measure training performance without the overhead of data loading, logging, and checkpointing.

Overview

bench.py is a stripped-down version of train.py that focuses on the core training loop for accurate performance measurement.
From README.md:207: “For simple model benchmarking and profiling, bench.py might be useful. It’s identical to what happens in the meat of the training loop of train.py, but omits much of the other complexities.”

Basic benchmarking

Run a simple benchmark

Execute bench.py with default settings:
python bench.py
This will:
  1. Run 10 warmup iterations (burnin phase)
  2. Run 20 benchmark iterations
  3. Report average time per iteration and MFU
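The burn-in/measure pattern those steps describe can be sketched in plain Python. This is a simplified stand-in of my own, not the actual bench.py loop; the real script additionally brackets the timed region with torch.cuda.synchronize() so asynchronous GPU work is fully counted:

```python
import time

def benchmark(step_fn, burnin=10, iters=20):
    """Run `burnin` untimed warmup steps, then average `iters` timed steps."""
    for _ in range(burnin):            # warm caches, trigger any lazy compilation
        step_fn()
    t0 = time.time()
    for _ in range(iters):
        step_fn()
    return (time.time() - t0) / iters  # average seconds per iteration

dt = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{dt * 1000:.3f} ms/iter")
```

Without the warmup phase, one-time costs (compilation, allocator warmup) would be averaged into the reported time.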

Benchmark output

Expected output:
Compiling model...
number of parameters: 85.00M
0/10 loss: 10.9596
1/10 loss: 10.9321
...
time per iteration: 135.4200ms, MFU: 45.23%

Configuration options

Customize benchmark parameters by modifying the script or using the configurator.

Key parameters

From bench.py:12-20:
batch_size = 12
block_size = 1024
bias = False
real_data = True
seed = 1337
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
profile = False # use pytorch profiler, or just simple benchmarking?
batch_size: Number of sequences per batch
  • Default: 12
  • Increase if you have more GPU memory
  • Directly impacts throughput
block_size: Context length (sequence length)
  • Default: 1024
  • Attention memory and compute scale quadratically with sequence length
  • Larger values = more memory required
bias: Include bias in Linear and LayerNorm layers
  • Default: False
  • False is faster and uses less memory
real_data: Use actual OpenWebText data vs random data
  • Default: True
  • Set to False to isolate model performance from I/O
compile: Enable PyTorch 2.0 compilation
  • Default: True
  • Significant speedup (~2x) when enabled
profile: Use PyTorch profiler for detailed metrics
  • Default: False
  • Enable for detailed profiling analysis

Override via command line

Use the configurator to override parameters:
python bench.py --batch_size=8 --compile=False --device=cuda:1
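The override mechanism can be sketched as follows. This is my own simplified stand-alone version of the --key=value parsing; the real configurator.py operates on the script's globals and can also exec() whole config files:

```python
import ast

def apply_overrides(config, argv):
    """Apply --key=value command-line overrides to a dict of defaults."""
    for arg in argv:
        if not arg.startswith('--'):
            continue
        key, _, val = arg[2:].partition('=')
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            config[key] = ast.literal_eval(val)  # bools, ints, floats, ...
        except (ValueError, SyntaxError):
            config[key] = val                    # keep as string, e.g. 'cuda:1'
    return config

defaults = {'batch_size': 12, 'compile': True, 'device': 'cuda'}
print(apply_overrides(defaults, ['--batch_size=8', '--compile=False', '--device=cuda:1']))
```

Using ast.literal_eval means `--compile=False` becomes the boolean False rather than the truthy string 'False'.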

Benchmarking modes

Simple benchmarking (default)

The default mode runs a warmup phase followed by timed iterations (bench.py:98-117):
# simple benchmarking
torch.cuda.synchronize()
for stage, num_steps in enumerate([10, 20]): # burnin, then benchmark
    t0 = time.time()
    X, Y = get_batch('train')
    for k in range(num_steps):
        with ctx:
            logits, loss = model(X, Y)
        X, Y = get_batch('train')
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        lossf = loss.item()
        print(f"{k}/{num_steps} loss: {lossf:.4f}")
    torch.cuda.synchronize()
    t1 = time.time()
    dt = t1-t0
    mfu = model.estimate_mfu(batch_size * 1 * num_steps, dt) # 1 fwd/bwd pass per step (no gradient accumulation)
    if stage == 1:
        print(f"time per iteration: {dt/num_steps*1000:.4f}ms, MFU: {mfu*100:.2f}%")
The two-stage approach ensures that compilation overhead and cache warmup don’t skew results.

PyTorch Profiler mode

Enable detailed profiling for analysis in TensorBoard:
python bench.py --profile=True
The profiler configuration (bench.py:66-94):
if profile:
    wait, warmup, active = 5, 5, 5
    num_steps = wait + warmup + active
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./bench_log'),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_flops=True,
        with_modules=False,
    ) as prof:
        # ... training loop ...
        prof.step() # notify the profiler at end of each step

View profiler results

After profiling, view results in TensorBoard:
tensorboard --logdir=./bench_log
The profiler provides detailed breakdowns of:
  • Time spent in each operation
  • Memory usage patterns
  • GPU kernel efficiency
  • FLOPS achieved
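The trace can also be inspected without TensorBoard via the profiler's key_averages() table. A hypothetical CPU-only example using the standard torch.profiler API (not taken from bench.py):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single matmul on CPU and print the aggregated per-operator table.
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    x = torch.randn(512, 512)
    y = x @ x

# Sort by total CPU time; on GPU runs, use sort_by="cuda_time_total" instead.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```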

Data loading options

Real data (default)

Use actual OpenWebText data (bench.py:33-43):
if real_data:
    dataset = 'openwebtext'
    data_dir = os.path.join('data', dataset)
    train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
    def get_batch(split):
        data = train_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
        return x, y

Fixed random data

Isolate model performance from data loading (bench.py:44-48):
else:
    # alternatively, if fixed data is desired to not care about data loading
    x = torch.randint(50304, (batch_size, block_size), device=device)
    y = torch.randint(50304, (batch_size, block_size), device=device)
    get_batch = lambda split: (x, y)
Random data eliminates data loading overhead but may not represent real training performance accurately.

Model configuration

The benchmark uses a GPT-2 124M sized model by default (bench.py:51-56):
gptconf = GPTConfig(
    block_size = block_size,
    n_layer = 12, n_head = 12, n_embd = 768,
    dropout = 0,
    bias = bias,
)

Benchmark different model sizes

Modify the configuration to test different architectures:
GPT-2 Small (124M)
n_layer = 12, n_head = 12, n_embd = 768
GPT-2 Medium (350M)
n_layer = 24, n_head = 16, n_embd = 1024
GPT-2 Large (774M)
n_layer = 36, n_head = 20, n_embd = 1280
GPT-2 XL (1.5B)
n_layer = 48, n_head = 25, n_embd = 1600
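A back-of-the-envelope check on these sizes, using an approximation of my own for illustration (~12 · n_layer · n_embd² for the transformer blocks plus vocab_size · n_embd for the tied token embedding; nanoGPT pads the GPT-2 vocabulary to 50304):

```python
def approx_params(n_layer, n_embd, vocab_size=50304):
    """Rough GPT-2 parameter count: transformer blocks + tied embedding."""
    blocks = 12 * n_layer * n_embd ** 2  # attention + MLP weight matrices
    embedding = vocab_size * n_embd      # wte, tied with lm_head
    return blocks + embedding

for name, (n_layer, n_embd) in [('small', (12, 768)), ('medium', (24, 1024)),
                                ('large', (36, 1280)), ('xl', (48, 1600))]:
    print(f"GPT-2 {name}: ~{approx_params(n_layer, n_embd) / 1e6:.0f}M params")
```

The estimates land within a few percent of the published 124M/350M/774M/1.5B figures.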

Interpreting results

Time per iteration

Measures the wall-clock time for one training step. These thresholds depend heavily on hardware; as a rough guide for the default GPT-2 124M configuration on an A100-class GPU:
  • Less than 100ms: excellent performance
  • 100-200ms: good performance
  • Greater than 200ms: may indicate configuration issues (e.g. compile disabled, thermal throttling)
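Time per iteration converts directly into token throughput, which is often the easier number to compare across configurations (tokens per iteration = batch_size × block_size):

```python
# Tokens/sec implied by a given ms-per-iteration at the default bench settings.
batch_size, block_size = 12, 1024
tokens_per_iter = batch_size * block_size  # 12,288 tokens per step
for ms in (100, 135, 200):
    tps = tokens_per_iter / (ms / 1000)
    print(f"{ms} ms/iter -> {tps:,.0f} tokens/sec")
```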

Model FLOPs Utilization (MFU)

Percentage of theoretical peak FLOPS achieved:
  • Greater than 50%: Excellent utilization
  • 40-50%: Good utilization
  • Less than 40%: Room for optimization
MFU is calculated relative to A100 GPU peak performance (312 TFLOPS for bfloat16).
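nanoGPT's model.estimate_mfu uses the PaLM-style approximation of 6N + 12·L·H·Q·T FLOPs per token. A standalone sketch of that arithmetic with round numbers (assuming N ≈ 124M parameters and the A100 bf16 peak of 312 TFLOPS; the illustrative 75ms timing is hypothetical):

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 batch_size, dt, peak_flops=312e12):
    """Fraction of theoretical peak FLOPS achieved by one training iteration."""
    L, H, Q, T = n_layer, n_head, n_embd // n_head, block_size
    flops_per_token = 6 * n_params + 12 * L * H * Q * T  # PaLM-style estimate
    flops_per_iter = flops_per_token * T * batch_size
    return (flops_per_iter / dt) / peak_flops

# e.g. GPT-2 124M, batch 12, block 1024, at a hypothetical 75 ms/iteration:
mfu = estimate_mfu(124e6, 12, 12, 768, 1024, 12, 0.075)
print(f"MFU: {mfu * 100:.1f}%")  # -> MFU: 45.0%
```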

Common benchmarking scenarios

Compare compile on/off

# With compile
python bench.py --compile=True

# Without compile  
python bench.py --compile=False
Expected improvement: ~2x faster with compile enabled.

Test different precisions

# bfloat16 (fastest on A100/H100)
python bench.py --dtype=bfloat16

# float16 (fast on most GPUs)
python bench.py --dtype=float16

# float32 (baseline)
python bench.py --dtype=float32

Memory-limited benchmarking

Reduce batch size and block size for smaller GPUs:
python bench.py --batch_size=4 --block_size=512

Multi-GPU benchmarking

Test on different GPUs:
# GPU 0
python bench.py --device=cuda:0

# GPU 1
python bench.py --device=cuda:1

Benchmark best practices

  1. Run multiple times: Performance can vary between runs
  2. Warm GPU: First run may be slower due to GPU initialization
  3. Close other processes: Ensure GPU is not being used by other tasks
  4. Monitor temperature: GPU throttling can affect results
  5. Consistent settings: Use same batch_size/block_size for fair comparisons
For the most accurate results, run benchmarks multiple times and take the average of the middle runs (excluding the first warm-up run).
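One way to implement that "average of the middle runs" advice is a trimmed mean that drops the fastest and slowest run (the timings below are hypothetical, for illustration only):

```python
def trimmed_mean(times):
    """Average the measurements after dropping the min and max."""
    s = sorted(times)
    return sum(s[1:-1]) / len(s[1:-1])

runs = [152.1, 136.4, 135.9, 135.2, 137.0]  # ms/iter; first run pays warmup cost
print(f"{trimmed_mean(runs):.1f} ms")
```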
