The fastest way to get started with nanoGPT is to train a character-level GPT on the works of Shakespeare. This tutorial will walk you through the complete process.

Overview

You’ll learn how to:
  • Prepare the Shakespeare dataset for training
  • Train a small GPT model from scratch
  • Generate text samples from your trained model
  • Adjust hyperparameters for different hardware
This quickstart uses character-level modeling (not BPE tokens) for simplicity. The entire dataset is just 1MB and training takes only a few minutes.

Prepare the dataset

First, download and prepare the Shakespeare dataset:

Step 1: Run the data preparation script

The preparation script downloads the tiny Shakespeare dataset and converts it to binary format:
python data/shakespeare_char/prepare.py
This script:
  • Downloads the complete works of Shakespeare (~1MB text file)
  • Creates a character-to-integer mapping
  • Splits the data into training (90%) and validation (10%) sets
  • Saves train.bin and val.bin in data/shakespeare_char/
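The character-to-integer mapping can be sketched in a few lines. This is a simplified standalone version: the real prepare.py runs the same idea over the full ~1MB text, encodes with uint16, and also saves the mapping to meta.pkl alongside the binary files.

```python
# Sketch of the character-level encoding that prepare.py performs
# (simplified; the real script operates on the full Shakespeare text).
text = "To be, or not to be"
chars = sorted(set(text))                 # the vocabulary: unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> char

ids = [stoi[c] for c in text]                  # encode
decoded = ''.join(itos[i] for i in ids)        # decode
assert decoded == text
```

Because the vocabulary is just the set of characters that appear, the full Shakespeare text yields only 65 tokens, which is why this dataset trains so quickly.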

Step 2: Verify the output

You should see output similar to:
length of dataset in characters: 1,115,394
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
The dataset contains 65 unique characters including uppercase, lowercase, and punctuation.
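The reported counts are easy to verify: prepare.py takes the first 90% of the characters as training data (truncating with int) and the remainder as validation.

```python
# Quick arithmetic check on the reported split sizes: the 90/10 split of
# 1,115,394 characters should reproduce the train/val token counts above.
total = 1_115_394
train_tokens = 1_003_854
val_tokens = 111_540

assert train_tokens + val_tokens == total
assert int(total * 0.9) == train_tokens  # 90% train split (truncated)
```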

Train the model

Now you can train a GPT model on this data. The training approach depends on your hardware.
If you have a GPU, you can train a small but capable model using the provided config file:
python train.py config/train_shakespeare_char.py

Model architecture

The config file config/train_shakespeare_char.py defines a “baby GPT” with:
# Model hyperparameters
n_layer = 6        # 6 transformer layers
n_head = 6         # 6 attention heads
n_embd = 384       # 384-dimensional embeddings
dropout = 0.2      # 20% dropout for regularization

# Training hyperparameters
batch_size = 64
block_size = 256   # Context of up to 256 characters
max_iters = 5000   # 5000 training iterations
learning_rate = 1e-3
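As a rough sanity check on model size, the parameter count implied by this config can be estimated with the standard approximation of about 12·n_embd² weights per transformer layer, plus the token and position embeddings. This is an estimate, not an exact count from model.py (it ignores biases and layer norms):

```python
# Rough parameter-count estimate for the "baby GPT" config above.
# Approximation: 12 * n_embd^2 per layer (attention ~4x, MLP ~8x),
# plus token and position embedding tables.
n_layer, n_embd, vocab_size, block_size = 6, 384, 65, 256

per_layer = 12 * n_embd ** 2                      # weights per transformer block
embeddings = (vocab_size + block_size) * n_embd   # wte + wpe tables
total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")          # ~10.7M
```

So this "baby GPT" has on the order of ten million parameters, small enough to train in minutes on a single GPU.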

Training progress

On an A100 GPU, this takes about 3 minutes and achieves a validation loss around 1.47:
step 0: train loss 4.2253, val loss 4.2272
step 250: train loss 1.8324, val loss 1.9012
step 500: train loss 1.5892, val loss 1.6845
...
step 4750: train loss 1.3012, val loss 1.4721
step 5000: train loss 1.2945, val loss 1.4697
Model checkpoints are saved to out-shakespeare-char/ directory.
On Apple Silicon Macs, add --device=mps to use the GPU for 2-3x speedup over CPU.
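The step-0 loss above is itself a useful sanity check: an untrained model is close to uniform over the vocabulary, and the cross-entropy of a uniform distribution over 65 characters is -ln(1/65).

```python
# Sanity check on the step-0 loss: a model that is roughly uniform over
# the 65-character vocabulary should start near -ln(1/65).
import math

uniform_loss = -math.log(1 / 65)
print(f"{uniform_loss:.4f}")  # ~4.17, close to the observed 4.2253
```

If your step-0 loss is far from this value, something is likely wrong with the data or the model initialization.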

Generate text samples

Once training completes, generate Shakespeare-style text from your model:
python sample.py --out_dir=out-shakespeare-char
If you trained on CPU, add the --device=cpu flag:
python sample.py --out_dir=out-shakespeare-char --device=cpu

Example output

With the GPU-trained model (validation loss 1.47), you might see:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
Not bad for a character-level model trained in just 3 minutes!

Customize generation

You can control the generation process with additional parameters:
python sample.py \
  --out_dir=out-shakespeare-char \
  --start="ROMEO:" \
  --num_samples=5 \
  --max_new_tokens=200 \
  --temperature=0.8 \
  --top_k=200
  • start (string, default "\n"): the prompt to start generation; can also load a prompt from a file with FILE:prompt.txt
  • num_samples (integer, default 10): number of independent samples to generate
  • max_new_tokens (integer, default 500): maximum number of tokens to generate per sample
  • temperature (float, default 0.8): sampling temperature; lower values (0.6-0.8) are more conservative, higher values (1.0+) more creative
  • top_k (integer, default 200): only sample from the top k most likely tokens at each step
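To build intuition for how temperature and top_k interact, here is a simplified, self-contained sketch of the logits-to-probabilities step. This is not the actual sample.py code (which operates on PyTorch tensors); it just illustrates the two transformations:

```python
# Sketch of how temperature and top-k shape next-token sampling:
# scale logits by 1/temperature, mask everything outside the top k,
# then apply softmax.
import math

def sample_probs(logits, temperature=0.8, top_k=3):
    logits = [l / temperature for l in logits]        # temperature scaling
    cutoff = sorted(logits, reverse=True)[top_k - 1]  # k-th largest logit
    logits = [l if l >= cutoff else float('-inf') for l in logits]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]          # stable softmax
    z = sum(exps)
    return [e / z for e in exps]

probs = sample_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3)
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[3] == 0.0  # the token outside the top-3 is masked out
```

Lower temperature sharpens the distribution toward the most likely characters; top_k zeroes out the long tail entirely, which is why both settings reduce gibberish at the cost of variety.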

Understanding the training loop

Let’s examine what happens during training. The core training loop in train.py follows this pattern:
train.py
# Training loop (simplified)
X, Y = get_batch('train')  # fetch the first batch
while True:
    # Get learning rate for this iteration (cosine decay with warmup)
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Evaluate on train/val sets periodically
    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # Save checkpoint if validation improved
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # Forward pass with mixed precision
    with ctx:
        logits, loss = model(X, Y)
    X, Y = get_batch('train')  # fetch the next batch while the GPU works

    # Backward pass and optimizer step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)  # reset gradients for the next step

    iter_num += 1
    if iter_num > max_iters:
        break
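The get_lr() schedule referenced above combines a linear warmup with cosine decay down to a minimum rate. A self-contained sketch (the constants here are illustrative defaults, not the exact values train.py uses for this config):

```python
# Sketch of a cosine learning-rate schedule with linear warmup, as used
# by get_lr() in train.py (constants are illustrative, not exact).
import math

def get_lr(it, learning_rate=1e-3, min_lr=1e-4,
           warmup_iters=100, lr_decay_iters=5000):
    if it < warmup_iters:                  # linear warmup from 0
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                # floor after decay ends
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

assert get_lr(0) == 0.0                    # start of warmup
assert abs(get_lr(100) - 1e-3) < 1e-12     # peak rate after warmup
assert get_lr(6000) == 1e-4                # floored at min_lr
```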

Next steps

Finetune GPT-2

Learn how to finetune pretrained GPT-2 models on your own data for better results

Training configuration

Explore all available hyperparameters and configuration options

Character-level training

Deep dive into character-level model training and configuration

Distributed training

Scale up to multi-GPU training with PyTorch DDP

Common issues

Training is slow
  • Ensure you’re using a GPU with --device=cuda (or --device=mps on Mac)
  • Verify PyTorch 2.0+ is installed to enable torch.compile()
  • Check that compilation is enabled (don’t use --compile=False on GPU)

Out of memory
Reduce memory usage by lowering the batch size, context length, and model size:
python train.py config/train_shakespeare_char.py \
  --batch_size=32 \
  --block_size=128 \
  --n_layer=4 \
  --n_embd=256

Missing data files
  • Verify the data preparation completed successfully
  • Check that train.bin and val.bin exist in data/shakespeare_char/

Loss not decreasing
  • Try increasing the learning rate or reducing dropout
  • The model may need more training iterations

Poor sample quality
  • Try lowering the temperature: --temperature=0.7
  • Check that you’re loading the correct checkpoint with --out_dir
