The fastest way to get started with nanoGPT is to train a character-level GPT on the works of Shakespeare. This tutorial will walk you through the complete process.

Overview

You’ll learn how to:
  • Prepare the Shakespeare dataset for training
  • Train a small GPT model from scratch
  • Generate text samples from your trained model
  • Adjust hyperparameters for different hardware
This quickstart uses character-level modeling (not BPE tokens) for simplicity. The entire dataset is just 1MB and training takes only a few minutes.

Prepare the dataset

First, download and prepare the Shakespeare dataset:

Step 1: Run the data preparation script

The preparation script downloads the tiny Shakespeare dataset and converts it to binary format:
python data/shakespeare_char/prepare.py
This script:
  • Downloads the complete works of Shakespeare (~1MB text file)
  • Creates a character-to-integer mapping
  • Splits the data into training (90%) and validation (10%) sets
  • Saves train.bin and val.bin in data/shakespeare_char/
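The character-to-integer mapping can be sketched in a few lines. This is a simplified standalone version: the real prepare.py runs the same idea over the full ~1MB text, encodes with uint16, and also saves the mapping to meta.pkl alongside the binary files.

```python
# Sketch of the character-level encoding that prepare.py performs
# (simplified; the real script operates on the full Shakespeare text).
text = "To be, or not to be"
chars = sorted(set(text))                 # the vocabulary: unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> char

ids = [stoi[c] for c in text]                  # encode
decoded = ''.join(itos[i] for i in ids)        # decode
assert decoded == text
```

Because the vocabulary is just the set of characters that appear, the full Shakespeare text yields only 65 tokens, which is why this dataset trains so quickly.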

Step 2: Verify the output

You should see output similar to:
length of dataset in characters: 1,115,394
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
The dataset contains 65 unique characters including uppercase, lowercase, and punctuation.
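The reported counts are easy to verify: prepare.py takes the first 90% of the characters as training data (truncating with int) and the remainder as validation.

```python
# Quick arithmetic check on the reported split sizes: the 90/10 split of
# 1,115,394 characters should reproduce the train/val token counts above.
total = 1_115_394
train_tokens = 1_003_854
val_tokens = 111_540

assert train_tokens + val_tokens == total
assert int(total * 0.9) == train_tokens  # 90% train split (truncated)
```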

Train the model

Now you can train a GPT model on this data. The training approach depends on your hardware.
If you have a GPU, you can train a small but capable model using the provided config file:
python train.py config/train_shakespeare_char.py

Model architecture

The config file config/train_shakespeare_char.py defines a “baby GPT” with:
# Model hyperparameters
n_layer = 6        # 6 transformer layers
n_head = 6         # 6 attention heads
n_embd = 384       # 384-dimensional embeddings
dropout = 0.2      # 20% dropout for regularization

# Training hyperparameters
batch_size = 64
block_size = 256   # Context of up to 256 characters
max_iters = 5000   # 5000 training iterations
learning_rate = 1e-3
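As a rough sanity check on model size, the parameter count implied by this config can be estimated with the standard approximation of about 12·n_embd² weights per transformer layer, plus the token and position embeddings. This is an estimate, not an exact count from model.py (it ignores biases and layer norms):

```python
# Rough parameter-count estimate for the "baby GPT" config above.
# Approximation: 12 * n_embd^2 per layer (attention ~4x, MLP ~8x),
# plus token and position embedding tables.
n_layer, n_embd, vocab_size, block_size = 6, 384, 65, 256

per_layer = 12 * n_embd ** 2                      # weights per transformer block
embeddings = (vocab_size + block_size) * n_embd   # wte + wpe tables
total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")          # ~10.7M
```

So this "baby GPT" has on the order of ten million parameters, small enough to train in minutes on a single GPU.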

Training progress

On an A100 GPU, this takes about 3 minutes and achieves a validation loss around 1.47:
step 0: train loss 4.2253, val loss 4.2272
step 250: train loss 1.8324, val loss 1.9012
step 500: train loss 1.5892, val loss 1.6845
...
step 4750: train loss 1.3012, val loss 1.4721
step 5000: train loss 1.2945, val loss 1.4697
Model checkpoints are saved to out-shakespeare-char/ directory.
On Apple Silicon Macs, add --device=mps to use the GPU for 2-3x speedup over CPU.
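The step-0 loss above is itself a useful sanity check: an untrained model is close to uniform over the vocabulary, and the cross-entropy of a uniform distribution over 65 characters is -ln(1/65).

```python
# Sanity check on the step-0 loss: a model that is roughly uniform over
# the 65-character vocabulary should start near -ln(1/65).
import math

uniform_loss = -math.log(1 / 65)
print(f"{uniform_loss:.4f}")  # ~4.17, close to the observed 4.2253
```

If your step-0 loss is far from this value, something is likely wrong with the data or the model initialization.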

Generate text samples

Once training completes, generate Shakespeare-style text from your model:
python sample.py --out_dir=out-shakespeare-char
If you trained on CPU, add the --device=cpu flag:
python sample.py --out_dir=out-shakespeare-char --device=cpu

Example output

With the GPU-trained model (validation loss 1.47), you might see:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
Not bad for a character-level model trained in just 3 minutes!

Customize generation

You can control the generation process with additional parameters:
python sample.py \
  --out_dir=out-shakespeare-char \
  --start="ROMEO:" \
  --num_samples=5 \
  --max_new_tokens=200 \
  --temperature=0.8 \
  --top_k=200
  • start (string, default "\n"): the prompt to start generation; can also load a prompt from a file with FILE:prompt.txt
  • num_samples (integer, default 10): number of independent samples to generate
  • max_new_tokens (integer, default 500): maximum number of tokens to generate per sample
  • temperature (float, default 0.8): sampling temperature; lower values (0.6-0.8) are more conservative, higher values (1.0+) more creative
  • top_k (integer, default 200): only sample from the top k most likely tokens at each step
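To build intuition for how temperature and top_k interact, here is a simplified, self-contained sketch of the logits-to-probabilities step. This is not the actual sample.py code (which operates on PyTorch tensors); it just illustrates the two transformations:

```python
# Sketch of how temperature and top-k shape next-token sampling:
# scale logits by 1/temperature, mask everything outside the top k,
# then apply softmax.
import math

def sample_probs(logits, temperature=0.8, top_k=3):
    logits = [l / temperature for l in logits]        # temperature scaling
    cutoff = sorted(logits, reverse=True)[top_k - 1]  # k-th largest logit
    logits = [l if l >= cutoff else float('-inf') for l in logits]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]          # stable softmax
    z = sum(exps)
    return [e / z for e in exps]

probs = sample_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3)
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[3] == 0.0  # the token outside the top-3 is masked out
```

Lower temperature sharpens the distribution toward the most likely characters; top_k zeroes out the long tail entirely, which is why both settings reduce gibberish at the cost of variety.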

Understanding the training loop

Let’s examine what happens during training. The core training loop in train.py follows this pattern:
train.py
# Training loop (simplified)
X, Y = get_batch('train')  # fetch the first batch
while True:
    # Get learning rate for this iteration (cosine decay with warmup)
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Evaluate on train/val sets periodically
    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # Save checkpoint if validation improved
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # Forward pass with mixed precision
    with ctx:
        logits, loss = model(X, Y)
    X, Y = get_batch('train')  # fetch the next batch while the GPU works

    # Backward pass and optimizer step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)  # reset gradients for the next step

    iter_num += 1
    if iter_num > max_iters:
        break
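The get_lr() schedule referenced above combines a linear warmup with cosine decay down to a minimum rate. A self-contained sketch (the constants here are illustrative defaults, not the exact values train.py uses for this config):

```python
# Sketch of a cosine learning-rate schedule with linear warmup, as used
# by get_lr() in train.py (constants are illustrative, not exact).
import math

def get_lr(it, learning_rate=1e-3, min_lr=1e-4,
           warmup_iters=100, lr_decay_iters=5000):
    if it < warmup_iters:                  # linear warmup from 0
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                # floor after decay ends
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

assert get_lr(0) == 0.0                    # start of warmup
assert abs(get_lr(100) - 1e-3) < 1e-12     # peak rate after warmup
assert get_lr(6000) == 1e-4                # floored at min_lr
```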

Next steps

Finetune GPT-2

Learn how to finetune pretrained GPT-2 models on your own data for better results

Training configuration

Explore all available hyperparameters and configuration options

Character-level training

Deep dive into character-level model training and configuration

Distributed training

Scale up to multi-GPU training with PyTorch DDP

Common issues

Training is slow
  • Ensure you’re using a GPU with --device=cuda (or --device=mps on Mac)
  • Verify PyTorch 2.0+ is installed to enable torch.compile()
  • Check that compilation is enabled (don’t use --compile=False on GPU)

Out of memory
Reduce memory usage by lowering the batch size, context length, and model size:
python train.py config/train_shakespeare_char.py \
  --batch_size=32 \
  --block_size=128 \
  --n_layer=4 \
  --n_embd=256

Missing data files
  • Verify the data preparation completed successfully
  • Check that train.bin and val.bin exist in data/shakespeare_char/

Loss not decreasing
  • Try increasing the learning rate or reducing dropout
  • The model may need more training iterations

Poor sample quality
  • Try lowering the temperature: --temperature=0.7
  • Check that you’re loading the correct checkpoint with --out_dir
