The simplest, fastest repository for training and finetuning medium-sized GPTs. nanoGPT prioritizes simplicity and speed while maintaining the ability to reproduce GPT-2 results.

What is nanoGPT?

nanoGPT is a minimal implementation of GPT (Generative Pre-trained Transformer) that you can use to:

  • Train from scratch: Build and train your own GPT models on custom datasets
  • Finetune pretrained models: Adapt GPT-2 checkpoints to your specific use case
  • Reproduce GPT-2: Recreate OpenAI’s GPT-2 (124M) on OpenWebText
  • Experiment quickly: Hackable codebase with just ~600 lines of core code

Architecture overview

nanoGPT consists of two main files that contain all the essential functionality:

model.py (~300 lines)

Defines the GPT model architecture with these key components:
  • Transformer blocks: Causal self-attention and feed-forward layers
  • GPTConfig: Configuration dataclass for model hyperparameters
  • GPT class: Main model with training and generation methods
model.py
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
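As a sketch, the config above can be instantiated with overrides for a smaller model; the values below are illustrative, roughly the scale of a quick character-level experiment:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

# A smaller config for quick experiments; any field not passed
# keeps its default (e.g. vocab_size stays 50304).
tiny = GPTConfig(block_size=256, n_layer=6, n_head=6, n_embd=384, dropout=0.2)
```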

train.py (~300 lines)

A clean training loop with:
  • Distributed training: PyTorch DDP support for multi-GPU setups
  • Mixed precision: Automatic FP16/BF16 training
  • Configuration system: Override defaults via command line or config files
  • Checkpoint management: Automatic saving and resuming
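The command-line override pattern can be sketched as follows. This is a minimal illustration, not nanoGPT's actual configurator (which executes config files as Python); the default values here are hypothetical:

```python
from ast import literal_eval

# Hypothetical defaults, standing in for the globals defined in train.py.
defaults = {"batch_size": 12, "learning_rate": 6e-4, "max_iters": 600000}

def apply_overrides(config, argv):
    """Apply --key=value command-line overrides to a config dict."""
    for arg in argv:
        if not arg.startswith("--"):
            continue
        key, _, raw = arg[2:].partition("=")
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)  # parse numbers, bools, tuples, ...
        except (ValueError, SyntaxError):
            value = raw  # fall back to treating the value as a plain string
        config[key] = value
    return config

cfg = apply_overrides(dict(defaults), ["--batch_size=32", "--learning_rate=1e-3"])
```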

Key features

  • Minimal codebase: The core training loop (train.py) and model definition (model.py) are each around 300 lines, making the code easy to understand and modify.
  • Fast training: Uses torch.compile() for significant speed improvements, and Flash Attention is automatically enabled when available for efficient self-attention computation.
  • Flexible configuration: Override any training parameter via command-line arguments or configuration files; config files in the config/ directory provide tested hyperparameters for different use cases.
  • GPT-2 compatibility: Can load OpenAI’s GPT-2 checkpoints (124M, 350M, 774M, 1558M parameters) for finetuning or evaluation.
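For reference, the four checkpoint names correspond to the standard GPT-2 architecture hyperparameters, which determine the GPTConfig used when a pretrained model is loaded:

```python
# Architecture hyperparameters of the OpenAI GPT-2 checkpoints
# that nanoGPT can load for finetuning or evaluation.
GPT2_CONFIGS = {
    "gpt2":        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    "gpt2-medium": dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    "gpt2-large":  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    "gpt2-xl":     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}
```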

Performance

On a single 8xA100 40GB node, nanoGPT can reproduce GPT-2 (124M parameters) on OpenWebText in approximately 4 days, achieving a validation loss of ~2.85. For quick experimentation, you can train a character-level model on Shakespeare in just 3 minutes on a single GPU.

Use cases

nanoGPT is designed for:
  • Researchers who want to quickly experiment with GPT architectures
  • Students learning about transformer language models
  • Developers needing to finetune GPT-2 on domain-specific text
  • Anyone wanting a hackable, understandable GPT implementation
nanoGPT prioritizes simplicity over features. If you need production-ready inference, extensive model variants, or comprehensive tooling, consider using libraries like Hugging Face Transformers.

Next steps

  • Installation: Set up your environment and install dependencies
  • Quickstart: Train your first GPT model in minutes
