Overview
GameGPT is a causal transformer adapted from Andrej Karpathy’s microgpt — a complete GPT implementation in pure Python with custom autograd. No dependencies. The entire model is built from scratch in model.py:
- Value-based autograd (lines 8-51)
- Matrix operations and activation functions (lines 54-68)
- Multi-head causal attention (lines 69-110)
- Adam optimizer (lines 157-183)
- Sampling (lines 186-199)
From the README: “Yes, I waited 36 hours while this thing farted itself into existence on my CPU.”
Architecture Specifications
Comparison to the original microgpt:

| Parameter | microgpt | GameGPT |
|---|---|---|
| Vocabulary | ~27 (characters) | 74 (game events) |
| Embedding dim | 16 | 32 |
| Layers | 1 | 2 |
| Context window | 16 | 64 |
| Heads | 4 | 4 |
| Parameters | ~1.2K | ~31K |
model.py:115-121:
Parameter Breakdown
vocab_size=74
74 tokens from vocab.py covering all Snake gameplay tokens (structural, entity, direction, position, event types, values, lengths).
n_layer=2
Two transformer blocks (vs. microgpt’s 1). Each block has:
- Multi-head causal attention
- RMSNorm
- ReLU MLP with 4x expansion
- Residual connections
n_embd=32
Embedding dimension. Each token is represented as a 32-dimensional vector. Doubled from microgpt’s 16.
block_size=64
Context window size. The model can attend to up to 64 previous tokens, 4x larger than microgpt’s 16.
From the README: “Structural validity is low because the model often hits the 64-token context limit mid-sequence without generating EOS.”
n_head=4
Number of attention heads, same as microgpt. With n_embd=32, each head has dimension 32 / 4 = 8.
~31K parameters
Total trainable parameters across:
- Token embeddings: 74 * 32 = 2,368
- Position embeddings: 64 * 32 = 2,048
- 2 transformer layers: ~24K
- LM head: 74 * 32 = 2,368
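These counts can be sanity-checked with quick arithmetic. The per-layer breakdown below (no bias terms, two RMSNorm gain vectors per block, Q/K/V plus an output projection, 4x MLP) is inferred from the architecture description rather than read from model.py:

```python
# Back-of-envelope parameter count (layer internals are assumptions,
# not read from model.py: no biases, two RMSNorms per block).
vocab_size, n_embd, block_size, n_layer = 74, 32, 64, 2

tok_emb = vocab_size * n_embd            # 74 * 32 = 2,368
pos_emb = block_size * n_embd            # 64 * 32 = 2,048
attn = 4 * n_embd * n_embd               # Q, K, V, and output projections
mlp = 2 * n_embd * (4 * n_embd)          # 4x expansion, then back down
norms = 2 * n_embd                       # two RMSNorm gain vectors per block
per_layer = attn + mlp + norms           # 12,352 per transformer block
lm_head = vocab_size * n_embd            # 2,368

total = tok_emb + pos_emb + n_layer * per_layer + lm_head
print(total)                             # 31488, i.e. ~31K under these assumptions
```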
In weights.txt, the trained model is 372KB of plain text (8 decimal places per float).
Custom Autograd Implementation
The model uses a custom Value class for automatic differentiation: model.py:8-15.
Operations
Basic math ops return new Value nodes with gradient functions: model.py:17-28. Each operation stores its inputs (_children) and local gradients (_local_grads).
Backpropagation
Topological sort + reverse accumulation: model.py:37-50. This implements reverse-mode autodiff by hand.
No PyTorch, no TensorFlow. Every gradient is computed manually. This makes the codebase fully transparent and educational.
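In the spirit of that approach, a minimal sketch of a scalar Value class with topological sort and reverse accumulation (illustrative only; the repo’s exact code at model.py:8-51 may differ in detail):

```python
class Value:
    """Minimal scalar autograd node (illustrative sketch, not the repo's exact code)."""
    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = _children          # inputs this node was computed from
        self._local_grads = _local_grads    # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological sort, then reverse-mode gradient accumulation.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
loss = a * b + a       # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
print(a.grad, b.grad)  # 4.0 2.0
```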
Model Architecture
Token + Position Embeddings
model.py:70-73. Each token gets:
- A learned embedding based on its vocabulary index
- A learned positional embedding based on its sequence position
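Combining the two is a table lookup plus an elementwise add. A sketch with hypothetical wte/wpe table names (not necessarily those in model.py):

```python
import random
random.seed(0)

vocab_size, block_size, n_embd = 74, 64, 32
# Hypothetical embedding tables, randomly initialized for illustration.
wte = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(block_size)]

def embed(token_id, pos):
    # Input to the first transformer block: token embedding + position embedding.
    return [t + p for t, p in zip(wte[token_id], wpe[pos])]

x = embed(token_id=5, pos=0)
print(len(x))  # 32
```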
RMSNorm
model.py:64-67. Root mean square normalization, cheaper than LayerNorm (no mean centering).
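A minimal RMSNorm sketch (the epsilon value and its placement inside the square root are assumptions, not read from model.py):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm: divide by the root-mean-square of x and apply a learned gain.
    # No mean subtraction, which is what makes it cheaper than LayerNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

print(rms_norm([1.0, -2.0, 3.0, -4.0], [1.0] * 4))
```

With unit gains, the output has mean square ≈ 1 regardless of the input scale.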
Transformer Layer
Each layer (from model.py:75-106):
Multi-Head Causal Attention
model.py:84-98. For each head:
- Slice Q/K/V to get head-specific dimensions
- Compute attention scores (Q·K / √d)
- Softmax to get weights
- Weighted sum of values
The keys and values lists grow as the model generates tokens.
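Those four steps, for a single new token attending over the cached positions, might look like this sketch (plain Python; function and variable names are illustrative, not from model.py):

```python
import math, random
random.seed(0)

n_embd, n_head = 32, 4
head_dim = n_embd // n_head  # 32 / 4 = 8

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_one_token(q, keys, values):
    """One generation step of causal attention (illustrative sketch)."""
    out = []
    for h in range(n_head):
        lo, hi = h * head_dim, (h + 1) * head_dim
        qh = q[lo:hi]                                       # slice this head's dims
        scores = [sum(a * b for a, b in zip(qh, k[lo:hi])) / math.sqrt(head_dim)
                  for k in keys]                            # Q.K / sqrt(d)
        weights = softmax(scores)                           # attention weights
        out += [sum(w * v[lo + i] for w, v in zip(weights, values))
                for i in range(head_dim)]                   # weighted sum of values
    return out

# The keys/values lists grow by one entry per generated token.
keys = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(3)]
values = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(3)]
q = [random.gauss(0, 1) for _ in range(n_embd)]
print(len(attend_one_token(q, keys, values)))  # 32
```

Causality falls out of the cache structure: a token can only attend to keys that were appended before it.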
Output Head
model.py:108-109. Final linear layer projects to vocabulary size (74 logits).
Training
The training loop uses cross-entropy loss and the Adam optimizer: model.py:157-184.
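The Adam update itself follows the standard recipe. A sketch on plain floats (hyperparameters are Adam’s common defaults, assumed rather than read from model.py):

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: exponential moving averages of the gradient (m) and
    # squared gradient (v), bias correction, then the parameter step.
    # (Standard Adam; hyperparameters are assumptions, not from model.py.)
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

params, m, v = [0.5, -0.5], [0.0, 0.0], [0.0, 0.0]
params = adam_step(params, grads=[0.2, -0.1], m=m, v=v, t=1)
print(params)  # each param moves by ~lr against its gradient's sign
```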
Loss Computation
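A minimal sketch of next-token cross-entropy (the standard formulation, assumed to match the repo’s; it is consistent with the ln(74) random baseline quoted under Training Results):

```python
import math

def cross_entropy(logits, target):
    # Next-token cross-entropy: -log softmax(logits)[target],
    # computed via the numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Uniform logits over 74 tokens give the random baseline ln(74).
print(cross_entropy([0.0] * 74, 0))  # ~4.304
```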
Training Results
From the README:
- Loss: 4.47 → 0.25 (random baseline: ln(74) ≈ 4.3)
- Physical validity: 95% — moves are adjacent cells, positions in bounds
- Rule validity: 100% — EAT→GROW+FOOD_SPAWN, DIE→EOS

The model learned the game rules from event sequences alone, with no explicit supervision.
Training: 200 episodes, 5000 steps, ~36 hours on CPU. From train.py (not shown), the training script loads episodes from episodes.json and calls train_step() in a loop.
Sampling
Generation uses temperature-controlled sampling: model.py:186-199.
Temperature
- temperature < 1.0: Sharper distribution, more deterministic
- temperature = 1.0: Use raw probabilities
- temperature > 1.0: Flatter distribution, more random
In sample.py, the default is temperature=0.5.
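The mechanism is simple: divide the logits by the temperature before the softmax, then sample from the resulting distribution. A sketch (the 0.5 default is from the source; the rest is standard practice, not model.py’s exact code):

```python
import math, random

def sample_token(logits, temperature=0.5, rng=None):
    # T < 1 sharpens the distribution, T > 1 flattens it.
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the categorical distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

print(sample_token([2.0, 1.0, 0.5]))
```

As the temperature approaches 0, sampling degenerates into argmax.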
KV Cache
The keys and values lists accumulate across generation:
model.py:154-155. Each time a token is processed, its key and value are appended. This implements efficient causal attention without recomputing past positions.
Weight Persistence
Weights are saved as plain text: model.py:201-207.
Format: parameter_name|row_index|space_separated_floats
Example:
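Given the parameter_name|row_index|space_separated_floats format, a parser is a few lines. The parameter name `wte` below is illustrative, not necessarily one that appears in weights.txt:

```python
def parse_weights(lines):
    # Parse "parameter_name|row_index|space_separated_floats" lines
    # into {name: {row: [floats]}}.
    weights = {}
    for line in lines:
        name, row, floats = line.rstrip("\n").split("|")
        weights.setdefault(name, {})[int(row)] = [float(x) for x in floats.split()]
    return weights

# Hypothetical example line in the stated format.
demo = ["wte|0|0.01234567 -0.00891011"]
print(parse_weights(demo))  # {'wte': {0: [0.01234567, -0.00891011]}}
```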
No binary formats, no pickle. The entire model is human-inspectable text. See weights.txt in the repo (372KB).
Comparison to Modern Transformers
| Feature | GameGPT | GPT-2 / GPT-3 |
|---|---|---|
| Embeddings | Learned | Learned |
| Attention | Multi-head causal | Multi-head causal |
| Normalization | RMSNorm (pre-norm) | LayerNorm (post-norm in GPT-2, pre-norm in GPT-3) |
| MLP activation | ReLU | GELU |
| Positional encoding | Learned | Learned |
| Optimizer | Adam | AdamW |
| Implementation | Pure Python | PyTorch/JAX |
Scaling Plan
From the README roadmap, future scaling directions:
- Bigger models: More layers, larger embedding dim, more heads
- Longer context: Increase block_size beyond 64
- More games: Pac-Man (multi-entity), Survivor (massive scale), Chess (turn-based)
- Ouroboros: Feed model predictions back into the game
Next Steps
- Tokenization: Understand how events become tokens
- Theory: Review the Wittgensteinian foundation
