Overview

GameGPT is a causal transformer adapted from Andrej Karpathy’s microgpt — a complete GPT implementation in pure Python with custom autograd. No dependencies. The entire model is built from scratch in model.py:
  • Value-based autograd (lines 8-51)
  • Matrix operations and activation functions (lines 54-68)
  • Multi-head causal attention (lines 69-110)
  • Adam optimizer (lines 157-183)
  • Sampling (lines 186-199)
From the README: “Yes, I waited 36 hours while this thing farted itself into existence on my CPU.”

Architecture Specifications

Comparison to the original microgpt:
Parameter        microgpt          GameGPT
Vocabulary       ~27 (characters)  74 (game events)
Embedding dim    16                32
Layers           1                 2
Context window   16                64
Heads            4                 4
Parameters       ~1.2K             ~31K
From README comparison table and model.py:115-121:
class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, 
                 block_size=64, n_head=4, seed=42):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_embd = n_embd
        self.block_size = block_size
        self.n_head = n_head
        self.head_dim = n_embd // n_head

Parameter Breakdown

Vocabulary size. 74 tokens from vocab.py, covering every Snake gameplay category (structural, entity, direction, position, event type, value, and length tokens).
Number of layers. Two transformer blocks (vs. microgpt’s one). Each block has:
  • Multi-head causal attention
  • RMSNorm
  • ReLU MLP with 4x expansion
  • Residual connections
Embedding dimension. Each token is represented as a 32-dimensional vector. Doubled from microgpt’s 16.
Context window size. The model can attend to up to 64 previous tokens, 4x larger than microgpt’s 16. From the README: “Structural validity is low because the model often hits the 64-token context limit mid-sequence without generating EOS.”
Number of attention heads. Same as microgpt. With n_embd=32, each head has dimension 32 / 4 = 8.
Total trainable parameters across:
  • Token embeddings: 74 * 32 = 2,368
  • Position embeddings: 64 * 32 = 2,048
  • 2 transformer layers: ~24K
  • LM head: 74 * 32 = 2,368
From weights.txt, the trained model is 372KB as plain text (8 decimal places per float).
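The breakdown above can be sanity-checked with quick arithmetic. This is a sketch using the default config values; the matrix shapes are inferred from the architecture described on this page:

```python
# Parameter count for the default GameGPT config.
vocab_size, n_embd, n_layer, block_size = 74, 32, 2, 64
mlp_hidden = 4 * n_embd  # 4x MLP expansion

tok_emb = vocab_size * n_embd                    # 2,368
pos_emb = block_size * n_embd                    # 2,048
attn = 4 * n_embd * n_embd                       # wq, wk, wv, wo
mlp = n_embd * mlp_hidden + mlp_hidden * n_embd  # fc1, fc2
lm_head = vocab_size * n_embd                    # 2,368

total = tok_emb + pos_emb + n_layer * (attn + mlp) + lm_head
print(total)  # 31360, i.e. the ~31K figure
```

The two transformer layers account for 24,576 of the 31,360 parameters, matching the ~24K figure above.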

Custom Autograd Implementation

The model uses a custom Value class for automatic differentiation:
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')
    
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads
From model.py:8-15.

Operations

Basic math ops return new Value nodes with gradient functions:
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def relu(self):
    return Value(max(0, self.data), (self,), (float(self.data > 0),))

def exp(self):
    return Value(math.exp(self.data), (self,), (math.exp(self.data),))

def log(self):
    return Value(math.log(self.data), (self,), (1/self.data,))
From model.py:17-28. Each operation stores its inputs (_children) and local gradients (_local_grads).

Backpropagation

Topological sort + reverse accumulation:
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
From model.py:37-50. This implements reverse-mode autodiff by hand.
No PyTorch, no TensorFlow. Every gradient is computed manually. This makes the codebase fully transparent and educational.
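Putting the pieces together: a condensed, self-contained copy of the Value class (add, mul, and backward only) is enough to check the chain rule by hand:

```python
class Value:
    """Condensed version of the Value class above: add, mul, backward only."""
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological sort, then reverse-order gradient accumulation.
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

x, y = Value(3.0), Value(2.0)
z = x * y + x   # dz/dx = y + 1 = 3, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 3.0 3.0
```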

Model Architecture

Token + Position Embeddings

tok_emb = state_dict['wte'][token_id]  # (n_embd,)
pos_emb = state_dict['wpe'][pos_id]    # (n_embd,)
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)
From model.py:70-73. Each token gets:
  • A learned embedding based on its vocabulary index
  • A learned positional embedding based on its sequence position
RMSNorm is applied immediately.

RMSNorm

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
From model.py:64-67. Root mean square normalization, cheaper than LayerNorm (no mean centering).
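A quick check with plain floats (made-up values, no Value nodes) confirms the output has unit RMS, up to the 1e-5 epsilon:

```python
def rmsnorm(x, eps=1e-5):
    # Same math as the rmsnorm above, on plain floats.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

x = [3.0, -4.0]       # mean square = (9 + 16) / 2 = 12.5
y = rmsnorm(x)
rms = (sum(v * v for v in y) / len(y)) ** 0.5
print(round(rms, 4))  # 1.0
```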

Transformer Layer

Each layer (from model.py:75-106):
for li in range(n_layer):
    x_residual = x
    x = rmsnorm(x)
    
    # Multi-head attention
    q = linear(x, state_dict[f'layer{li}.attn_wq'])
    k = linear(x, state_dict[f'layer{li}.attn_wk'])
    v = linear(x, state_dict[f'layer{li}.attn_wv'])
    keys[li].append(k)
    values[li].append(v)
    
    x_attn = []
    for h in range(n_head):
        # Per-head attention (see below)
        ...
    
    x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
    x = [a + b for a, b in zip(x, x_residual)]  # Residual
    
    # MLP
    x_residual = x
    x = rmsnorm(x)
    x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
    x = [xi.relu() for xi in x]
    x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
    x = [a + b for a, b in zip(x, x_residual)]  # Residual
The per-layer structure is two sublayers, each with pre-norm and a residual connection:
  1. RMSNorm + Attention: normalize, compute Q/K/V, run multi-head attention, add residual
  2. RMSNorm + MLP: normalize, expand 4x with ReLU, project back, add residual

Multi-Head Causal Attention

for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    
    attn_logits = [
        sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
        for t in range(len(k_h))
    ]
    attn_weights = softmax(attn_logits)
    
    head_out = [
        sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
        for j in range(head_dim)
    ]
    x_attn.extend(head_out)
From model.py:84-98. For each head:
  1. Slice Q/K/V to get head-specific dimensions
  2. Compute attention scores (Q·K / √d)
  3. Softmax to get weights
  4. Weighted sum of values
Causality is enforced by only storing keys/values for positions ≤ current position. The keys and values lists grow as the model generates tokens.
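The per-head loop can be exercised with plain floats (made-up numbers, no Value nodes) to see how the growing KV lists and softmax normalization interact:

```python
import math

def softmax(logits):
    # Numerically stable softmax on plain floats.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

head_dim = 4
# KV cache so far: this head's key/value slices for positions 0..2.
k_h = [[0.1] * head_dim, [0.5] * head_dim, [0.9] * head_dim]
v_h = [[1.0] * head_dim, [2.0] * head_dim, [3.0] * head_dim]
q_h = [1.0] * head_dim  # query slice for the current position

attn_logits = [
    sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim ** 0.5
    for t in range(len(k_h))
]
attn_weights = softmax(attn_logits)
head_out = [
    sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
    for j in range(head_dim)
]
# The weights form a distribution, and the key most aligned with q wins.
print(round(sum(attn_weights), 6), attn_weights.index(max(attn_weights)))
```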

Output Head

logits = linear(x, state_dict['lm_head'])
return logits
From model.py:108-109. Final linear layer projects to vocabulary size (74 logits).

Training

The training loop uses cross-entropy loss and Adam optimizer:
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    n = min(self.block_size, len(tokens) - 1)
    if n <= 0:
        return 0.0
    
    keys, values = self.fresh_kv()
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    
    loss = (1 / n) * sum(losses)
    loss.backward()
    
    # Adam update
    self.step_count += 1
    lr_t = lr * (1 - self.step_count / max(self.step_count + 1, 5000))
    for i, p in enumerate(self.params):
        self.m[i] = beta1 * self.m[i] + (1 - beta1) * p.grad
        self.v[i] = beta2 * self.v[i] + (1 - beta2) * p.grad ** 2
        m_hat = self.m[i] / (1 - beta1 ** self.step_count)
        v_hat = self.v[i] / (1 - beta2 ** self.step_count)
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0
    
    return loss.data
From model.py:157-184.

Loss Computation

Each training step proceeds in four stages:
  1. Forward pass: run the model on tokens[0:n] to predict tokens[1:n+1]
  2. Cross-entropy loss: -log(prob[target]) at each position, averaged over positions
  3. Backprop: call loss.backward() to compute gradients
  4. Adam update: apply momentum and adaptive learning rates to every parameter
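The Adam update can be traced by hand for a single scalar parameter (illustrative values; note how the bias correction makes m_hat equal the raw gradient on step 1):

```python
# Toy Adam update on one scalar parameter, plain floats.
beta1, beta2, eps, lr = 0.85, 0.99, 1e-8, 0.01
p, grad = 0.5, 2.0
m = v = 0.0
t = 1  # first step

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)  # step 1: recovers the raw gradient, 2.0
v_hat = v / (1 - beta2 ** t)  # step 1: recovers grad**2, 4.0
p -= lr * m_hat / (v_hat ** 0.5 + eps)
print(round(p, 6))  # 0.49: the step size is ~lr, independent of gradient scale
```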

Training Results

From the README:
Loss: 4.47 → 0.25 (random baseline: ln(74) ≈ 4.3)
Physical validity: 95% — moves are adjacent cells, positions in bounds
Rule validity: 100% — EAT→GROW+FOOD_SPAWN, DIE→EOS
The model learned the game rules from event sequences alone, with no explicit supervision.
Training: 200 episodes, 5000 steps, ~36 hours on CPU. From train.py (not shown), the training script loads episodes from episodes.json and calls train_step() in a loop.

Sampling

Generation uses temperature-controlled sampling:
def sample(self, bos_id, eos_id, temperature=0.5, max_len=None):
    if max_len is None:
        max_len = self.block_size
    keys, values = self.fresh_kv()
    token_id = bos_id
    result = [bos_id]
    for pos_id in range(max_len - 1):
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(
            range(self.vocab_size), 
            weights=[p.data for p in probs]
        )[0]
        result.append(token_id)
        if token_id == eos_id:
            break
    return result
From model.py:186-199.

Temperature

  • temperature < 1.0: Sharper distribution, more deterministic
  • temperature = 1.0: Use raw probabilities
  • temperature > 1.0: Flatter distribution, more random
From sample.py, the default is temperature=0.5.
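The effect is easy to see by running the same logits through softmax at different temperatures (a standalone sketch with made-up logits):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
results = {}
for temp in (0.5, 1.0, 2.0):
    results[temp] = softmax([l / temp for l in logits])

# Lower temperature concentrates probability mass on the top logit.
for temp, probs in results.items():
    print(temp, [round(p, 3) for p in probs])
```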

KV Cache

The keys and values lists accumulate across generation:
def fresh_kv(self):
    return [[] for _ in range(self.n_layer)], [[] for _ in range(self.n_layer)]
From model.py:154-155. Each time a token is processed, its key and value are appended. This implements efficient causal attention without recomputing past positions.

Weight Persistence

Weights are saved as plain text:
def save_weights(self, path):
    """Save model weights as plain text."""
    with open(path, 'w') as f:
        for name, mat in self.state_dict.items():
            for r, row in enumerate(mat):
                vals = ' '.join(f'{p.data:.8f}' for p in row)
                f.write(f'{name}|{r}|{vals}\n')
From model.py:201-207. Format: parameter_name|row_index|space_separated_floats Example:
wte|0|0.05123456 -0.02345678 0.01234567 ...
wte|1|-0.01234567 0.03456789 -0.00123456 ...
No binary formats, no pickle. The entire model is human-inspectable text. See weights.txt in the repo (372KB).
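A matching reader is straightforward. This is a sketch of a loader for the format above (load_weights is a hypothetical helper, not part of model.py):

```python
def load_weights(path):
    """Parse the name|row|floats text format back into
    {parameter_name: [[float, ...], ...]}. Hypothetical helper."""
    state = {}
    with open(path) as f:
        for line in f:
            name, r, vals = line.rstrip('\n').split('|')
            row = [float(v) for v in vals.split()]
            state.setdefault(name, [])
            assert int(r) == len(state[name])  # rows were written in order
            state[name].append(row)
    return state
```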

Comparison to Modern Transformers

Feature              GameGPT             GPT-2 / GPT-3
Embeddings           Learned             Learned
Attention            Multi-head causal   Multi-head causal
Normalization        RMSNorm (pre-norm)  LayerNorm (pre-norm)
MLP activation       ReLU                GELU
Positional encoding  Learned             Learned
Optimizer            Adam                AdamW
Implementation       Pure Python         PyTorch/JAX
GameGPT uses simpler choices (ReLU, RMSNorm) for implementation simplicity and speed on CPU.

Scaling Plan

From the README roadmap, future scaling directions:
  • Bigger models: More layers, larger embedding dim, more heads
  • Longer context: Increase block_size beyond 64
  • More games: Pac-Man (multi-entity), Survivor (massive scale), Chess (turn-based)
  • Ouroboros: Feed model predictions back into the game
Current bottleneck: “Structural validity is 45% because the model often hits the 64-token context limit mid-sequence.” A longer context window would improve completion rates.

Next Steps

  • Tokenization: understand how events become tokens
  • Theory: review the Wittgensteinian foundation
