Overview

GameGPT is a causal transformer adapted from Andrej Karpathy’s microgpt — a complete GPT implementation in pure Python with custom autograd. No dependencies. The entire model is built from scratch in model.py:
  • Value-based autograd (lines 8-51)
  • Matrix operations and activation functions (lines 54-68)
  • Multi-head causal attention (lines 69-110)
  • Adam optimizer (lines 157-183)
  • Sampling (lines 186-199)
From the README: “Yes, I waited 36 hours while this thing farted itself into existence on my CPU.”

Architecture Specifications

Comparison to the original microgpt:
Parameter        microgpt          GameGPT
Vocabulary       ~27 (characters)  74 (game events)
Embedding dim    16                32
Layers           1                 2
Context window   16                64
Heads            4                 4
Parameters       ~1.2K             ~31K
From README comparison table and model.py:115-121:
class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, 
                 block_size=64, n_head=4, seed=42):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_embd = n_embd
        self.block_size = block_size
        self.n_head = n_head
        self.head_dim = n_embd // n_head

Parameter Breakdown

Vocabulary size. 74 tokens from vocab.py, covering every Snake gameplay category (structural, entity, direction, position, event type, value, and length tokens).
Number of layers. Two transformer blocks (vs. microgpt’s one). Each block has:
  • Multi-head causal attention
  • RMSNorm
  • ReLU MLP with 4x expansion
  • Residual connections
Embedding dimension. Each token is represented as a 32-dimensional vector. Doubled from microgpt’s 16.
Context window size. The model can attend to up to 64 previous tokens, 4x larger than microgpt’s 16. From the README: “Structural validity is low because the model often hits the 64-token context limit mid-sequence without generating EOS.”
Number of attention heads. Same as microgpt. With n_embd=32, each head has dimension 32 / 4 = 8.
Total trainable parameters across:
  • Token embeddings: 74 * 32 = 2,368
  • Position embeddings: 64 * 32 = 2,048
  • 2 transformer layers: ~24K
  • LM head: 74 * 32 = 2,368
From weights.txt, the trained model is 372KB as plain text (8 decimal places per float).
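The breakdown above can be sanity-checked with quick arithmetic. This is a sketch using the default config values; the matrix shapes are inferred from the architecture described on this page:

```python
# Parameter count for the default GameGPT config.
vocab_size, n_embd, n_layer, block_size = 74, 32, 2, 64
mlp_hidden = 4 * n_embd  # 4x MLP expansion

tok_emb = vocab_size * n_embd                    # 2,368
pos_emb = block_size * n_embd                    # 2,048
attn = 4 * n_embd * n_embd                       # wq, wk, wv, wo
mlp = n_embd * mlp_hidden + mlp_hidden * n_embd  # fc1, fc2
lm_head = vocab_size * n_embd                    # 2,368

total = tok_emb + pos_emb + n_layer * (attn + mlp) + lm_head
print(total)  # 31360, i.e. the ~31K figure
```

The two transformer layers account for 24,576 of the 31,360 parameters, matching the ~24K figure above.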

Custom Autograd Implementation

The model uses a custom Value class for automatic differentiation:
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')
    
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads
From model.py:8-15.

Operations

Basic math ops return new Value nodes with gradient functions:
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def relu(self):
    return Value(max(0, self.data), (self,), (float(self.data > 0),))

def exp(self):
    return Value(math.exp(self.data), (self,), (math.exp(self.data),))

def log(self):
    return Value(math.log(self.data), (self,), (1/self.data,))
From model.py:17-28. Each operation stores its inputs (_children) and local gradients (_local_grads).

Backpropagation

Topological sort + reverse accumulation:
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
From model.py:37-50. This implements reverse-mode autodiff by hand.
No PyTorch, no TensorFlow. Every gradient is computed manually. This makes the codebase fully transparent and educational.
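Putting the pieces together: a condensed, self-contained copy of the Value class (add, mul, and backward only) is enough to check the chain rule by hand:

```python
class Value:
    """Condensed version of the Value class above: add, mul, backward only."""
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological sort, then reverse-order gradient accumulation.
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

x, y = Value(3.0), Value(2.0)
z = x * y + x   # dz/dx = y + 1 = 3, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 3.0 3.0
```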

Model Architecture

Token + Position Embeddings

tok_emb = state_dict['wte'][token_id]  # (n_embd,)
pos_emb = state_dict['wpe'][pos_id]    # (n_embd,)
x = [t + p for t, p in zip(tok_emb, pos_emb)]
x = rmsnorm(x)
From model.py:70-73. Each token gets:
  • A learned embedding based on its vocabulary index
  • A learned positional embedding based on its sequence position
RMSNorm is applied immediately.

RMSNorm

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
From model.py:64-67. Root mean square normalization, cheaper than LayerNorm (no mean centering).
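A quick check with plain floats (made-up values, no Value nodes) confirms the output has unit RMS, up to the 1e-5 epsilon:

```python
def rmsnorm(x, eps=1e-5):
    # Same math as the rmsnorm above, on plain floats.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

x = [3.0, -4.0]       # mean square = (9 + 16) / 2 = 12.5
y = rmsnorm(x)
rms = (sum(v * v for v in y) / len(y)) ** 0.5
print(round(rms, 4))  # 1.0
```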

Transformer Layer

Each layer (from model.py:75-106):
for li in range(n_layer):
    x_residual = x
    x = rmsnorm(x)
    
    # Multi-head attention
    q = linear(x, state_dict[f'layer{li}.attn_wq'])
    k = linear(x, state_dict[f'layer{li}.attn_wk'])
    v = linear(x, state_dict[f'layer{li}.attn_wv'])
    keys[li].append(k)
    values[li].append(v)
    
    x_attn = []
    for h in range(n_head):
        # Per-head attention (see below)
        ...
    
    x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
    x = [a + b for a, b in zip(x, x_residual)]  # Residual
    
    # MLP
    x_residual = x
    x = rmsnorm(x)
    x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
    x = [xi.relu() for xi in x]
    x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
    x = [a + b for a, b in zip(x, x_residual)]  # Residual
The per-layer structure is two sublayers, each with pre-norm and a residual connection:
  1. RMSNorm + Attention: normalize, compute Q/K/V, run multi-head attention, add residual
  2. RMSNorm + MLP: normalize, expand 4x with ReLU, project back, add residual

Multi-Head Causal Attention

for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    
    attn_logits = [
        sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
        for t in range(len(k_h))
    ]
    attn_weights = softmax(attn_logits)
    
    head_out = [
        sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
        for j in range(head_dim)
    ]
    x_attn.extend(head_out)
From model.py:84-98. For each head:
  1. Slice Q/K/V to get head-specific dimensions
  2. Compute attention scores (Q·K / √d)
  3. Softmax to get weights
  4. Weighted sum of values
Causality is enforced by only storing keys/values for positions ≤ current position. The keys and values lists grow as the model generates tokens.
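The per-head loop can be exercised with plain floats (made-up numbers, no Value nodes) to see how the growing KV lists and softmax normalization interact:

```python
import math

def softmax(logits):
    # Numerically stable softmax on plain floats.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

head_dim = 4
# KV cache so far: this head's key/value slices for positions 0..2.
k_h = [[0.1] * head_dim, [0.5] * head_dim, [0.9] * head_dim]
v_h = [[1.0] * head_dim, [2.0] * head_dim, [3.0] * head_dim]
q_h = [1.0] * head_dim  # query slice for the current position

attn_logits = [
    sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim ** 0.5
    for t in range(len(k_h))
]
attn_weights = softmax(attn_logits)
head_out = [
    sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
    for j in range(head_dim)
]
# The weights form a distribution, and the key most aligned with q wins.
print(round(sum(attn_weights), 6), attn_weights.index(max(attn_weights)))
```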

Output Head

logits = linear(x, state_dict['lm_head'])
return logits
From model.py:108-109. Final linear layer projects to vocabulary size (74 logits).

Training

The training loop uses cross-entropy loss and Adam optimizer:
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    n = min(self.block_size, len(tokens) - 1)
    if n <= 0:
        return 0.0
    
    keys, values = self.fresh_kv()
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    
    loss = (1 / n) * sum(losses)
    loss.backward()
    
    # Adam update
    self.step_count += 1
    lr_t = lr * (1 - self.step_count / max(self.step_count + 1, 5000))
    for i, p in enumerate(self.params):
        self.m[i] = beta1 * self.m[i] + (1 - beta1) * p.grad
        self.v[i] = beta2 * self.v[i] + (1 - beta2) * p.grad ** 2
        m_hat = self.m[i] / (1 - beta1 ** self.step_count)
        v_hat = self.v[i] / (1 - beta2 ** self.step_count)
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0
    
    return loss.data
From model.py:157-184.

Loss Computation

Each training step proceeds in four stages:
  1. Forward pass: run the model on tokens[0:n] to predict tokens[1:n+1]
  2. Cross-entropy loss: -log(prob[target]) at each position, averaged over positions
  3. Backprop: call loss.backward() to compute gradients
  4. Adam update: apply momentum and adaptive learning rates to every parameter
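The Adam update can be traced by hand for a single scalar parameter (illustrative values; note how the bias correction makes m_hat equal the raw gradient on step 1):

```python
# Toy Adam update on one scalar parameter, plain floats.
beta1, beta2, eps, lr = 0.85, 0.99, 1e-8, 0.01
p, grad = 0.5, 2.0
m = v = 0.0
t = 1  # first step

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)  # step 1: recovers the raw gradient, 2.0
v_hat = v / (1 - beta2 ** t)  # step 1: recovers grad**2, 4.0
p -= lr * m_hat / (v_hat ** 0.5 + eps)
print(round(p, 6))  # 0.49: the step size is ~lr, independent of gradient scale
```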

Training Results

From the README:
Loss: 4.47 → 0.25 (random baseline: ln(74) ≈ 4.3)
Physical validity: 95% — moves are adjacent cells, positions in bounds
Rule validity: 100% — EAT→GROW+FOOD_SPAWN, DIE→EOS
The model learned the game rules from event sequences alone, with no explicit supervision.
Training: 200 episodes, 5000 steps, ~36 hours on CPU. From train.py (not shown), the training script loads episodes from episodes.json and calls train_step() in a loop.

Sampling

Generation uses temperature-controlled sampling:
def sample(self, bos_id, eos_id, temperature=0.5, max_len=None):
    if max_len is None:
        max_len = self.block_size
    keys, values = self.fresh_kv()
    token_id = bos_id
    result = [bos_id]
    for pos_id in range(max_len - 1):
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(
            range(self.vocab_size), 
            weights=[p.data for p in probs]
        )[0]
        result.append(token_id)
        if token_id == eos_id:
            break
    return result
From model.py:186-199.

Temperature

  • temperature < 1.0: Sharper distribution, more deterministic
  • temperature = 1.0: Use raw probabilities
  • temperature > 1.0: Flatter distribution, more random
From sample.py, the default is temperature=0.5.
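The effect is easy to see by running the same logits through softmax at different temperatures (a standalone sketch with made-up logits):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
results = {}
for temp in (0.5, 1.0, 2.0):
    results[temp] = softmax([l / temp for l in logits])

# Lower temperature concentrates probability mass on the top logit.
for temp, probs in results.items():
    print(temp, [round(p, 3) for p in probs])
```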

KV Cache

The keys and values lists accumulate across generation:
def fresh_kv(self):
    return [[] for _ in range(self.n_layer)], [[] for _ in range(self.n_layer)]
From model.py:154-155. Each time a token is processed, its key and value are appended. This implements efficient causal attention without recomputing past positions.

Weight Persistence

Weights are saved as plain text:
def save_weights(self, path):
    """Save model weights as plain text."""
    with open(path, 'w') as f:
        for name, mat in self.state_dict.items():
            for r, row in enumerate(mat):
                vals = ' '.join(f'{p.data:.8f}' for p in row)
                f.write(f'{name}|{r}|{vals}\n')
From model.py:201-207. Format: parameter_name|row_index|space_separated_floats Example:
wte|0|0.05123456 -0.02345678 0.01234567 ...
wte|1|-0.01234567 0.03456789 -0.00123456 ...
No binary formats, no pickle. The entire model is human-inspectable text. See weights.txt in the repo (372KB).
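A matching reader is straightforward. This is a sketch of a loader for the format above (load_weights is a hypothetical helper, not part of model.py):

```python
def load_weights(path):
    """Parse the name|row|floats text format back into
    {parameter_name: [[float, ...], ...]}. Hypothetical helper."""
    state = {}
    with open(path) as f:
        for line in f:
            name, r, vals = line.rstrip('\n').split('|')
            row = [float(v) for v in vals.split()]
            state.setdefault(name, [])
            assert int(r) == len(state[name])  # rows were written in order
            state[name].append(row)
    return state
```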

Comparison to Modern Transformers

Feature              GameGPT             GPT-2 / GPT-3
Embeddings           Learned             Learned
Attention            Multi-head causal   Multi-head causal
Normalization        RMSNorm (pre-norm)  LayerNorm (pre-norm)
MLP activation       ReLU                GELU
Positional encoding  Learned             Learned
Optimizer            Adam                AdamW
Implementation       Pure Python         PyTorch/JAX
GameGPT uses simpler choices (ReLU, RMSNorm) for implementation simplicity and speed on CPU.

Scaling Plan

From the README roadmap, future scaling directions:
  • Bigger models: More layers, larger embedding dim, more heads
  • Longer context: Increase block_size beyond 64
  • More games: Pac-Man (multi-entity), Survivor (massive scale), Chess (turn-based)
  • Ouroboros: Feed model predictions back into the game
Current bottleneck: “Structural validity is 45% because the model often hits the 64-token context limit mid-sequence.” A longer context window would improve completion rates.

Next Steps

  • Tokenization: understand how events become tokens
  • Theory: review the Wittgensteinian foundation
