The GameGPT model is trained on tokenized episodes using next-token prediction with the Adam optimizer. Training runs for 5000 steps with random episode sampling and sub-sequence extraction.
Model Initialization
GameGPT is a causal transformer adapted from Karpathy’s microgpt, implemented in pure Python with a custom autograd engine.
from game_grammar.model import GameGPT
from game_grammar.vocab import VOCAB_SIZE

model = GameGPT(
    vocab_size=VOCAB_SIZE,  # 74 tokens
    n_layer=2,
    n_embd=32,
    block_size=64,
    n_head=4,
    seed=42,
)
- vocab_size: Size of the token vocabulary (BOS, EOS, TICK, SNAP, events, positions, etc.)
- n_layer: Number of transformer layers
- n_embd: Embedding dimension (must be divisible by n_head)
- block_size: Maximum context window length in tokens
- n_head: Number of attention heads
Total parameters: ~31,000
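That figure follows directly from the matrix shapes; a quick check:

embeddings = 74 * 32 + 64 * 32 + 74 * 32          # wte + wpe + lm_head = 6,784
per_layer  = 4 * 32 * 32 + 128 * 32 + 32 * 128    # q, k, v, o projections + mlp_fc1 + mlp_fc2 = 12,288
print(embeddings + 2 * per_layer)                 # 31,360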
Architecture Components
The model includes:
- Token embeddings (74 × 32)
- Position embeddings (64 × 32)
- 2 transformer layers with:
- Multi-head causal self-attention (4 heads)
- RMSNorm normalization
- ReLU feed-forward networks (32 → 128 → 32)
- Language model head (32 → 74)
import random

# Value is the scalar autograd node from the project's micrograd-style engine.
class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, block_size=64, n_head=4, seed=42):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_embd = n_embd
        self.block_size = block_size
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.seed = seed
        rng = random.Random(seed)
        # (nout x nin) matrix of trainable Values with Gaussian initialization
        matrix = lambda nout, nin, std=0.08: [
            [Value(rng.gauss(0, std)) for _ in range(nin)] for _ in range(nout)
        ]
        self.state_dict = {
            'wte': matrix(vocab_size, n_embd),      # token embeddings
            'wpe': matrix(block_size, n_embd),      # position embeddings
            'lm_head': matrix(vocab_size, n_embd),  # output projection to vocabulary
        }
        for i in range(n_layer):
            self.state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
        # Flat parameter list for the optimizer
        self.params = [p for mat in self.state_dict.values() for row in mat for p in row]
        # Adam buffers
        self.m = [0.0] * len(self.params)
        self.v = [0.0] * len(self.params)
        self.step_count = 0
All parameters are initialized with Gaussian noise (μ=0, σ=0.08). Adam momentum buffers are initialized to zero.
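The forward pass itself is not reproduced in this section. As a rough illustration of how the components listed above compose for a single token, here is a minimal sketch; it assumes the project's Value supports +, *, ** and relu(), that softmax() accepts a list of Values (as in train_step() below), and that the keys/values caches are per-layer lists. It is a sketch under those assumptions, not the actual implementation.

def matvec(mat, vec):
    # (nout x nin) matrix of Values times a length-nin vector of Values
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

def rmsnorm(vec, eps=1e-5):
    ms = sum(x * x for x in vec) * (1 / len(vec))
    inv = (ms + eps) ** -0.5
    return [x * inv for x in vec]

def forward_sketch(model, token_id, pos_id, keys, values):
    sd = model.state_dict
    hd = model.head_dim
    # token embedding + position embedding
    x = [t + p for t, p in zip(sd['wte'][token_id], sd['wpe'][pos_id])]
    for i in range(model.n_layer):
        # multi-head causal self-attention over everything cached so far
        xn = rmsnorm(x)
        q = matvec(sd[f'layer{i}.attn_wq'], xn)
        keys[i].append(matvec(sd[f'layer{i}.attn_wk'], xn))
        values[i].append(matvec(sd[f'layer{i}.attn_wv'], xn))
        heads = []
        for h in range(model.n_head):
            qh = q[h * hd:(h + 1) * hd]
            scores = [sum(a * b for a, b in zip(qh, k[h * hd:(h + 1) * hd])) * hd ** -0.5
                      for k in keys[i]]
            attn = softmax(scores)
            heads += [sum(w * v[h * hd + d] for w, v in zip(attn, values[i]))
                      for d in range(hd)]
        x = [a + b for a, b in zip(x, matvec(sd[f'layer{i}.attn_wo'], heads))]
        # position-wise feed-forward with ReLU (32 -> 128 -> 32)
        xn = rmsnorm(x)
        h1 = [v.relu() for v in matvec(sd[f'layer{i}.mlp_fc1'], xn)]
        x = [a + b for a, b in zip(x, matvec(sd[f'layer{i}.mlp_fc2'], h1))]
    # project back to vocabulary logits
    return matvec(sd['lm_head'], rmsnorm(x))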
Training Loop
The training loop runs for 5000 steps, randomly sampling episodes and sub-sequences.
import random

num_steps = 5000
rng = random.Random(42)

# episodes: tokenized training episodes prepared earlier in the pipeline
for step in range(num_steps):
    ep = rng.choice(episodes)
    # Random offset within episode
    if len(ep) > model.block_size + 1:
        start = rng.randint(0, len(ep) - model.block_size - 1)
        tokens = ep[start:start + model.block_size + 1]
    else:
        tokens = ep
    loss = model.train_step(tokens, lr=0.01)
    if (step + 1) % 100 == 0 or step == 0:
        print(f"step {step+1:5d} / {num_steps} | loss {loss:.4f}")
Key details:
- Random episode selection with replacement
- Random sub-sequence extraction (block_size + 1 tokens)
- No explicit train/validation split (evaluation uses sampling validation)
- Loss logged every 100 steps
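If you do want a rough held-out check, a loss-only pass can reuse the same forward path as train_step() without calling backward() or updating parameters. A minimal sketch, assuming the forward(), softmax(), and fresh_kv() helpers used in train_step() below (held_out is a hypothetical list of episodes kept aside for this purpose):

import math

def eval_loss(model, episode):
    """Average next-token loss on one episode, with no parameter updates."""
    n = min(model.block_size, len(episode) - 1)
    if n <= 0:
        return 0.0
    keys, values = model.fresh_kv()
    total = 0.0
    for pos_id in range(n):
        token_id, target_id = episode[pos_id], episode[pos_id + 1]
        logits = model.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        total += -math.log(probs[target_id].data)
    return total / n

# e.g. average over a handful of held-out episodes
# print(sum(eval_loss(model, ep) for ep in held_out) / len(held_out))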
train_step() Method
The train_step() method implements forward pass, loss computation, backpropagation, and parameter updates.
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    n = min(self.block_size, len(tokens) - 1)
    if n <= 0:
        return 0.0
    keys, values = self.fresh_kv()  # fresh key/value caches for this sequence
    losses = []
    # Forward pass: predict each next token autoregressively
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()  # cross-entropy at this position
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)
    loss.backward()
    self.step_count += 1
    # Linear decay, clamped so the rate stays positive past 5000 steps
    lr_t = lr * (1 - self.step_count / max(self.step_count + 1, 5000))
    # Adam update with bias correction
    for i, p in enumerate(self.params):
        self.m[i] = beta1 * self.m[i] + (1 - beta1) * p.grad
        self.v[i] = beta2 * self.v[i] + (1 - beta2) * p.grad ** 2
        m_hat = self.m[i] / (1 - beta1 ** self.step_count)
        v_hat = self.v[i] / (1 - beta2 ** self.step_count)
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0
    return loss.data
Loss Computation
- Forward pass: Process each token position autoregressively
- Cross-entropy: -log(p[target]) for each predicted token
- Average: Mean loss across all positions in the sequence
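As a tiny worked example of the per-position term (illustrative numbers only):

import math

# Suppose the model assigns probability 0.61 to the correct next token:
p_target = 0.61
print(-math.log(p_target))  # ≈ 0.494, the cross-entropy contribution of that position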
Optimizer: Adam
Adam optimizer with bias correction:
- lr: Base learning rate (decays linearly over 5000 steps)
- beta1: Exponential decay rate for the first moment (momentum)
- beta2: Exponential decay rate for the second moment (RMSProp)
- eps: Small constant for numerical stability
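Bias correction matters most in the first few steps, when m and v are still close to their zero initialization. A small numeric check with the defaults (beta1=0.85, beta2=0.99) at step 1:

beta1, beta2 = 0.85, 0.99
grad = 0.5  # pretend gradient for one parameter at step 1

m = (1 - beta1) * grad        # 0.075  -- heavily biased toward 0
v = (1 - beta2) * grad ** 2   # 0.0025
m_hat = m / (1 - beta1 ** 1)  # 0.5    -- correction recovers the gradient scale
v_hat = v / (1 - beta2 ** 1)  # 0.25
print(m_hat, v_hat)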
Learning Rate Schedule
Linear decay from the initial learning rate to zero over 5000 steps:
lr_t = lr * (1 - step_count / 5000)
In train_step() the denominator is clamped via max(step_count + 1, 5000), so the rate stays positive if training runs past 5000 steps. The schedule otherwise assumes exactly 5000 training steps; if you train for more or fewer steps, adjust the denominator accordingly.
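With the base rate of 0.01, the schedule gives, for example:

lr = 0.01
for step_count in (1, 1250, 2500, 3750, 5000):
    print(step_count, lr * (1 - step_count / 5000))
# 1    -> 0.009998
# 1250 -> 0.0075
# 2500 -> 0.005
# 3750 -> 0.0025
# 5000 -> 0.0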
Loss Progression
Typical training run (200 episodes, 5000 steps):
step 1 / 5000 | loss 4.4723
step 100 / 5000 | loss 2.8934
step 200 / 5000 | loss 2.1456
step 300 / 5000 | loss 1.6782
step 400 / 5000 | loss 1.3421
step 500 / 5000 | loss 1.0893
step 1000 / 5000 | loss 0.7234
step 1500 / 5000 | loss 0.5421
step 2000 / 5000 | loss 0.4312
step 2500 / 5000 | loss 0.3678
step 3000 / 5000 | loss 0.3241
step 3500 / 5000 | loss 0.2934
step 4000 / 5000 | loss 0.2723
step 4500 / 5000 | loss 0.2589
step 5000 / 5000 | loss 0.2501
Final loss: 4.47 → 0.25
Random baseline: ln(74) ≈ 4.30 (uniform distribution over 74 tokens)
The model quickly learns structural patterns (steps 1-500) and then refines physical and rule-based constraints (steps 500-5000). Loss below 0.3 indicates strong next-token prediction accuracy.
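The baseline and the final loss translate into probabilities as follows (a quick check):

import math

print(math.log(74))    # ≈ 4.304 -- loss of a uniform guess over the 74-token vocabulary
print(math.exp(0.25))  # ≈ 1.284 -- perplexity at the final loss of ~0.25
print(math.exp(-0.25)) # ≈ 0.779 -- geometric mean of p[target] across positions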
Weight Persistence
Trained weights are saved as plain text for portability:
def save_weights(self, path):
    """Save model weights as plain text."""
    with open(path, 'w') as f:
        for name, mat in self.state_dict.items():
            for r, row in enumerate(mat):
                vals = ' '.join(f'{p.data:.8f}' for p in row)
                f.write(f'{name}|{r}|{vals}\n')
Format: parameter_name|row_index|space_separated_values
Weights are saved to weights.txt after training completes.
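A matching loader is the straightforward inverse of this format. A minimal sketch; the name load_weights and its placement on the class are assumptions, not part of the documented API:

def load_weights(self, path):
    """Restore weights written by save_weights() into state_dict (assumed helper)."""
    with open(path) as f:
        for line in f:
            name, r, vals = line.rstrip('\n').split('|')
            row = self.state_dict[name][int(r)]
            for p, v in zip(row, vals.split()):
                p.data = float(v)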