The GameGPT model is trained on tokenized episodes using next-token prediction with the Adam optimizer. Training runs for 5000 steps with random episode sampling and sub-sequence extraction.

Model Initialization

GameGPT is a causal transformer adapted from Karpathy’s microgpt, implemented in pure Python with custom autograd.
scripts/train.py
import random

from game_grammar.model import GameGPT
from game_grammar.vocab import VOCAB_SIZE

model = GameGPT(
    vocab_size=VOCAB_SIZE,  # 74 tokens
    n_layer=2,
    n_embd=32,
    block_size=64,
    n_head=4,
    seed=42,
)
  • vocab_size (int, default 74): Size of the token vocabulary (BOS, EOS, TICK, SNAP, events, positions, etc.)
  • n_layer (int, default 2): Number of transformer layers
  • n_embd (int, default 32): Embedding dimension (must be divisible by n_head)
  • block_size (int, default 64): Maximum context window length in tokens
  • n_head (int, default 4): Number of attention heads
Total parameters: ~31,000

Architecture Components

The model includes:
  • Token embeddings (74 × 32)
  • Position embeddings (64 × 32)
  • 2 transformer layers with:
    • Multi-head causal self-attention (4 heads)
    • RMSNorm normalization (see the sketch after this list)
    • ReLU feed-forward networks (32 → 128 → 32)
  • Language model head (32 → 74)
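RMSNorm scales each embedding vector by its root-mean-square rather than subtracting a mean as LayerNorm does. A minimal sketch of the operation (plain Python over a list of floats; whether the implementation adds a learned gain is not shown here):

import math

def rmsnorm(x, eps=1e-5):
    """Scale a vector by its root-mean-square: x_i / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

The weight matrices themselves are created in the constructor shown below.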
game_grammar/model.py
import random

class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, block_size=64, n_head=4, seed=42):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_embd = n_embd
        self.block_size = block_size
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.seed = seed

        rng = random.Random(seed)
        # matrix(nout, nin) builds an nout x nin grid of Value scalars (the custom
        # autograd node type), drawn from a Gaussian with mean 0 and std 0.08.
        matrix = lambda nout, nin, std=0.08: [
            [Value(rng.gauss(0, std)) for _ in range(nin)] for _ in range(nout)
        ]
        self.state_dict = {
            'wte': matrix(vocab_size, n_embd),
            'wpe': matrix(block_size, n_embd),
            'lm_head': matrix(vocab_size, n_embd),
        }
        for i in range(n_layer):
            self.state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)

        self.params = [p for mat in self.state_dict.values() for row in mat for p in row]

        # Adam buffers
        self.m = [0.0] * len(self.params)
        self.v = [0.0] * len(self.params)
        self.step_count = 0
All parameters are initialized with Gaussian noise (μ=0, σ=0.08). Adam momentum buffers are initialized to zero.
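As a quick sanity check, the weight shapes above account for the ~31,000 figure quoted earlier (no normalization or bias parameters are stored in state_dict):

vocab, embd, block, layers = 74, 32, 64, 2

embeddings = vocab * embd + block * embd             # wte + wpe
lm_head = vocab * embd                               # output projection
attn = 4 * embd * embd                               # wq, wk, wv, wo per layer
mlp = 4 * embd * embd + embd * 4 * embd              # fc1 (32 -> 128) + fc2 (128 -> 32)

print(embeddings + lm_head + layers * (attn + mlp))  # 31360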

Training Loop

The training loop runs for 5000 steps, randomly sampling episodes and sub-sequences.
scripts/train.py
num_steps = 5000
rng = random.Random(42)

for step in range(num_steps):
    ep = rng.choice(episodes)  # episodes: list of tokenized episodes (token-id sequences)
    # Random offset within episode
    if len(ep) > model.block_size + 1:
        start = rng.randint(0, len(ep) - model.block_size - 1)
        tokens = ep[start:start + model.block_size + 1]
    else:
        tokens = ep

    loss = model.train_step(tokens, lr=0.01)

    if (step + 1) % 100 == 0 or step == 0:
        print(f"step {step+1:5d} / {num_steps} | loss {loss:.4f}")
Key details:
  • Random episode selection with replacement
  • Random sub-sequence extraction (block_size + 1 tokens)
  • No explicit train/validation split (evaluation uses sampling validation)
  • Loss logged every 100 steps
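Because there is no explicit validation split, a rough held-out loss check can be improvised with the same forward pass train_step() uses. A minimal sketch, assuming softmax is importable from game_grammar.model alongside GameGPT and that a few episodes are set aside before training:

from game_grammar.model import softmax  # assumption: softmax lives next to GameGPT

def eval_loss(model, held_out, rng, n_eval=20):
    """Average next-token loss over a few held-out episodes, without updating weights."""
    total, count = 0.0, 0
    for _ in range(n_eval):
        tokens = rng.choice(held_out)[:model.block_size + 1]
        keys, values = model.fresh_kv()
        for pos_id in range(min(model.block_size, len(tokens) - 1)):
            logits = model.forward(tokens[pos_id], pos_id, keys, values)
            probs = softmax(logits)
            total += -probs[tokens[pos_id + 1]].log().data
            count += 1
    return total / max(count, 1)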

train_step() Method

The train_step() method implements forward pass, loss computation, backpropagation, and parameter updates.
game_grammar/model.py
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    n = min(self.block_size, len(tokens) - 1)
    if n <= 0:
        return 0.0

    keys, values = self.fresh_kv()
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)

    loss = (1 / n) * sum(losses)
    loss.backward()

    self.step_count += 1
    # Decayed learning rate; the max(...) keeps lr_t positive if training runs past 5000 steps.
    lr_t = lr * (1 - self.step_count / max(self.step_count + 1, 5000))
    for i, p in enumerate(self.params):
        self.m[i] = beta1 * self.m[i] + (1 - beta1) * p.grad
        self.v[i] = beta2 * self.v[i] + (1 - beta2) * p.grad ** 2
        m_hat = self.m[i] / (1 - beta1 ** self.step_count)
        v_hat = self.v[i] / (1 - beta2 ** self.step_count)
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0

    return loss.data

Loss Computation

  1. Forward pass: Process each token position autoregressively
  2. Cross-entropy: -log(p[target]) for each predicted token
  3. Average: Mean loss across all positions in the sequence
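In formula form, for a sampled window of n + 1 tokens t_0, …, t_n, the loss computed in train_step() is

\mathcal{L} = -\frac{1}{n}\sum_{i=0}^{n-1} \log p_\theta(t_{i+1} \mid t_0, \ldots, t_i)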

Optimizer: Adam

Adam optimizer with bias correction:
  • lr (float, default 0.01): Base learning rate (decays linearly over 5000 steps)
  • beta1 (float, default 0.85): Exponential decay rate for the first moment (momentum)
  • beta2 (float, default 0.99): Exponential decay rate for the second moment (RMSProp-style)
  • eps (float, default 1e-8): Small constant for numerical stability
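In standard notation, with gradient g_t for a parameter θ at optimizer step t, the update implemented in train_step() is

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - lr_t \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

The (1 − β^t) denominators correct the bias of the zero-initialized moment buffers during the first steps.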

Learning Rate Schedule

Linear decay from initial learning rate to zero over 5000 steps:
lr_t = lr * (1 - step_count / 5000)
In train_step() the denominator is actually max(step_count + 1, 5000), which keeps lr_t positive if training runs longer. The schedule assumes roughly 5000 training steps; if you train for more or fewer, adjust the constant accordingly.

Loss Progression

Typical training run (200 episodes, 5000 steps):
step     1 / 5000 | loss 4.4723
step   100 / 5000 | loss 2.8934
step   200 / 5000 | loss 2.1456
step   300 / 5000 | loss 1.6782
step   400 / 5000 | loss 1.3421
step   500 / 5000 | loss 1.0893
step  1000 / 5000 | loss 0.7234
step  1500 / 5000 | loss 0.5421
step  2000 / 5000 | loss 0.4312
step  2500 / 5000 | loss 0.3678
step  3000 / 5000 | loss 0.3241
step  3500 / 5000 | loss 0.2934
step  4000 / 5000 | loss 0.2723
step  4500 / 5000 | loss 0.2589
step  5000 / 5000 | loss 0.2501
Final loss: 4.47 → 0.25. For comparison, a uniform distribution over the 74-token vocabulary gives a baseline loss of ln(74) ≈ 4.30.
The model quickly learns structural patterns (steps 1-500) and then refines physical and rule-based constraints (steps 500-5000). Loss below 0.3 indicates strong next-token prediction accuracy.

Weight Persistence

Trained weights are saved as plain text for portability:
game_grammar/model.py
def save_weights(self, path):
    """Save model weights as plain text."""
    with open(path, 'w') as f:
        for name, mat in self.state_dict.items():
            for r, row in enumerate(mat):
                vals = ' '.join(f'{p.data:.8f}' for p in row)
                f.write(f'{name}|{r}|{vals}\n')
Format: parameter_name|row_index|space_separated_values
Weights are saved to weights.txt after training completes.
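
A matching loader is straightforward. The sketch below is hypothetical (the method name and its presence in the repository are assumptions); it parses each parameter_name|row_index|values line back into an existing model's state_dict:

def load_weights(self, path):
    """Hypothetical counterpart to save_weights(): restore weights from the text format."""
    with open(path) as f:
        for line in f:
            name, r, vals = line.rstrip('\n').split('|')
            row = self.state_dict[name][int(r)]
            for p, v in zip(row, vals.split()):
                p.data = float(v)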
