The GameGPT model is trained on tokenized episodes using next-token prediction with the Adam optimizer. Training runs for 5000 steps with random episode sampling and sub-sequence extraction.
Model Initialization
GameGPT is a causal transformer adapted from Karpathy’s microgpt, implemented in pure Python with a custom autograd engine.
from game_grammar.model import GameGPT
from game_grammar.vocab import VOCAB_SIZE

model = GameGPT(
    vocab_size=VOCAB_SIZE,  # 74 tokens
    n_layer=2,
    n_embd=32,
    block_size=64,
    n_head=4,
    seed=42,
)
- vocab_size: Size of the token vocabulary (BOS, EOS, TICK, SNAP, events, positions, etc.)
- n_layer: Number of transformer layers
- n_embd: Embedding dimension (must be divisible by n_head)
- block_size: Maximum context window length in tokens
- n_head: Number of attention heads
Total parameters: ~31,000
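That figure follows directly from the matrix shapes; a quick check:

embeddings = 74 * 32 + 64 * 32 + 74 * 32          # wte + wpe + lm_head = 6,784
per_layer  = 4 * 32 * 32 + 128 * 32 + 32 * 128    # q, k, v, o projections + mlp_fc1 + mlp_fc2 = 12,288
print(embeddings + 2 * per_layer)                 # 31,360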
Architecture Components
The model includes:
- Token embeddings (74 × 32)
- Position embeddings (64 × 32)
- 2 transformer layers with:
- Multi-head causal self-attention (4 heads)
- RMSNorm normalization
- ReLU feed-forward networks (32 → 128 → 32)
- Language model head (32 → 74)
import random

# Value is the scalar autograd node from the project's micrograd-style engine.
class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, block_size=64, n_head=4, seed=42):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_embd = n_embd
        self.block_size = block_size
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.seed = seed
        rng = random.Random(seed)
        # (nout x nin) matrix of trainable Values with Gaussian initialization
        matrix = lambda nout, nin, std=0.08: [
            [Value(rng.gauss(0, std)) for _ in range(nin)] for _ in range(nout)
        ]
        self.state_dict = {
            'wte': matrix(vocab_size, n_embd),      # token embeddings
            'wpe': matrix(block_size, n_embd),      # position embeddings
            'lm_head': matrix(vocab_size, n_embd),  # output projection to vocabulary
        }
        for i in range(n_layer):
            self.state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
            self.state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
        # Flat parameter list for the optimizer
        self.params = [p for mat in self.state_dict.values() for row in mat for p in row]
        # Adam buffers
        self.m = [0.0] * len(self.params)
        self.v = [0.0] * len(self.params)
        self.step_count = 0
All parameters are initialized with Gaussian noise (μ=0, σ=0.08). Adam momentum buffers are initialized to zero.
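The forward pass itself is not reproduced in this section. As a rough illustration of how the components listed above compose for a single token, here is a minimal sketch; it assumes the project's Value supports +, *, ** and relu(), that softmax() accepts a list of Values (as in train_step() below), and that the keys/values caches are per-layer lists. It is a sketch under those assumptions, not the actual implementation.

def matvec(mat, vec):
    # (nout x nin) matrix of Values times a length-nin vector of Values
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

def rmsnorm(vec, eps=1e-5):
    ms = sum(x * x for x in vec) * (1 / len(vec))
    inv = (ms + eps) ** -0.5
    return [x * inv for x in vec]

def forward_sketch(model, token_id, pos_id, keys, values):
    sd = model.state_dict
    hd = model.head_dim
    # token embedding + position embedding
    x = [t + p for t, p in zip(sd['wte'][token_id], sd['wpe'][pos_id])]
    for i in range(model.n_layer):
        # multi-head causal self-attention over everything cached so far
        xn = rmsnorm(x)
        q = matvec(sd[f'layer{i}.attn_wq'], xn)
        keys[i].append(matvec(sd[f'layer{i}.attn_wk'], xn))
        values[i].append(matvec(sd[f'layer{i}.attn_wv'], xn))
        heads = []
        for h in range(model.n_head):
            qh = q[h * hd:(h + 1) * hd]
            scores = [sum(a * b for a, b in zip(qh, k[h * hd:(h + 1) * hd])) * hd ** -0.5
                      for k in keys[i]]
            attn = softmax(scores)
            heads += [sum(w * v[h * hd + d] for w, v in zip(attn, values[i]))
                      for d in range(hd)]
        x = [a + b for a, b in zip(x, matvec(sd[f'layer{i}.attn_wo'], heads))]
        # position-wise feed-forward with ReLU (32 -> 128 -> 32)
        xn = rmsnorm(x)
        h1 = [v.relu() for v in matvec(sd[f'layer{i}.mlp_fc1'], xn)]
        x = [a + b for a, b in zip(x, matvec(sd[f'layer{i}.mlp_fc2'], h1))]
    # project back to vocabulary logits
    return matvec(sd['lm_head'], rmsnorm(x))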
Training Loop
The training loop runs for 5000 steps, randomly sampling episodes and sub-sequences.
import random

num_steps = 5000
rng = random.Random(42)

# episodes: tokenized training episodes prepared earlier in the pipeline
for step in range(num_steps):
    ep = rng.choice(episodes)
    # Random offset within episode
    if len(ep) > model.block_size + 1:
        start = rng.randint(0, len(ep) - model.block_size - 1)
        tokens = ep[start:start + model.block_size + 1]
    else:
        tokens = ep
    loss = model.train_step(tokens, lr=0.01)
    if (step + 1) % 100 == 0 or step == 0:
        print(f"step {step+1:5d} / {num_steps} | loss {loss:.4f}")
Key details:
- Random episode selection with replacement
- Random sub-sequence extraction (block_size + 1 tokens)
- No explicit train/validation split (evaluation uses sampling validation)
- Loss logged every 100 steps
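If you do want a rough held-out check, a loss-only pass can reuse the same forward path as train_step() without calling backward() or updating parameters. A minimal sketch, assuming the forward(), softmax(), and fresh_kv() helpers used in train_step() below (held_out is a hypothetical list of episodes kept aside for this purpose):

import math

def eval_loss(model, episode):
    """Average next-token loss on one episode, with no parameter updates."""
    n = min(model.block_size, len(episode) - 1)
    if n <= 0:
        return 0.0
    keys, values = model.fresh_kv()
    total = 0.0
    for pos_id in range(n):
        token_id, target_id = episode[pos_id], episode[pos_id + 1]
        logits = model.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        total += -math.log(probs[target_id].data)
    return total / n

# e.g. average over a handful of held-out episodes
# print(sum(eval_loss(model, ep) for ep in held_out) / len(held_out))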
train_step() Method
The train_step() method implements forward pass, loss computation, backpropagation, and parameter updates.
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    n = min(self.block_size, len(tokens) - 1)
    if n <= 0:
        return 0.0
    keys, values = self.fresh_kv()  # fresh key/value caches for this sequence
    losses = []
    # Forward pass: predict each next token autoregressively
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = self.forward(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()  # cross-entropy at this position
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)
    loss.backward()
    self.step_count += 1
    # Linear decay, clamped so the rate stays positive past 5000 steps
    lr_t = lr * (1 - self.step_count / max(self.step_count + 1, 5000))
    # Adam update with bias correction
    for i, p in enumerate(self.params):
        self.m[i] = beta1 * self.m[i] + (1 - beta1) * p.grad
        self.v[i] = beta2 * self.v[i] + (1 - beta2) * p.grad ** 2
        m_hat = self.m[i] / (1 - beta1 ** self.step_count)
        v_hat = self.v[i] / (1 - beta2 ** self.step_count)
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0
    return loss.data
Loss Computation
- Forward pass: Process each token position autoregressively
- Cross-entropy: -log(p[target]) for each predicted token
- Average: Mean loss across all positions in the sequence
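As a tiny worked example of the per-position term (illustrative numbers only):

import math

# Suppose the model assigns probability 0.61 to the correct next token:
p_target = 0.61
print(-math.log(p_target))  # ≈ 0.494, the cross-entropy contribution of that position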
Optimizer: Adam
Adam optimizer with bias correction:
- lr: Base learning rate (decays linearly over 5000 steps)
- beta1: Exponential decay rate for the first moment (momentum)
- beta2: Exponential decay rate for the second moment (RMSProp)
- eps: Small constant for numerical stability
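Bias correction matters most in the first few steps, when m and v are still close to their zero initialization. A small numeric check with the defaults (beta1=0.85, beta2=0.99) at step 1:

beta1, beta2 = 0.85, 0.99
grad = 0.5  # pretend gradient for one parameter at step 1

m = (1 - beta1) * grad        # 0.075  -- heavily biased toward 0
v = (1 - beta2) * grad ** 2   # 0.0025
m_hat = m / (1 - beta1 ** 1)  # 0.5    -- correction recovers the gradient scale
v_hat = v / (1 - beta2 ** 1)  # 0.25
print(m_hat, v_hat)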
Learning Rate Schedule
Linear decay from the initial learning rate to zero over 5000 steps:
lr_t = lr * (1 - step_count / 5000)
In train_step() the denominator is clamped via max(step_count + 1, 5000), so the rate stays positive if training runs past 5000 steps. The schedule otherwise assumes exactly 5000 training steps; if you train for more or fewer steps, adjust the denominator accordingly.
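With the base rate of 0.01, the schedule gives, for example:

lr = 0.01
for step_count in (1, 1250, 2500, 3750, 5000):
    print(step_count, lr * (1 - step_count / 5000))
# 1    -> 0.009998
# 1250 -> 0.0075
# 2500 -> 0.005
# 3750 -> 0.0025
# 5000 -> 0.0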
Loss Progression
Typical training run (200 episodes, 5000 steps):
step 1 / 5000 | loss 4.4723
step 100 / 5000 | loss 2.8934
step 200 / 5000 | loss 2.1456
step 300 / 5000 | loss 1.6782
step 400 / 5000 | loss 1.3421
step 500 / 5000 | loss 1.0893
step 1000 / 5000 | loss 0.7234
step 1500 / 5000 | loss 0.5421
step 2000 / 5000 | loss 0.4312
step 2500 / 5000 | loss 0.3678
step 3000 / 5000 | loss 0.3241
step 3500 / 5000 | loss 0.2934
step 4000 / 5000 | loss 0.2723
step 4500 / 5000 | loss 0.2589
step 5000 / 5000 | loss 0.2501
Final loss: 4.47 → 0.25
Random baseline: ln(74) ≈ 4.30 (uniform distribution over 74 tokens)
The model quickly learns structural patterns (steps 1-500) and then refines physical and rule-based constraints (steps 500-5000). Loss below 0.3 indicates strong next-token prediction accuracy.
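The baseline and the final loss translate into probabilities as follows (a quick check):

import math

print(math.log(74))    # ≈ 4.304 -- loss of a uniform guess over the 74-token vocabulary
print(math.exp(0.25))  # ≈ 1.284 -- perplexity at the final loss of ~0.25
print(math.exp(-0.25)) # ≈ 0.779 -- geometric mean of p[target] across positions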
Weight Persistence
Trained weights are saved as plain text for portability:
def save_weights(self, path):
    """Save model weights as plain text."""
    with open(path, 'w') as f:
        for name, mat in self.state_dict.items():
            for r, row in enumerate(mat):
                vals = ' '.join(f'{p.data:.8f}' for p in row)
                f.write(f'{name}|{r}|{vals}\n')
Format: parameter_name|row_index|space_separated_values
Weights are saved to weights.txt after training completes.
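A matching loader is the straightforward inverse of this format. A minimal sketch; the name load_weights and its placement on the class are assumptions, not part of the documented API:

def load_weights(self, path):
    """Restore weights written by save_weights() into state_dict (assumed helper)."""
    with open(path) as f:
        for line in f:
            name, r, vals = line.rstrip('\n').split('|')
            row = self.state_dict[name][int(r)]
            for p, v in zip(row, vals.split()):
                p.data = float(v)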