The GameGPT model provides a lightweight, custom-built GPT implementation with autograd capabilities for training on game-grammar tokens.

Value

Custom autograd value class for automatic differentiation.
from game_grammar.model import Value

class Value:
    def __init__(self, data, children=(), local_grads=())

Constructor Parameters

  • data (float, required): The numerical value
  • children (tuple, default ()): Tuple of parent Value nodes in the computation graph
  • local_grads (tuple, default ()): Tuple of local gradients with respect to each child

Attributes

  • data (float): The actual numerical value
  • grad (float): The accumulated gradient (initialized to 0)

Methods

  • backward(): Compute gradients via backpropagation through the computation graph

Supported Operations

  • Arithmetic: +, -, *, /, ** (power)
  • Activations: .relu(), .exp(), .log()
  • Reverse operations: All operations support reversed operands (e.g., 5 + value)

Example

from game_grammar.model import Value

# Build computation graph
a = Value(2.0)
b = Value(3.0)
c = a * b + a**2

# Compute gradients
c.backward()
print(f"dc/da = {a.grad}")  # 7.0 = b.data + 2*a.data
print(f"dc/db = {b.grad}")  # 2.0 = a.data
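The mechanism behind backward() can be illustrated with a minimal standalone sketch of the (data, children, local_grads) pattern described above. This is an independent toy, not the actual game_grammar.model.Value implementation:

```python
# Toy reverse-mode autodiff in the (data, children, local_grads) style
# described above. Not the game_grammar implementation; a* a stands in
# for a**2 to keep the sketch small.

class TinyValue:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __mul__(self, other):
        # d(a*b)/da = b.data, d(a*b)/db = a.data
        return TinyValue(self.data * other.data,
                         children=(self, other),
                         local_grads=(other.data, self.data))

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return TinyValue(self.data + other.data,
                         children=(self, other),
                         local_grads=(1.0, 1.0))

    def backward(self):
        # Chain rule: push the upstream grad to each child, scaled by the
        # stored local gradient. A full implementation visits nodes in
        # reverse topological order; this simple traversal suffices here
        # because the only shared node (a) is a leaf.
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for child, lg in zip(node._children, node._local_grads):
                child.grad += node.grad * lg
                stack.append(child)

a = TinyValue(2.0)
b = TinyValue(3.0)
c = a * b + a * a          # same graph shape as the example above
c.backward()
print(a.grad, b.grad)      # 7.0 2.0
```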

GameGPT

Transformer-based language model for game-grammar token generation.
from game_grammar.model import GameGPT

class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, 
                 block_size=64, n_head=4, seed=42)

Constructor Parameters

  • vocab_size (int, default 74): Size of the token vocabulary
  • n_layer (int, default 2): Number of transformer layers
  • n_embd (int, default 32): Embedding dimension size
  • block_size (int, default 64): Maximum context length (sequence length)
  • n_head (int, default 4): Number of attention heads per layer
  • seed (int, default 42): Random seed for weight initialization
The n_embd parameter must be divisible by n_head. The head dimension is automatically calculated as n_embd // n_head.
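With the default values, the constraint and the resulting head dimension work out as follows:

```python
# n_embd must divide evenly by n_head (constraint stated above);
# the per-head dimension is the quotient.
n_embd, n_head = 32, 4
assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
head_dim = n_embd // n_head
print(head_dim)  # 8
```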

Attributes

  • state_dict (dict): Dictionary containing all model weights:
      • wte: token embeddings (vocab_size × n_embd)
      • wpe: position embeddings (block_size × n_embd)
      • lm_head: output projection (vocab_size × n_embd)
      • layer{i}.attn_wq/wk/wv/wo: attention weights
      • layer{i}.mlp_fc1/fc2: MLP weights
  • params (list[Value]): Flat list of all trainable parameters as Value objects
  • m (list[float]): Adam optimizer first-moment estimates
  • v (list[float]): Adam optimizer second-moment estimates
  • step_count (int): Number of training steps performed

Methods

forward

Run a single forward pass through the model.
def forward(self, token_id, pos_id, keys, values)
  • token_id (int, required): Index of the input token (0 to vocab_size - 1)
  • pos_id (int, required): Position index in the sequence (0 to block_size - 1)
  • keys (list[list], required): Cached key vectors from previous tokens, structured as [layer_idx][token_idx][dim]
  • values (list[list], required): Cached value vectors from previous tokens, structured as [layer_idx][token_idx][dim]
  • Returns (list[Value]): List of logits (one per vocabulary entry) as Value objects
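The keys/values caches are nested lists indexed as [layer_idx][token_idx][dim]. A sketch of how such caches grow over an autoregressive loop, using dummy placeholder vectors (in the real model each forward() call contributes the actual key/value projections; whether each cached vector is n_embd-wide or head_dim-wide is an implementation detail, and n_embd is assumed here for illustration):

```python
# Sketch of the KV-cache layout described above: keys[layer][token][dim].
# The vectors are zero-filled placeholders, not real projections.
n_layer = 2
dim = 32                               # assumed n_embd-wide; see lead-in

keys = [[] for _ in range(n_layer)]    # one empty cache per layer
values = [[] for _ in range(n_layer)]

for token_idx in range(3):             # simulate processing 3 tokens
    for layer in range(n_layer):
        keys[layer].append([0.0] * dim)
        values[layer].append([0.0] * dim)

print(len(keys), len(keys[0]), len(keys[0][0]))  # 2 3 32
```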

train_step

Perform one training step with Adam optimizer.
def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8)
  • tokens (list[int], required): List of token IDs for training; must contain at least 2 tokens
  • lr (float, default 0.01): Base learning rate (with linear warmup/decay)
  • beta1 (float, default 0.85): Adam exponential decay rate for the first moment
  • beta2 (float, default 0.99): Adam exponential decay rate for the second moment
  • eps (float, default 1e-8): Adam epsilon for numerical stability
  • Returns (float): Average cross-entropy loss for the sequence
The learning rate is automatically scheduled with linear decay: lr_t = lr * (1 - step_count / max(step_count + 1, 5000))
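The decay formula above can be evaluated directly. Note that the max() term keeps the rate strictly positive even past step 5000:

```python
# Linear learning-rate decay as stated above:
# lr_t = lr * (1 - step_count / max(step_count + 1, 5000))
def scheduled_lr(lr, step_count):
    return lr * (1 - step_count / max(step_count + 1, 5000))

print(scheduled_lr(0.01, 0))      # 0.01 (full rate at step 0)
print(scheduled_lr(0.01, 2500))   # 0.005 (halfway through decay)
print(scheduled_lr(0.01, 4999))   # 2e-06 (nearly decayed, still > 0)
```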

Example

model = GameGPT(vocab_size=100, n_layer=3, n_embd=64)

# Training loop
for epoch in range(10):
    for batch in dataset:
        loss = model.train_step(
            tokens=batch,
            lr=0.001,
            beta1=0.9,
            beta2=0.999
        )
        print(f"Loss: {loss:.4f}")

sample

Generate a sequence of tokens using the trained model.
def sample(self, bos_id, eos_id, temperature=0.5, max_len=None)
  • bos_id (int, required): Token ID for beginning-of-sequence (BOS)
  • eos_id (int, required): Token ID for end-of-sequence (EOS); generation stops when this is sampled
  • temperature (float, default 0.5): Sampling temperature; values below 1.0 make output more deterministic, values above 1.0 increase randomness
  • max_len (int, default block_size): Maximum sequence length to generate; defaults to the model's block_size
  • Returns (list[int]): List of generated token IDs, including the BOS and EOS tokens

Example

model = GameGPT(vocab_size=100)

# Generate a sequence
sequence = model.sample(
    bos_id=0,
    eos_id=1,
    temperature=0.7,
    max_len=50
)

print(f"Generated {len(sequence)} tokens: {sequence}")
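The documentation describes temperature's effect but not the exact procedure; assuming sample() uses conventional temperature-scaled softmax sampling, the effect can be sketched in isolation:

```python
import math

# Conventional temperature-scaled softmax. How sample() applies
# temperature internally is an assumption; the docs only describe
# its effect on determinism.
def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)   # sharper: favors top logit
hot = softmax_with_temperature(logits, 2.0)    # flatter: closer to uniform
print(max(cold) > max(hot))  # True
```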

save_weights

Save model weights to a plain text file.
def save_weights(self, path)
  • path (str, required): File path where weights will be saved
Weights are saved in plain text format with 8 decimal places: {layer_name}|{row_idx}|{space-separated values}

Example

model = GameGPT()
model.save_weights("checkpoints/model_epoch10.txt")
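Based on the line format described above, one weight row can be serialized and parsed back like this (a sketch of the format, not the library's own loader; the layer name is a placeholder):

```python
# Round-trip one weight row in the documented format:
# {layer_name}|{row_idx}|{space-separated values}, 8 decimal places.
row = [0.12345678, -0.5, 1.0]
line = "wte|0|" + " ".join(f"{w:.8f}" for w in row)
print(line)  # wte|0|0.12345678 -0.50000000 1.00000000

name, idx, vals = line.split("|")
parsed = [float(v) for v in vals.split()]
print(name, int(idx), parsed)  # wte 0 [0.12345678, -0.5, 1.0]
```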

load_weights

Load model weights from a plain text file.
def load_weights(self, path)
  • path (str, required): File path to load weights from
The model must be initialized with the same architecture parameters (vocab_size, n_layer, n_embd, etc.) as when weights were saved.

Example

model = GameGPT(vocab_size=100, n_layer=3, n_embd=64)
model.load_weights("checkpoints/model_epoch10.txt")

Complete Usage Example

from game_grammar.model import GameGPT

# Initialize model
model = GameGPT(
    vocab_size=100,
    n_layer=3,
    n_embd=64,
    block_size=128,
    n_head=8,
    seed=42
)

# Training (num_epochs and training_data are placeholders)
step = 0
for epoch in range(num_epochs):
    for tokens in training_data:
        loss = model.train_step(tokens, lr=0.001)
        if step % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")
        step += 1

    # Save checkpoint
    model.save_weights(f"checkpoint_epoch_{epoch}.txt")

Architecture Details

The GameGPT model implements a decoder-only transformer with:
  • RMSNorm instead of LayerNorm for efficiency
  • Causal attention with KV caching for autoregressive generation
  • Adam optimizer with bias correction
  • Custom autograd via the Value class (no external ML frameworks)
  • Plain text weight format for easy inspection and debugging
The model is specifically optimized for game-grammar token sequences and supports:
  • Small vocabulary sizes (default 74 tokens)
  • Short context windows (default 64 tokens)
  • Fast training on CPU
  • Minimal dependencies
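RMSNorm, mentioned above, normalizes by the root-mean-square of the activations rather than by mean and variance as LayerNorm does. A minimal sketch (the epsilon value and the absence of a learned scale are assumptions, not details from the model):

```python
import math

# Minimal RMSNorm sketch: x / sqrt(mean(x^2) + eps).
# eps and the lack of a learned gain are illustrative assumptions.
def rmsnorm(x, eps=1e-5):
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]

out = rmsnorm([3.0, 4.0])
# mean square = 12.5, rms ~= 3.5355, so out ~= [0.8485, 1.1314]
print(out)
```

Unlike LayerNorm, no mean is subtracted, which saves a pass over the activations; this is the efficiency trade-off the bullet above refers to.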
