GameGPT Model

The GameGPT model provides a lightweight, custom-built GPT implementation with autograd capabilities for training on game-grammar tokens.

Value

Custom autograd value class for automatic differentiation.

from game_grammar.model import Value

class Value:
    def __init__(self, data, children=(), local_grads=())

Constructor Parameters

data

float

required

The numerical value

children

tuple

default:"()"

Tuple of parent Value nodes in the computation graph

local_grads

tuple

default:"()"

Tuple of local gradients with respect to each child

Attributes

data

float

The actual numerical value

grad

float

The accumulated gradient (initialized to 0)

Methods

backward()

method

Compute gradients via backpropagation through the computation graph

Supported Operations

Arithmetic: +, -, *, /, ** (power)
Activations: .relu(), .exp(), .log()
Reverse operations: All operations support reversed operands (e.g., 5 + value)

Example

from game_grammar.model import Value

# Build computation graph
a = Value(2.0)
b = Value(3.0)
c = a * b + a**2

# Compute gradients
c.backward()
print(f"dc/da = {a.grad}")  # 7.0 = b.data + 2*a.data
print(f"dc/db = {b.grad}")  # 2.0 = a.data

GameGPT

Transformer-based language model for game-grammar token generation.

from game_grammar.model import GameGPT

class GameGPT:
    def __init__(self, vocab_size=74, n_layer=2, n_embd=32, 
                 block_size=64, n_head=4, seed=42)

Constructor Parameters

vocab_size

int

default:"74"

Size of the token vocabulary

n_layer

int

default:"2"

Number of transformer layers

n_embd

int

default:"32"

Embedding dimension size

block_size

int

default:"64"

Maximum context length (sequence length)

n_head

int

default:"4"

Number of attention heads per layer

seed

int

default:"42"

Random seed for weight initialization

The n_embd parameter must be divisible by n_head. The head dimension is automatically calculated as n_embd // n_head.

Attributes

state_dict

dict

Dictionary containing all model weights:

wte: Token embeddings (vocab_size × n_embd)
wpe: Position embeddings (block_size × n_embd)
lm_head: Output projection (vocab_size × n_embd)
layer{i}.attn_wq/wk/wv/wo: Attention weights
layer{i}.mlp_fc1/fc2: MLP weights

params

list[Value]

Flat list of all trainable parameters as Value objects

list[float]

Adam optimizer first moment estimates

list[float]

Adam optimizer second moment estimates

step_count

int

Number of training steps performed

Methods

forward

Run a single forward pass through the model.

def forward(self, token_id, pos_id, keys, values)

token_id

int

required

Index of the input token (0 to vocab_size-1)

pos_id

int

required

Position index in the sequence (0 to block_size-1)

keys

list[list]

required

Cached key vectors from previous tokens, structured as [layer_idx][token_idx][dim]

values

list[list]

required

Cached value vectors from previous tokens, structured as [layer_idx][token_idx][dim]

return

list[Value]

List of logits (one per vocab entry) as Value objects

train_step

Perform one training step with Adam optimizer.

def train_step(self, tokens, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8)

tokens

list[int]

required

List of token IDs for training. Must contain at least 2 tokens.

float

default:"0.01"

Base learning rate (with linear warmup/decay)

beta1

float

default:"0.85"

Adam optimizer exponential decay rate for first moment

beta2

float

default:"0.99"

Adam optimizer exponential decay rate for second moment

eps

float

default:"1e-8"

Adam optimizer epsilon for numerical stability

return

float

Average cross-entropy loss for the sequence

The learning rate is automatically scheduled with linear decay: lr_t = lr * (1 - step_count / max(step_count + 1, 5000))

Example

model = GameGPT(vocab_size=100, n_layer=3, n_embd=64)

# Training loop
for epoch in range(10):
    for batch in dataset:
        loss = model.train_step(
            tokens=batch,
            lr=0.001,
            beta1=0.9,
            beta2=0.999
        )
        print(f"Loss: {loss:.4f}")

sample

Generate a sequence of tokens using the trained model.

def sample(self, bos_id, eos_id, temperature=0.5, max_len=None)

bos_id

int

required

Token ID for beginning-of-sequence (BOS)

eos_id

int

required

Token ID for end-of-sequence (EOS). Generation stops when this is sampled.

temperature

float

default:"0.5"

Sampling temperature. Lower values (< 1.0) make output more deterministic, higher values (> 1.0) increase randomness.

max_len

int

default:"block_size"

Maximum sequence length to generate. Defaults to model’s block_size.

return

list[int]

List of generated token IDs, including BOS and EOS tokens

Example

model = GameGPT(vocab_size=100)

# Generate a sequence
sequence = model.sample(
    bos_id=0,
    eos_id=1,
    temperature=0.7,
    max_len=50
)

print(f"Generated {len(sequence)} tokens: {sequence}")

save_weights

Save model weights to a plain text file.

def save_weights(self, path)

path

str

required

File path where weights will be saved

Weights are saved in plain text format with 8 decimal places: {layer_name}|{row_idx}|{space-separated values}

Example

model = GameGPT()
model.save_weights("checkpoints/model_epoch10.txt")

load_weights

Load model weights from a plain text file.

def load_weights(self, path)

path

str

required

File path to load weights from

The model must be initialized with the same architecture parameters (vocab_size, n_layer, n_embd, etc.) as when weights were saved.

Example

model = GameGPT(vocab_size=100, n_layer=3, n_embd=64)
model.load_weights("checkpoints/model_epoch10.txt")

Complete Usage Example

from game_grammar.model import GameGPT

# Initialize model
model = GameGPT(
    vocab_size=100,
    n_layer=3,
    n_embd=64,
    block_size=128,
    n_head=8,
    seed=42
)

# Training
for epoch in range(num_epochs):
    for tokens in training_data:
        loss = model.train_step(tokens, lr=0.001)
        if step % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")
    
    # Save checkpoint
    model.save_weights(f"checkpoint_epoch_{epoch}.txt")

Architecture Details

The GameGPT model implements a decoder-only transformer with:

RMSNorm instead of LayerNorm for efficiency
Causal attention with KV caching for autoregressive generation
Adam optimizer with bias correction
Custom autograd via the Value class (no external ML frameworks)
Plain text weight format for easy inspection and debugging

The model is specifically optimized for game-grammar token sequences and supports:

Small vocabulary sizes (default 74 tokens)
Short context windows (default 64 tokens)
Fast training on CPU
Minimal dependencies

Core

Game & Agents

Data Pipeline

Scripts

Value

Constructor Parameters

Attributes

Methods

Supported Operations

Example

GameGPT

Constructor Parameters

Attributes

Methods

forward

train_step

Example

sample

Example

save_weights

Example

load_weights

Example

Complete Usage Example

Architecture Details

Build docs developers (and LLMs) love

Core

Game & Agents

Data Pipeline

Scripts

​Value

​Constructor Parameters

​Attributes

​Methods

​Supported Operations

​Example

​GameGPT

​Constructor Parameters

​Attributes

​Methods

​forward

​train_step

​Example

​sample

​Example

​save_weights

​Example

​load_weights

​Example

​Complete Usage Example

​Architecture Details

Build docs developers (and LLMs) love

Value

Constructor Parameters

Attributes

Methods

Supported Operations

Example

GameGPT

Constructor Parameters

Attributes

Methods

forward

train_step

Example

sample

Example

save_weights

Example

load_weights

Example

Complete Usage Example

Architecture Details