Value
Custom autograd value class for automatic differentiation.

Constructor Parameters
- The numerical value
- Tuple of parent Value nodes in the computation graph
- Tuple of local gradients with respect to each child
Attributes
- The actual numerical value
- The accumulated gradient (initialized to 0)
Methods
Compute gradients via backpropagation through the computation graph
Supported Operations
- Arithmetic: +, -, *, /, ** (power)
- Activations: .relu(), .exp(), .log()
- Reverse operations: all operations support reversed operands (e.g., 5 + value)
Example
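The original example code was not preserved here. Below is a minimal, self-contained sketch of the autograd mechanism described above; the constructor signature, internal field names, and operator coverage are assumptions for illustration, not the library's actual code.

```python
class Value:
    """Sketch of the described autograd Value (assumed API)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # the numerical value
        self.grad = 0.0                  # accumulated gradient, initialized to 0
        self._children = children        # parent Value nodes in the graph
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    __radd__ = __add__  # reversed operands, e.g. 5 + value

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __rmul__ = __mul__

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically sort the graph, then propagate gradients backward.
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

# Usage: d = a*b + c, so d(d)/d(a) = b and d(d)/d(b) = a.
a, b, c = Value(2.0), Value(3.0), Value(4.0)
d = a * b + c
d.backward()
print(d.data, a.grad, b.grad)  # 10.0 3.0 2.0
```

The local-gradient tuples stored at construction time make backward a single sweep: each node multiplies its accumulated gradient by each stored local gradient and adds the product into the corresponding child.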
GameGPT
Transformer-based language model for game-grammar token generation.

Constructor Parameters
- vocab_size: Size of the token vocabulary
- n_layer: Number of transformer layers
- n_embd: Embedding dimension size
- block_size: Maximum context length (sequence length)
- n_head: Number of attention heads per layer
- Random seed for weight initialization
The n_embd parameter must be divisible by n_head. The head dimension is automatically calculated as n_embd // n_head.

Attributes
Dictionary containing all model weights:
- wte: Token embeddings (vocab_size × n_embd)
- wpe: Position embeddings (block_size × n_embd)
- lm_head: Output projection (vocab_size × n_embd)
- layer{i}.attn_wq/wk/wv/wo: Attention weights
- layer{i}.mlp_fc1/fc2: MLP weights
Flat list of all trainable parameters as Value objects
Adam optimizer first moment estimates
Adam optimizer second moment estimates
Number of training steps performed
Methods
forward
Run a single forward pass through the model.

Parameters:
- Index of the input token (0 to vocab_size-1)
- Position index in the sequence (0 to block_size-1)
- Cached key vectors from previous tokens, structured as [layer_idx][token_idx][dim]
- Cached value vectors from previous tokens, structured as [layer_idx][token_idx][dim]

Returns: list of logits (one per vocab entry) as Value objects
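The KV-cache mechanism above can be sketched for a single head of a single layer. This is an illustrative stand-in, not the model's actual forward pass: vectors are plain lists in the [token_idx][dim] layout described, and each step appends the new token's key/value before attending over the whole cache.

```python
import math

def attend_with_cache(q, k_cache, v_cache):
    """One causal attention step for the newest token (single head, one layer).
    q is the current position's query; k_cache/v_cache hold one key/value
    vector per token seen so far. Causality holds by construction: the cache
    only ever contains past and current tokens."""
    dim = len(q)
    # Scaled dot-product scores against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in k_cache]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors.
    return [sum(w * v[d] for w, v in zip(weights, v_cache)) for d in range(dim)]

# Usage: append the new token's k/v, then attend over the whole cache.
k_cache, v_cache = [], []
for k, v, q in [([1.0, 0.0], [1.0, 1.0], [1.0, 0.0]),
                ([0.0, 1.0], [2.0, 0.0], [0.0, 1.0])]:
    k_cache.append(k)
    v_cache.append(v)
    out = attend_with_cache(q, k_cache, v_cache)
```

Because keys and values for earlier tokens are reused from the cache, generating each new token costs one query projection plus one pass over the cache instead of reprocessing the entire prefix.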
train_step
Perform one training step with the Adam optimizer.

Parameters:
- List of token IDs for training; must contain at least 2 tokens
- Base learning rate (scheduled with linear decay)
- Adam exponential decay rate for the first moment
- Adam exponential decay rate for the second moment
- Adam epsilon for numerical stability

Returns: average cross-entropy loss for the sequence
The learning rate is automatically scheduled with linear decay:

lr_t = lr * (1 - step_count / max(step_count + 1, 5000))

Example
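The original example was not preserved. The sketch below implements the quoted decay schedule literally, plus a single bias-corrected Adam update for one scalar parameter to illustrate the optimizer described above. The beta and epsilon defaults here are common choices, assumed rather than taken from the library.

```python
def lr_at(step_count, lr, horizon=5000):
    """Linear decay schedule quoted above. The max(...) guard keeps the
    denominator at least `horizon`, so early steps get nearly the full lr."""
    return lr * (1 - step_count / max(step_count + 1, horizon))

def adam_update(p, grad, m, v, t, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    """One bias-corrected Adam step for a single scalar parameter (a sketch;
    hyperparameter defaults are assumptions). t is the 1-based step count.
    Returns the new parameter and updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad         # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v
```

For instance, lr_at(2500, 0.01) is 0.005: halfway through the 5000-step horizon, the schedule has decayed the base rate by half.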
sample
Generate a sequence of tokens using the trained model.

Parameters:
- Token ID for beginning-of-sequence (BOS)
- Token ID for end-of-sequence (EOS); generation stops when this is sampled
- Sampling temperature: lower values (< 1.0) make output more deterministic, higher values (> 1.0) increase randomness
- Maximum sequence length to generate; defaults to the model's block_size

Returns: list of generated token IDs, including the BOS and EOS tokens
Example
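The original example was not preserved. Below is a self-contained sketch of temperature sampling and the autoregressive loop described above; `step_fn` is a hypothetical stand-in for the model's forward pass, and the function names are illustrative, not the library's.

```python
import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample one token: divide logits by temperature, softmax, then draw
    from the resulting distribution (a sketch of the behaviour described)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(probs):
        acc += p
        if r < acc:
            return tok
    return len(probs) - 1  # guard against floating-point rounding

def sample_sequence(step_fn, bos, eos, temperature=1.0, max_len=64, rng=random):
    """Autoregressive loop: feed each sampled token back in until EOS is
    drawn or max_len is reached. step_fn(token, pos) -> logits stands in
    for the model's forward pass with a KV cache."""
    tokens = [bos]
    for pos in range(max_len - 1):
        tok = sample_token(step_fn(tokens[-1], pos), temperature, rng)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens
```

Lower temperatures sharpen the softmax toward the arg-max token; higher temperatures flatten it toward uniform sampling.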
save_weights
Save model weights to a plain text file.

Parameter: file path where weights will be saved

Weights are saved in plain text format with 8 decimal places:

{layer_name}|{row_idx}|{space-separated values}

Example
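The original example was not preserved. The sketch below writes and reads the pipe-delimited text format described above; it is an illustration of the format, not the library's own save_weights/load_weights implementation, and it represents each weight matrix as a plain list of row lists.

```python
import os, tempfile

def save_weights_text(weights, path):
    """Write weights as {layer_name}|{row_idx}|{space-separated values},
    one row per line, values with 8 decimal places."""
    with open(path, "w") as f:
        for name, matrix in weights.items():
            for row_idx, row in enumerate(matrix):
                values = " ".join(f"{v:.8f}" for v in row)
                f.write(f"{name}|{row_idx}|{values}\n")

def load_weights_text(path):
    """Parse the same format back into a {name: [rows]} dictionary."""
    weights = {}
    with open(path) as f:
        for line in f:
            name, row_idx, values = line.rstrip("\n").split("|")
            rows = weights.setdefault(name, [])
            assert int(row_idx) == len(rows)  # rows were written in order
            rows.append([float(v) for v in values.split()])
    return weights

# Round-trip example with exactly representable values:
w = {"wte": [[0.5, -1.0], [2.0, 3.0]]}
path = os.path.join(tempfile.gettempdir(), "gamegpt_weights.txt")
save_weights_text(w, path)
loaded = load_weights_text(path)
os.remove(path)
```

Because each line is plain text, individual rows can be inspected or diffed with ordinary shell tools, which is the stated point of the format.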
load_weights
Load model weights from a plain text file.

Parameter: file path to load weights from
The model must be initialized with the same architecture parameters (vocab_size, n_layer, n_embd, etc.) as when weights were saved.
Example
Complete Usage Example
Architecture Details
The GameGPT model implements a decoder-only transformer with:

- RMSNorm instead of LayerNorm for efficiency
- Causal attention with KV caching for autoregressive generation
- Adam optimizer with bias correction
- Custom autograd via the Value class (no external ML frameworks)
- Plain text weight format for easy inspection and debugging

It is sized for small-scale experimentation:

- Small vocabulary sizes (default 74 tokens)
- Short context windows (default 64 tokens)
- Fast training on CPU
- Minimal dependencies
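The RMSNorm choice in the feature list above can be sketched in a few lines. This version omits the learned gain vector that full implementations usually carry; whether GameGPT includes one is not stated here.

```python
import math

def rmsnorm(x, eps=1e-5):
    """Normalize a vector by its root-mean-square. Unlike LayerNorm, no mean
    is subtracted and no mean statistic is computed, which is the efficiency
    win mentioned above."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

After normalization the output vector has an RMS of approximately 1, regardless of the input's scale.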
