Overview

Game Grammar uses a hybrid snapshot+delta encoding strategy, inspired by video codecs:
  • Snapshots (I-frames): Periodic full state captures for context recovery
  • Deltas (P-frames): High-frequency event tokens capturing state changes
This approach balances:
  • Compression: Delta events are compact
  • Recovery: Snapshots prevent error accumulation
  • Learnability: Transformers can attend to both absolute state and relative changes
From codec.py:21-23, the encoder takes snapshot_interval=16 and salience_threshold=Salience.TICK as parameters.

74-Token Vocabulary

The complete vocabulary from vocab.py covers all Snake gameplay:

Structural Tokens (4)

["BOS", "EOS", "TICK", "SNAP"]
  • BOS: Begin sequence (start of episode)
  • EOS: End sequence (episode terminated)
  • TICK: Time step marker (bundles events)
  • SNAP: Snapshot marker (full state follows)
From vocab.py:6.

Entity Tokens (3)

["PLAYER", "FOOD", "WALL"]
Entity types in Snake. Pac-Man would add "GHOST", "PELLET", "POWER_PELLET", etc. From vocab.py:9.

Direction Tokens (4)

["DIR_U", "DIR_D", "DIR_L", "DIR_R"]
Cardinal directions for snake heading. Used in snapshots. From vocab.py:12.

Input Tokens (4)

["INPUT_U", "INPUT_D", "INPUT_L", "INPUT_R"]
Player input events. Separate from direction because input can be ignored (e.g., reversals). From vocab.py:15.

Position Tokens (20)

# X coordinates (10)
["X0", "X1", "X2", ..., "X9"]

# Y coordinates (10)
["Y0", "Y1", "Y2", ..., "Y9"]
Absolute grid positions for 10x10 Snake. Positions are tokenized as separate X/Y pairs. From vocab.py:18-21.
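The two coordinate lists are regular enough to generate programmatically. A minimal sketch (the `GRID` constant and list names are illustrative, not from vocab.py):

```python
# Sketch: generating the 20 position tokens for the 10x10 grid.
GRID = 10
X_TOKENS = [f"X{i}" for i in range(GRID)]
Y_TOKENS = [f"Y{i}" for i in range(GRID)]

assert X_TOKENS[0] == "X0" and X_TOKENS[-1] == "X9"
assert len(X_TOKENS) + len(Y_TOKENS) == 20
```

Note that the token count scales linearly per axis: a 20x20 grid would need 40 position tokens instead of 20.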

Event Type Tokens (7)

["MOVE", "EAT", "GROW", "DIE_WALL", "DIE_SELF", "FOOD_SPAWN", "SCORE"]
The core event types from Snake gameplay. From vocab.py:24.

Value Tokens (11)

["V0", "V1", "V2", ..., "V10"]
Scores from 0-10. Scores above 10 are clamped to V10. From vocab.py:27.

Length Tokens (21)

["LEN1", "LEN2", "LEN3", ..., "LEN20", "LEN_LONG"]
Snake body length. LEN_LONG for lengths > 20. From vocab.py:30-31.

Total: 74 Tokens

assert VOCAB_SIZE == 74
From vocab.py:37. This is significantly larger than microgpt’s ~27 character vocabulary, but much smaller than language models (50K+ tokens).
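The group sizes listed above do sum to 74, which is what the assertion checks:

```python
# Token-group sizes as listed on this page; their sum matches VOCAB_SIZE.
group_sizes = {
    "structural": 4,   # BOS, EOS, TICK, SNAP
    "entity": 3,       # PLAYER, FOOD, WALL
    "direction": 4,    # DIR_U/D/L/R
    "input": 4,        # INPUT_U/D/L/R
    "position": 20,    # X0-X9, Y0-Y9
    "event": 7,        # MOVE, EAT, GROW, DIE_WALL, DIE_SELF, FOOD_SPAWN, SCORE
    "value": 11,       # V0-V10
    "length": 21,      # LEN1-LEN20, LEN_LONG
}
assert sum(group_sizes.values()) == 74
```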

Encoding Strategy

The EventCodec class in codec.py implements the hybrid approach.

Snapshot Encoding

A snapshot captures full state at a point in time:
def encode_snapshot(self, state: SnakeState) -> list[int]:
    length = len(state.body)
    len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
    score_tok = f"V{min(state.score, 10)}"
    tokens = [
        "SNAP", "PLAYER",
        f"X{state.head[0]}", f"Y{state.head[1]}",
        _DIR_TOKEN[state.direction],
        len_tok,
        "FOOD",
        f"X{state.food[0]}", f"Y{state.food[1]}",
        "SCORE", score_tok,
    ]
    return [VOCAB[t] for t in tokens]
From codec.py:25-38. Example snapshot:
SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0
This is 11 tokens encoding:
  • Player position (X5, Y5)
  • Player direction (DIR_R)
  • Snake length (LEN1)
  • Food position (X8, Y6)
  • Current score (V0)
Snapshots don’t include the full body — only head position and length. This is a design choice to reduce token count. The body can be reconstructed by tracking deltas.
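To make the example above concrete, here is a self-contained sketch that reproduces the snapshot logic at the token-name level (the `SnakeState` field layout and `_DIR_TOKEN` mapping are assumptions mirroring what `encode_snapshot` reads, not verbatim from the repo):

```python
from dataclasses import dataclass

@dataclass
class SnakeState:  # assumed shape, mirroring the fields encode_snapshot accesses
    head: tuple[int, int]
    body: list[tuple[int, int]]
    direction: str
    food: tuple[int, int]
    score: int

# Assumed direction-to-token mapping.
_DIR_TOKEN = {"U": "DIR_U", "D": "DIR_D", "L": "DIR_L", "R": "DIR_R"}

def snapshot_tokens(state: SnakeState) -> list[str]:
    """Name-level version of encode_snapshot (before VOCAB id lookup)."""
    length = len(state.body)
    len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
    return [
        "SNAP", "PLAYER",
        f"X{state.head[0]}", f"Y{state.head[1]}",
        _DIR_TOKEN[state.direction], len_tok,
        "FOOD", f"X{state.food[0]}", f"Y{state.food[1]}",
        "SCORE", f"V{min(state.score, 10)}",
    ]

# The starting state from the example snapshot above.
state = SnakeState(head=(5, 5), body=[(5, 5)], direction="R", food=(8, 6), score=0)
assert snapshot_tokens(state) == [
    "SNAP", "PLAYER", "X5", "Y5", "DIR_R", "LEN1",
    "FOOD", "X8", "Y6", "SCORE", "V0",
]
```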

Delta Event Encoding

Delta events encode state changes:
def encode_event(self, event: Event) -> list[int]:
    t = event.type
    tokens: list[str] = []
    
    if t.startswith("INPUT_"):
        tokens.append(t)
    elif t == "MOVE":
        x, y = event.payload["pos"]
        tokens.extend(["MOVE", f"X{x}", f"Y{y}"])
    elif t == "EAT":
        tokens.append("EAT")
    elif t == "GROW":
        length = event.payload["length"]
        len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
        tokens.extend(["GROW", len_tok])
    # ... etc ...
    
    return [VOCAB[t] for t in tokens]
From codec.py:40-69. Examples:
  • INPUT_R → 1 token
  • MOVE X7 Y8 → 3 tokens
  • EAT → 1 token
  • GROW LEN3 → 2 tokens
  • FOOD_SPAWN X2 Y4 → 3 tokens

Tick Bundling

Events are grouped by tick:
def encode_tick_events(self, events: list[Event]) -> list[int]:
    filtered = [e for e in events if e.salience >= self.salience_threshold]
    if not filtered:
        return []
    tokens = [VOCAB["TICK"]]
    for event in filtered:
        tokens.extend(self.encode_event(event))
    return tokens
From codec.py:71-78. All events with the same tick are encoded after a single TICK token:
TICK INPUT_R MOVE X7 Y8 EAT GROW LEN3 FOOD_SPAWN X2 Y4 SCORE V2

Episode Encoding

Complete episodes are wrapped in BOS and EOS:
def encode_episode(
    self,
    events_by_tick: dict[int, list[Event]],
    states_by_tick: dict[int, SnakeState],
) -> list[int]:
    tokens = [VOCAB["BOS"]]
    
    max_tick = max(states_by_tick.keys()) if states_by_tick else 0
    
    for tick in range(max_tick + 1):
        # Snapshot at tick 0, every snapshot_interval, or on rule-effect events
        need_snapshot = (
            tick == 0
            or (tick % self.snapshot_interval == 0)
            or any(
                e.salience >= Salience.RULE_EFFECT
                for e in events_by_tick.get(tick, [])
            )
        )
        if need_snapshot and tick in states_by_tick:
            tokens.extend(self.encode_snapshot(states_by_tick[tick]))
        
        if tick in events_by_tick:
            tick_tokens = self.encode_tick_events(events_by_tick[tick])
            tokens.extend(tick_tokens)
    
    tokens.append(VOCAB["EOS"])
    return tokens
From codec.py:80-107.
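The snapshot-placement condition can be isolated as a small predicate. A sketch, assuming `Salience` is an ordered enum with `TICK < RULE_EFFECT` (the numeric values here are illustrative, not from the repo):

```python
from enum import IntEnum

class Salience(IntEnum):  # assumed ordering; actual values may differ
    TICK = 1
    RULE_EFFECT = 2

def needs_snapshot(tick: int, interval: int, saliences: list[Salience]) -> bool:
    """Mirror of the condition in encode_episode: snapshot at tick 0,
    every `interval` ticks, or when any event reaches RULE_EFFECT salience."""
    return (tick % interval == 0) or any(
        s >= Salience.RULE_EFFECT for s in saliences
    )

assert needs_snapshot(0, 16, [])                        # tick 0
assert needs_snapshot(32, 16, [])                       # interval boundary
assert not needs_snapshot(5, 16, [Salience.TICK])       # ordinary tick
assert needs_snapshot(5, 16, [Salience.RULE_EFFECT])    # e.g. EAT or DIE_*
```

(`tick == 0` in the original code is subsumed here by `tick % interval == 0`, since `0 % interval == 0`.)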

Why Keyframe + Delta?

This hybrid approach mirrors video codec design (I-frames + P-frames):
Delta events are compact. MOVE X7 Y8 is 3 tokens, while a full state snapshot is 11 tokens. Pure delta encoding would be even more compact, but has problems (see next).
Pure delta encoding accumulates errors. If the model predicts one wrong move, all future positions are offset. Periodic snapshots reset the state, allowing the model to recover from mistakes.
Snapshots give the transformer absolute position context. Without them, the model must integrate all deltas from the start of the sequence. With a 64-token context window, the model can only “see” ~4-8 game ticks back. Snapshots provide grounding.
Some events trigger snapshots immediately. From codec.py:92-97, any event with salience >= Salience.RULE_EFFECT forces a snapshot. This ensures the model sees the full state after important transitions (eating food, dying, etc.).
Snapshot interval is configurable (snapshot_interval=16 in the current implementation). More snapshots = more context, fewer snapshots = better compression.
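The trade-off can be quantified with a rough cost model. Assuming a typical quiet tick costs 5 tokens (TICK INPUT_x MOVE Xn Yn) and a snapshot costs 11, and ignoring rule-effect snapshots and multi-event ticks:

```python
def avg_tokens_per_tick(interval: int, tick_tokens: int = 5, snap_tokens: int = 11) -> float:
    """Rough cost model: every tick pays tick_tokens, plus one
    11-token snapshot amortized over `interval` ticks."""
    return tick_tokens + snap_tokens / interval

assert avg_tokens_per_tick(16) == 5.6875   # the default
assert avg_tokens_per_tick(8) == 6.375     # denser snapshots, more tokens
assert avg_tokens_per_tick(32) == 5.34375  # sparser snapshots, fewer tokens
```

At the default interval of 16, snapshots add under 14% overhead over pure deltas while still bounding how far error can accumulate.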

Real Encoded Sequence Example

From the README, a real Snake game encodes as:
BOS SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0
  TICK INPUT_L MOVE X4 Y5
  TICK INPUT_D MOVE X4 Y6
  TICK INPUT_D MOVE X4 Y7
  TICK INPUT_R MOVE X5 Y7
  TICK INPUT_R MOVE X6 Y7
  TICK INPUT_R MOVE X7 Y7
  TICK INPUT_R MOVE X8 Y7
  TICK INPUT_U MOVE X8 Y6 EAT GROW LEN2 FOOD_SPAWN X3 Y2 SCORE V1
  SNAP PLAYER X8 Y6 DIR_U LEN2 FOOD X3 Y2 SCORE V1
  TICK INPUT_U MOVE X8 Y5
  ...
  TICK INPUT_R MOVE X10 Y5 DIE_WALL
EOS
Note:
  • Initial snapshot at tick 0
  • Delta events for each move
  • Snapshot after eating (rule effect)
  • Death event followed by EOS

Decoding

The codec includes a decoder for validation:
def decode(self, tokens: list[int]) -> list[dict]:
    """Decode token sequence into a list of parsed records for validation."""
    names = [ID_TO_TOKEN[t] for t in tokens]
    records: list[dict] = []
    i = 0
    while i < len(names):
        tok = names[i]
        if tok == "BOS":
            records.append({"type": "BOS"})
            i += 1
        elif tok == "SNAP":
            # Parse 11-token snapshot
            rec = {"type": "SNAP"}
            if i + 10 < len(names):
                rec["player_x"] = names[i + 2]
                rec["player_y"] = names[i + 3]
                # ... etc ...
            records.append(rec)
            i += 11
        elif tok == "MOVE":
            rec = {"type": "MOVE"}
            if i + 2 < len(names):
                rec["x"] = names[i + 1]
                rec["y"] = names[i + 2]
            records.append(rec)
            i += 3
        # ... etc ...
    return records
From codec.py:109-179. The decoder parses multi-token events back into structured records for the validator to check.
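A simplified name-level version of the same parsing loop, covering just the MOVE branch plus a catch-all fallback (the fallback is an assumption; the real decoder handles every token type explicitly):

```python
def decode_names(names: list[str]) -> list[dict]:
    """Sketch of the decoder's while-loop at the token-name level."""
    records: list[dict] = []
    i = 0
    while i < len(names):
        tok = names[i]
        if tok == "MOVE" and i + 2 < len(names):
            # MOVE consumes 3 tokens: MOVE Xn Yn
            records.append({"type": "MOVE", "x": names[i + 1], "y": names[i + 2]})
            i += 3
        else:
            # Fallback: treat anything else as a 1-token record.
            records.append({"type": tok})
            i += 1
    return records

recs = decode_names(["TICK", "INPUT_R", "MOVE", "X7", "Y8", "EAT"])
assert recs[2] == {"type": "MOVE", "x": "X7", "y": "Y8"}
assert [r["type"] for r in recs] == ["TICK", "INPUT_R", "MOVE", "EAT"]
```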

Vocabulary Design Trade-offs

Pros:
  • Absolute positions as single tokens (X7, Y8)
  • Clear event structure
  • Human-readable sequences
  • Small vocabulary size
Cons:
  • Grid size is baked in (10x10)
  • Scaling to larger grids requires more tokens
  • Each game needs a custom vocabulary
From the README: “The most valuable contributions right now are feedback on the tokenization approaches — this is the core unsolved problem.”

Hyperparameters

From codec.py:21-23:
class EventCodec:
    def __init__(self, snapshot_interval=16, salience_threshold=Salience.TICK):
        self.snapshot_interval = snapshot_interval
        self.salience_threshold = salience_threshold
  • snapshot_interval=16: Insert snapshot every 16 ticks (+ on rule effects)
  • salience_threshold=Salience.TICK: Include all events (no filtering)
These are tunable. Higher snapshot_interval = more compression, less context.

Next Steps

Event Streams

See how games produce events

Transformer Architecture

Learn how the model processes tokens
