Overview

Game Grammar uses a hybrid snapshot+delta encoding strategy, inspired by video codecs:
  • Snapshots (I-frames): Periodic full state captures for context recovery
  • Deltas (P-frames): High-frequency event tokens capturing state changes
This approach balances:
  • Compression: Delta events are compact
  • Recovery: Snapshots prevent error accumulation
  • Learnability: Transformers can attend to both absolute state and relative changes
From codec.py:21-23, the encoder takes snapshot_interval=16 and salience_threshold=Salience.TICK as parameters.

74-Token Vocabulary

The complete vocabulary from vocab.py covers all Snake gameplay:

Structural Tokens (4)

["BOS", "EOS", "TICK", "SNAP"]
  • BOS: Begin sequence (start of episode)
  • EOS: End sequence (episode terminated)
  • TICK: Time step marker (bundles events)
  • SNAP: Snapshot marker (full state follows)
From vocab.py:6.

Entity Tokens (3)

["PLAYER", "FOOD", "WALL"]
Entity types in Snake. Pac-Man would add "GHOST", "PELLET", "POWER_PELLET", etc. From vocab.py:9.

Direction Tokens (4)

["DIR_U", "DIR_D", "DIR_L", "DIR_R"]
Cardinal directions for snake heading. Used in snapshots. From vocab.py:12.

Input Tokens (4)

["INPUT_U", "INPUT_D", "INPUT_L", "INPUT_R"]
Player input events. Separate from direction because input can be ignored (e.g., reversals). From vocab.py:15.

Position Tokens (20)

# X coordinates (10)
["X0", "X1", "X2", ..., "X9"]

# Y coordinates (10)
["Y0", "Y1", "Y2", ..., "Y9"]
Absolute grid positions for 10x10 Snake. Positions are tokenized as separate X/Y pairs. From vocab.py:18-21.
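The two coordinate lists are regular enough to generate programmatically. A minimal sketch (the `GRID` constant and list names are illustrative, not from vocab.py):

```python
# Sketch: generating the 20 position tokens for the 10x10 grid.
GRID = 10
X_TOKENS = [f"X{i}" for i in range(GRID)]
Y_TOKENS = [f"Y{i}" for i in range(GRID)]

assert X_TOKENS[0] == "X0" and X_TOKENS[-1] == "X9"
assert len(X_TOKENS) + len(Y_TOKENS) == 20
```

Note that the token count scales linearly per axis: a 20x20 grid would need 40 position tokens instead of 20.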

Event Type Tokens (7)

["MOVE", "EAT", "GROW", "DIE_WALL", "DIE_SELF", "FOOD_SPAWN", "SCORE"]
The core event types from Snake gameplay. From vocab.py:24.

Value Tokens (11)

["V0", "V1", "V2", ..., "V10"]
Scores from 0-10. Scores above 10 are clamped to V10. From vocab.py:27.

Length Tokens (21)

["LEN1", "LEN2", "LEN3", ..., "LEN20", "LEN_LONG"]
Snake body length. LEN_LONG for lengths > 20. From vocab.py:30-31.

Total: 74 Tokens

assert VOCAB_SIZE == 74
From vocab.py:37. This is significantly larger than microgpt’s ~27 character vocabulary, but much smaller than language models (50K+ tokens).
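The group sizes listed above do sum to 74, which is what the assertion checks:

```python
# Token-group sizes as listed on this page; their sum matches VOCAB_SIZE.
group_sizes = {
    "structural": 4,   # BOS, EOS, TICK, SNAP
    "entity": 3,       # PLAYER, FOOD, WALL
    "direction": 4,    # DIR_U/D/L/R
    "input": 4,        # INPUT_U/D/L/R
    "position": 20,    # X0-X9, Y0-Y9
    "event": 7,        # MOVE, EAT, GROW, DIE_WALL, DIE_SELF, FOOD_SPAWN, SCORE
    "value": 11,       # V0-V10
    "length": 21,      # LEN1-LEN20, LEN_LONG
}
assert sum(group_sizes.values()) == 74
```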

Encoding Strategy

The EventCodec class in codec.py implements the hybrid approach.

Snapshot Encoding

A snapshot captures full state at a point in time:
def encode_snapshot(self, state: SnakeState) -> list[int]:
    length = len(state.body)
    len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
    score_tok = f"V{min(state.score, 10)}"
    tokens = [
        "SNAP", "PLAYER",
        f"X{state.head[0]}", f"Y{state.head[1]}",
        _DIR_TOKEN[state.direction],
        len_tok,
        "FOOD",
        f"X{state.food[0]}", f"Y{state.food[1]}",
        "SCORE", score_tok,
    ]
    return [VOCAB[t] for t in tokens]
From codec.py:25-38. Example snapshot:
SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0
This is 11 tokens encoding:
  • Player position (X5, Y5)
  • Player direction (DIR_R)
  • Snake length (LEN1)
  • Food position (X8, Y6)
  • Current score (V0)
Snapshots don’t include the full body — only head position and length. This is a design choice to reduce token count. The body can be reconstructed by tracking deltas.
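To make the example above concrete, here is a self-contained sketch that reproduces the snapshot logic at the token-name level (the `SnakeState` field layout and `_DIR_TOKEN` mapping are assumptions mirroring what `encode_snapshot` reads, not verbatim from the repo):

```python
from dataclasses import dataclass

@dataclass
class SnakeState:  # assumed shape, mirroring the fields encode_snapshot accesses
    head: tuple[int, int]
    body: list[tuple[int, int]]
    direction: str
    food: tuple[int, int]
    score: int

# Assumed direction-to-token mapping.
_DIR_TOKEN = {"U": "DIR_U", "D": "DIR_D", "L": "DIR_L", "R": "DIR_R"}

def snapshot_tokens(state: SnakeState) -> list[str]:
    """Name-level version of encode_snapshot (before VOCAB id lookup)."""
    length = len(state.body)
    len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
    return [
        "SNAP", "PLAYER",
        f"X{state.head[0]}", f"Y{state.head[1]}",
        _DIR_TOKEN[state.direction], len_tok,
        "FOOD", f"X{state.food[0]}", f"Y{state.food[1]}",
        "SCORE", f"V{min(state.score, 10)}",
    ]

# The starting state from the example snapshot above.
state = SnakeState(head=(5, 5), body=[(5, 5)], direction="R", food=(8, 6), score=0)
assert snapshot_tokens(state) == [
    "SNAP", "PLAYER", "X5", "Y5", "DIR_R", "LEN1",
    "FOOD", "X8", "Y6", "SCORE", "V0",
]
```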

Delta Event Encoding

Delta events encode state changes:
def encode_event(self, event: Event) -> list[int]:
    t = event.type
    tokens: list[str] = []
    
    if t.startswith("INPUT_"):
        tokens.append(t)
    elif t == "MOVE":
        x, y = event.payload["pos"]
        tokens.extend(["MOVE", f"X{x}", f"Y{y}"])
    elif t == "EAT":
        tokens.append("EAT")
    elif t == "GROW":
        length = event.payload["length"]
        len_tok = f"LEN{length}" if length <= 20 else "LEN_LONG"
        tokens.extend(["GROW", len_tok])
    # ... etc ...
    
    return [VOCAB[t] for t in tokens]
From codec.py:40-69. Examples:
  • INPUT_R → 1 token
  • MOVE X7 Y8 → 3 tokens
  • EAT → 1 token
  • GROW LEN3 → 2 tokens
  • FOOD_SPAWN X2 Y4 → 3 tokens

Tick Bundling

Events are grouped by tick:
def encode_tick_events(self, events: list[Event]) -> list[int]:
    filtered = [e for e in events if e.salience >= self.salience_threshold]
    if not filtered:
        return []
    tokens = [VOCAB["TICK"]]
    for event in filtered:
        tokens.extend(self.encode_event(event))
    return tokens
From codec.py:71-78. All events with the same tick are encoded after a single TICK token:
TICK INPUT_R MOVE X7 Y8 EAT GROW LEN3 FOOD_SPAWN X2 Y4 SCORE V2

Episode Encoding

Complete episodes are wrapped in BOS and EOS:
def encode_episode(
    self,
    events_by_tick: dict[int, list[Event]],
    states_by_tick: dict[int, SnakeState],
) -> list[int]:
    tokens = [VOCAB["BOS"]]
    
    max_tick = max(states_by_tick.keys()) if states_by_tick else 0
    
    for tick in range(max_tick + 1):
        # Snapshot at tick 0, every snapshot_interval, or on rule-effect events
        need_snapshot = (
            tick == 0
            or (tick % self.snapshot_interval == 0)
            or any(
                e.salience >= Salience.RULE_EFFECT
                for e in events_by_tick.get(tick, [])
            )
        )
        if need_snapshot and tick in states_by_tick:
            tokens.extend(self.encode_snapshot(states_by_tick[tick]))
        
        if tick in events_by_tick:
            tick_tokens = self.encode_tick_events(events_by_tick[tick])
            tokens.extend(tick_tokens)
    
    tokens.append(VOCAB["EOS"])
    return tokens
From codec.py:80-107.
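The snapshot-placement condition can be isolated as a small predicate. A sketch, assuming `Salience` is an ordered enum with `TICK < RULE_EFFECT` (the numeric values here are illustrative, not from the repo):

```python
from enum import IntEnum

class Salience(IntEnum):  # assumed ordering; actual values may differ
    TICK = 1
    RULE_EFFECT = 2

def needs_snapshot(tick: int, interval: int, saliences: list[Salience]) -> bool:
    """Mirror of the condition in encode_episode: snapshot at tick 0,
    every `interval` ticks, or when any event reaches RULE_EFFECT salience."""
    return (tick % interval == 0) or any(
        s >= Salience.RULE_EFFECT for s in saliences
    )

assert needs_snapshot(0, 16, [])                        # tick 0
assert needs_snapshot(32, 16, [])                       # interval boundary
assert not needs_snapshot(5, 16, [Salience.TICK])       # ordinary tick
assert needs_snapshot(5, 16, [Salience.RULE_EFFECT])    # e.g. EAT or DIE_*
```

(`tick == 0` in the original code is subsumed here by `tick % interval == 0`, since `0 % interval == 0`.)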

Why Keyframe + Delta?

This hybrid approach mirrors video codec design (I-frames + P-frames):
Delta events are compact. MOVE X7 Y8 is 3 tokens, while a full state snapshot is 11 tokens. Pure delta encoding would be even more compact, but has problems (see next).
Pure delta encoding accumulates errors. If the model predicts one wrong move, all future positions are offset. Periodic snapshots reset the state, allowing the model to recover from mistakes.
Snapshots give the transformer absolute position context. Without them, the model must integrate all deltas from the start of the sequence. With a 64-token context window, the model can only “see” ~4-8 game ticks back. Snapshots provide grounding.
Some events trigger snapshots immediately. From codec.py:92-97, any event with salience >= Salience.RULE_EFFECT forces a snapshot. This ensures the model sees the full state after important transitions (eating food, dying, etc.).
Snapshot interval is configurable (snapshot_interval=16 in the current implementation). More snapshots = more context, fewer snapshots = better compression.
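The trade-off can be quantified with a rough cost model. Assuming a typical quiet tick costs 5 tokens (TICK INPUT_x MOVE Xn Yn) and a snapshot costs 11, and ignoring rule-effect snapshots and multi-event ticks:

```python
def avg_tokens_per_tick(interval: int, tick_tokens: int = 5, snap_tokens: int = 11) -> float:
    """Rough cost model: every tick pays tick_tokens, plus one
    11-token snapshot amortized over `interval` ticks."""
    return tick_tokens + snap_tokens / interval

assert avg_tokens_per_tick(16) == 5.6875   # the default
assert avg_tokens_per_tick(8) == 6.375     # denser snapshots, more tokens
assert avg_tokens_per_tick(32) == 5.34375  # sparser snapshots, fewer tokens
```

At the default interval of 16, snapshots add under 14% overhead over pure deltas while still bounding how far error can accumulate.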

Real Encoded Sequence Example

From the README, a real Snake game encodes as:
BOS SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0
  TICK INPUT_L MOVE X4 Y5
  TICK INPUT_D MOVE X4 Y6
  TICK INPUT_D MOVE X4 Y7
  TICK INPUT_R MOVE X5 Y7
  TICK INPUT_R MOVE X6 Y7
  TICK INPUT_R MOVE X7 Y7
  TICK INPUT_R MOVE X8 Y7
  TICK INPUT_U MOVE X8 Y6 EAT GROW LEN2 FOOD_SPAWN X3 Y2 SCORE V1
  SNAP PLAYER X8 Y6 DIR_U LEN2 FOOD X3 Y2 SCORE V1
  TICK INPUT_U MOVE X8 Y5
  ...
  TICK INPUT_R MOVE X10 Y5 DIE_WALL
EOS
Note:
  • Initial snapshot at tick 0
  • Delta events for each move
  • Snapshot after eating (rule effect)
  • Death event followed by EOS

Decoding

The codec includes a decoder for validation:
def decode(self, tokens: list[int]) -> list[dict]:
    """Decode token sequence into a list of parsed records for validation."""
    names = [ID_TO_TOKEN[t] for t in tokens]
    records: list[dict] = []
    i = 0
    while i < len(names):
        tok = names[i]
        if tok == "BOS":
            records.append({"type": "BOS"})
            i += 1
        elif tok == "SNAP":
            # Parse 11-token snapshot
            rec = {"type": "SNAP"}
            if i + 10 < len(names):
                rec["player_x"] = names[i + 2]
                rec["player_y"] = names[i + 3]
                # ... etc ...
            records.append(rec)
            i += 11
        elif tok == "MOVE":
            rec = {"type": "MOVE"}
            if i + 2 < len(names):
                rec["x"] = names[i + 1]
                rec["y"] = names[i + 2]
            records.append(rec)
            i += 3
        # ... etc ...
    return records
From codec.py:109-179. The decoder parses multi-token events back into structured records for the validator to check.
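A simplified name-level version of the same parsing loop, covering just the MOVE branch plus a catch-all fallback (the fallback is an assumption; the real decoder handles every token type explicitly):

```python
def decode_names(names: list[str]) -> list[dict]:
    """Sketch of the decoder's while-loop at the token-name level."""
    records: list[dict] = []
    i = 0
    while i < len(names):
        tok = names[i]
        if tok == "MOVE" and i + 2 < len(names):
            # MOVE consumes 3 tokens: MOVE Xn Yn
            records.append({"type": "MOVE", "x": names[i + 1], "y": names[i + 2]})
            i += 3
        else:
            # Fallback: treat anything else as a 1-token record.
            records.append({"type": tok})
            i += 1
    return records

recs = decode_names(["TICK", "INPUT_R", "MOVE", "X7", "Y8", "EAT"])
assert recs[2] == {"type": "MOVE", "x": "X7", "y": "Y8"}
assert [r["type"] for r in recs] == ["TICK", "INPUT_R", "MOVE", "EAT"]
```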

Vocabulary Design Trade-offs

Pros:
  • Absolute positions as single tokens (X7, Y8)
  • Clear event structure
  • Human-readable sequences
  • Small vocabulary size
Cons:
  • Grid size is baked in (10x10)
  • Scaling to larger grids requires more tokens
  • Each game needs a custom vocabulary
From the README: “The most valuable contributions right now are feedback on the tokenization approaches — this is the core unsolved problem.”

Hyperparameters

From codec.py:21-23:
class EventCodec:
    def __init__(self, snapshot_interval=16, salience_threshold=Salience.TICK):
        self.snapshot_interval = snapshot_interval
        self.salience_threshold = salience_threshold
  • snapshot_interval=16: Insert snapshot every 16 ticks (+ on rule effects)
  • salience_threshold=Salience.TICK: Include all events (no filtering)
These are tunable. Higher snapshot_interval = more compression, less context.

Next Steps

Event Streams

See how games produce events

Transformer Architecture

Learn how the model processes tokens
