Evaluation

Generated sequences are evaluated using a three-tier validation system that checks structural correctness, physical plausibility, and rule compliance. This approach validates that the model learns game grammar at multiple levels of abstraction.

Validation Tiers

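The validation system implements three tiers, ordered from surface syntax to game semantics. Each tier is computed independently, so a sequence can pass one tier and fail another: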

Tier 1: Structural Validity

Verifies basic token sequence structure and format constraints.

game_grammar/validate.py

from game_grammar.vocab import ID_TO_TOKEN


def check_structural(tokens: list[int]) -> dict[str, bool]:
    """Tier 1: BOS/EOS present, SNAP has correct format, TICK presence reported."""
    names = [ID_TO_TOKEN.get(t, "?") for t in tokens]

    has_bos = len(names) > 0 and names[0] == "BOS"
    has_eos = len(names) > 0 and names[-1] == "EOS"

    # Check SNAP format: SNAP PLAYER X_ Y_ DIR_ LEN_ FOOD X_ Y_ SCORE V_
    snap_ok = True
    for i, tok in enumerate(names):
        if tok == "SNAP":
            if i + 10 >= len(names):
                snap_ok = False
                break
            if names[i + 1] != "PLAYER":
                snap_ok = False
                break
            if not names[i + 2].startswith("X"):
                snap_ok = False
                break
            if not names[i + 3].startswith("Y"):
                snap_ok = False
                break
            if not names[i + 4].startswith("DIR_"):
                snap_ok = False
                break
            if not names[i + 5].startswith("LEN"):
                snap_ok = False
                break
            if names[i + 6] != "FOOD":
                snap_ok = False
                break
            if not names[i + 7].startswith("X"):
                snap_ok = False
                break
            if not names[i + 8].startswith("Y"):
                snap_ok = False
                break
            if names[i + 9] != "SCORE":
                snap_ok = False
                break
            if not names[i + 10].startswith("V"):
                snap_ok = False
                break

    # TICK presence is reported but does not affect structural_pass
    has_tick = "TICK" in names

    return {
        "has_bos": has_bos,
        "has_eos": has_eos,
        "snap_format": snap_ok,
        "has_tick": has_tick,
        "structural_pass": has_bos and has_eos and snap_ok,
    }

Checks:

Sequence starts with BOS token
Sequence ends with EOS token
SNAP tokens follow exact format: SNAP PLAYER X_ Y_ DIR_ LEN_ FOOD X_ Y_ SCORE V_
At least one TICK token is present (reported as has_tick; it does not affect structural_pass)

Structural validity is the lowest bar: it verifies that the model learned the basic “syntax” of the token sequence format.
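
The SNAP layout is a fixed positional template, so the same check can be written as data instead of a chain of conditionals. The sketch below is one possible restatement that works on token names. `SNAP_TEMPLATE` and `snap_matches` are hypothetical names, not part of game_grammar; the exact and prefix tests are meant to mirror the comparisons in `check_structural`.

```python
# Hypothetical table-driven restatement of the SNAP format check.
# Each slot is (kind, text): "exact" must equal the token, "prefix" must start it.
SNAP_TEMPLATE = [
    ("exact", "SNAP"), ("exact", "PLAYER"),
    ("prefix", "X"), ("prefix", "Y"), ("prefix", "DIR_"), ("prefix", "LEN"),
    ("exact", "FOOD"), ("prefix", "X"), ("prefix", "Y"),
    ("exact", "SCORE"), ("prefix", "V"),
]


def snap_matches(names: list[str], start: int) -> bool:
    """Return True if names[start:] begins with a well-formed SNAP block."""
    window = names[start:start + len(SNAP_TEMPLATE)]
    if len(window) < len(SNAP_TEMPLATE):
        return False  # sequence ended before the block was complete
    for (kind, text), tok in zip(SNAP_TEMPLATE, window):
        ok = tok == text if kind == "exact" else tok.startswith(text)
        if not ok:
            return False
    return True


good = "BOS SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0 TICK".split()
bad = "BOS SNAP PLAYER X5 Y5 DIR_R FOOD X8 Y6 SCORE V0 TICK".split()  # LEN_ slot missing
print(snap_matches(good, 1), snap_matches(bad, 1))
```

Keeping the template in one list means the format comment and the check cannot drift apart, which is the main reason to prefer this shape.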

Tier 2: Physical Validity

Verifies that gameplay obeys basic physics constraints.

game_grammar/validate.py

def check_physical(tokens: list[int]) -> dict[str, bool]:
    """Tier 2: positions in bounds, consecutive MOVEs differ by 1 cell."""
    names = [ID_TO_TOKEN.get(t, "?") for t in tokens]

    positions_ok = True
    for tok in names:
        # Coordinate tokens look like X3 / Y7; "X?" and "Y?" are placeholders.
        # Parse every digit after the prefix so that a multi-digit value such
        # as X12 is flagged as out of bounds instead of being skipped.
        if tok[:1] in ("X", "Y") and tok[1:].isdigit():
            if not 0 <= int(tok[1:]) <= 9:
                positions_ok = False

    # Check consecutive MOVEs differ by 1 cell (Manhattan distance)
    moves_ok = True
    last_move_pos = None
    for i, tok in enumerate(names):
        if tok == "MOVE" and i + 2 < len(names):
            x_tok, y_tok = names[i + 1], names[i + 2]
            if x_tok.startswith("X") and y_tok.startswith("Y") and len(x_tok) == 2 and len(y_tok) == 2:
                try:
                    x, y = int(x_tok[1]), int(y_tok[1])
                    if last_move_pos is not None:
                        lx, ly = last_move_pos
                        dist = abs(x - lx) + abs(y - ly)
                        if dist != 1:
                            moves_ok = False
                    last_move_pos = (x, y)
                except ValueError:
                    pass

    return {
        "positions_in_bounds": positions_ok,
        "moves_adjacent": moves_ok,
        "physical_pass": positions_ok and moves_ok,
    }

Checks:

All X/Y coordinates are in bounds (0-9 for 10×10 grid)
Consecutive MOVE events differ by exactly 1 cell (Manhattan distance)
No teleportation or invalid spatial transitions

Physical validity verifies the model learned that movement is constrained to adjacent cells — a core physics rule of Snake.
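
To make the adjacency rule concrete, the following sketch extracts the MOVE positions from a list of token names and reports where the first illegal jump occurs. `move_positions` and `first_jump` are hypothetical helpers written for this illustration, not functions from game_grammar, and unlike `check_physical` they accept multi-digit coordinates.

```python
# Hypothetical helpers illustrating the Tier 2 adjacency rule on token names.
def move_positions(names: list[str]) -> list[tuple[int, int]]:
    """Collect (x, y) from every 'MOVE X<n> Y<n>' triple in the sequence."""
    out = []
    for i, tok in enumerate(names[:-2]):
        x_tok, y_tok = names[i + 1], names[i + 2]
        if (tok == "MOVE" and x_tok[:1] == "X" and y_tok[:1] == "Y"
                and x_tok[1:].isdigit() and y_tok[1:].isdigit()):
            out.append((int(x_tok[1:]), int(y_tok[1:])))
    return out


def first_jump(positions: list[tuple[int, int]]):
    """Return the index of the first step whose Manhattan distance is not 1, or None."""
    for k in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[k - 1], positions[k]
        if abs(x1 - x0) + abs(y1 - y0) != 1:
            return k
    return None


legal = "TICK INPUT_L MOVE X4 Y5 TICK INPUT_D MOVE X4 Y6 TICK INPUT_D MOVE X4 Y7".split()
teleport = "TICK INPUT_L MOVE X4 Y5 TICK INPUT_D MOVE X4 Y6 TICK INPUT_D MOVE X7 Y6".split()
print(first_jump(move_positions(legal)))     # no illegal step
print(first_jump(move_positions(teleport)))  # third MOVE jumps three cells
```

Returning the index of the offending step, rather than a single boolean, makes it easier to inspect where a generated sequence goes wrong.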

Tier 3: Rule Validity

Verifies that gameplay obeys game-specific rules and causal relationships.

game_grammar/validate.py

def check_rules(tokens: list[int]) -> dict[str, bool]:
    """Tier 3: EAT→GROW+FOOD_SPAWN within the same tick, DIE→EOS (game ends)."""
    names = [ID_TO_TOKEN.get(t, "?") for t in tokens]

    # EAT should be followed (eventually, before next TICK) by GROW and FOOD_SPAWN
    eat_grow_ok = True
    i = 0
    while i < len(names):
        if names[i] == "EAT":
            # Scan until next TICK or EOS
            rest = names[i+1:]
            tick_idx = len(rest)
            for j, t in enumerate(rest):
                if t in ("TICK", "EOS"):
                    tick_idx = j
                    break
            segment = rest[:tick_idx]
            if "GROW" not in segment or "FOOD_SPAWN" not in segment:
                eat_grow_ok = False
                break
        i += 1

    # DIE → only SCORE, its value token, and EOS may follow (no events after death)
    die_eos_ok = True
    allowed_after_death = {"SCORE", "EOS"} | {f"V{n}" for n in range(11)}
    for i, tok in enumerate(names):
        if tok in ("DIE_WALL", "DIE_SELF"):
            if any(t not in allowed_after_death for t in names[i + 1:]):
                die_eos_ok = False
            # A later death token would already have failed the check above
            break

    return {
        "eat_triggers_grow": eat_grow_ok,
        "die_ends_game": die_eos_ok,
        "rule_pass": eat_grow_ok and die_eos_ok,
    }

Checks:

EAT event triggers both GROW and FOOD_SPAWN within the same tick
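DIE_WALL or DIE_SELF events are followed only by SCORE, a score value token (V0 to V10), and EOS (game ends)
No events occur after death (terminal state enforcement)

Rule validity is the highest bar: it verifies that the model learned causal dependencies between events (“eating causes growth and food respawn”). A sequence that contains no EAT or death event passes this tier trivially, because neither check has anything to test.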
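
The EAT rule is scoped to a single tick, which is easier to see if the sequence is first split at TICK boundaries. The sketch below does that on token names. `tick_segments` and `eat_is_complete` are hypothetical helpers, the argument tokens after FOOD_SPAWN in the examples are an assumed layout, and the check is simplified to at most one EAT per tick.

```python
# Hypothetical helpers illustrating the per-tick scope of the EAT rule.
def tick_segments(names: list[str]) -> list[list[str]]:
    """Split a sequence into per-tick event lists (tokens after each TICK, up to EOS)."""
    segments, current = [], None
    for tok in names:
        if tok == "TICK":
            if current is not None:
                segments.append(current)
            current = []
        elif tok == "EOS":
            break
        elif current is not None:
            current.append(tok)
    if current is not None:
        segments.append(current)
    return segments


def eat_is_complete(segment: list[str]) -> bool:
    """An EAT must be followed by GROW and FOOD_SPAWN later in the same tick."""
    if "EAT" not in segment:
        return True  # nothing to check, so the segment passes trivially
    after = segment[segment.index("EAT") + 1:]
    return "GROW" in after and "FOOD_SPAWN" in after


ok_seq = "BOS TICK INPUT_R MOVE X5 Y5 EAT GROW FOOD_SPAWN X2 Y2 TICK INPUT_R MOVE X6 Y5 EOS".split()
late_spawn = "BOS TICK MOVE X5 Y5 EAT GROW TICK MOVE X6 Y5 FOOD_SPAWN X2 Y2 EOS".split()
print(all(eat_is_complete(s) for s in tick_segments(ok_seq)))
print(all(eat_is_complete(s) for s in tick_segments(late_spawn)))  # FOOD_SPAWN arrives one tick late
```

The second example fails because FOOD_SPAWN appears only after the next TICK, which is the kind of broken causal link this tier is designed to catch.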

Validation Results

Typical results after training (200 episodes, 5000 steps, 20 samples):

Tier	Metric	Result
Tier 1	Structural validity	45%
Tier 2	Physical validity	95%
Tier 3	Rule validity	100%
All	Full validity (all tiers)	45%

Structural validity is low (45%) because the model often exceeds the 64-token context window without generating EOS. The sequences contain valid gameplay that runs longer than the buffer allows. This is a context limit issue, not a grammar learning failure.

Key insight: The model learns physical constraints (95%) and rule dependencies (100%) more reliably than structural boundaries.
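
The truncation explanation can be checked per sample by separating sequences that hit the context limit from sequences that simply lack EOS. The sketch below assumes that a sample which reaches block_size tokens without EOS was cut off by the context limit. `ending_status` and its labels are hypothetical and not part of game_grammar.

```python
from collections import Counter

BLOCK_SIZE = 64  # same value as block_size in the sampling script


def ending_status(names: list[str], block_size: int = BLOCK_SIZE) -> str:
    """Classify how a sequence ends; the labels are illustrative only."""
    if names and names[-1] == "EOS":
        return "ok"
    if len(names) >= block_size:
        return "truncated"    # filled the context window before emitting EOS
    return "missing_eos"      # stopped early without an EOS token


# Made-up sequences, for illustration only.
finished = ["BOS", "TICK", "EOS"]
cut_off = ["BOS"] + ["TICK"] * 63
stopped = ["BOS", "TICK"]

tally = Counter(ending_status(s) for s in (finished, cut_off, stopped))
print(dict(tally))
```

If nearly all structural failures fall into the truncated bucket, that supports reading the low Tier 1 rate as a context limit effect rather than a grammar failure.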

Sampling and Validation

The scripts/sample.py script generates sequences and validates them:

scripts/sample.py

from game_grammar.model import GameGPT
from game_grammar.vocab import VOCAB, VOCAB_SIZE, ID_TO_TOKEN
from game_grammar.validate import validity_rate, check_structural, check_physical, check_rules

model = GameGPT(
    vocab_size=VOCAB_SIZE,
    n_layer=2,
    n_embd=32,
    block_size=64,
    n_head=4,
    seed=42,
)
# weights_path: path to the trained weights file (not shown in this excerpt)
model.load_weights(weights_path)

bos_id = VOCAB["BOS"]
eos_id = VOCAB["EOS"]

samples = []
for i in range(20):
    tokens = model.sample(bos_id, eos_id, temperature=0.5)
    samples.append(tokens)
    names = [ID_TO_TOKEN[t] for t in tokens]
    label = " ".join(names[:40])
    if len(names) > 40:
        label += " ..."
    s_pass = "S" if check_structural(tokens)["structural_pass"] else "-"
    p_pass = "P" if check_physical(tokens)["physical_pass"] else "-"
    r_pass = "R" if check_rules(tokens)["rule_pass"] else "-"
    print(f"[{s_pass}{p_pass}{r_pass}] sample {i+1:2d} ({len(tokens):3d} tok): {label}")

print("\n--- Validity rates ---")
rates = validity_rate(samples)
for tier, rate in rates.items():
    print(f"  {tier:12s}: {rate:.0%}")

Sample Output

[SPR] sample  1 ( 64 tok): BOS SNAP PLAYER X5 Y5 DIR_R LEN1 FOOD X8 Y6 SCORE V0 TICK INPUT_L MOVE X4 Y5 TICK INPUT_D MOVE X4 Y6 TICK INPUT_D MOVE X4 Y7 ...
[SPR] sample  2 ( 64 tok): BOS SNAP PLAYER X3 Y3 DIR_U LEN1 FOOD X7 Y2 SCORE V0 TICK INPUT_R MOVE X4 Y3 TICK INPUT_U MOVE X4 Y2 TICK INPUT_R MOVE X5 Y2 ...
[-PR] sample  3 ( 64 tok): BOS SNAP PLAYER X2 Y8 DIR_D LEN1 FOOD X6 Y9 SCORE V0 TICK INPUT_R MOVE X3 Y8 TICK INPUT_D MOVE X3 Y9 TICK INPUT_R MOVE X4 Y9 ...
[SPR] sample  4 ( 64 tok): BOS SNAP PLAYER X7 Y4 DIR_L LEN1 FOOD X2 Y5 SCORE V0 TICK INPUT_D MOVE X7 Y5 TICK INPUT_L MOVE X6 Y5 TICK INPUT_L MOVE X5 Y5 ...
...

--- Validity rates ---
  structural  : 45%
  physical    : 95%
  rules       : 100%
  full        : 45%

Legend: [SPR] = Structural + Physical + Rules valid, [-PR] = Physical + Rules valid only

validity_rate() Function

Computes aggregate pass rates across a batch of samples:

game_grammar/validate.py

def validity_rate(samples: list[list[int]]) -> dict[str, float]:
    """Compute pass rates per tier across a batch of samples."""
    n = len(samples)
    if n == 0:
        return {}

    # Run each tier once per sample and reuse the results for the "full" rate.
    flags = [
        (
            check_structural(s)["structural_pass"],
            check_physical(s)["physical_pass"],
            check_rules(s)["rule_pass"],
        )
        for s in samples
    ]

    return {
        "structural": sum(f[0] for f in flags) / n,
        "physical": sum(f[1] for f in flags) / n,
        "rules": sum(f[2] for f in flags) / n,
        "full": sum(all(f) for f in flags) / n,
    }

Returns: Dictionary mapping each tier (structural, physical, rules, full) to the fraction of samples that pass it, or an empty dictionary if samples is empty

The three-tier system isolates different failure modes. A sequence can pass physical checks but fail rules (learned movement but not causality) or pass rules but fail structural checks (learned grammar but not sequence boundaries).
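
One way to see these failure modes side by side is to tally the pass/fail pattern of each sample instead of only the per-tier rates. The sketch below is a minimal illustration: `flag_string` is a hypothetical helper that reuses the S/P/R lettering of the sample script, and the `results` list is made-up data, not measured output.

```python
from collections import Counter


def flag_string(structural: bool, physical: bool, rules: bool) -> str:
    """Format tier results in the style of the sample script: 'SPR', '-PR', ..."""
    letters = zip("SPR", (structural, physical, rules))
    return "".join(letter if ok else "-" for letter, ok in letters)


# Made-up per-sample results (structural, physical, rules), for illustration only.
results = [
    (True, True, True),
    (False, True, True),
    (False, True, True),
    (False, False, True),
]

patterns = Counter(flag_string(*r) for r in results)
for pattern, count in patterns.most_common():
    print(f"[{pattern}] {count}")
```

A pattern tally shows directly whether failures cluster in one tier, such as many [-PR] samples, or co-occur across tiers, which the aggregate rates alone cannot distinguish.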
