Validation

Overview

The validation module provides a three-tier system for checking the validity of generated token sequences:

Structural - Basic syntax and formatting
Physical - Spatial consistency and movement rules
Rules - Game logic and state transitions

Each tier builds on the previous one, creating a hierarchy of validity checks.

Functions

check_structural

def check_structural(tokens: list[int]) -> dict[str, bool]

Tier 1: Check basic structural validity - BOS/EOS markers, TICK boundaries, and SNAP format.

tokens (list[int]): Token IDs to validate.

result (dict[str, bool]): Dictionary with the following keys:

has_bos: Sequence starts with BOS
has_eos: Sequence ends with EOS
snap_format: All SNAP records follow correct 11-token format
has_tick: Sequence contains at least one TICK marker
structural_pass: True if all structural checks pass

SNAP format validation: Each SNAP must follow the exact pattern:

SNAP PLAYER X_ Y_ DIR_ LEN_ FOOD X_ Y_ SCORE V_
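The 11-slot pattern above can be checked slot by slot. The following is a minimal sketch, not the module's implementation: it works on string token names rather than integer IDs, and the exact spellings (for example `DIR_U` for a direction token) are assumptions made for illustration.

```python
import re

# One regex per slot of the 11-token SNAP record.
# Token spellings such as "DIR_U" are assumptions for this sketch.
SNAP_SLOTS = [
    r"SNAP", r"PLAYER", r"X\d+", r"Y\d+", r"DIR_[UDLR]", r"LEN\d+",
    r"FOOD", r"X\d+", r"Y\d+", r"SCORE", r"V\d+",
]


def snap_format_ok(names: list[str]) -> bool:
    """Return True if every SNAP token starts a well-formed 11-token record."""
    for i, name in enumerate(names):
        if name != "SNAP":
            continue
        record = names[i : i + len(SNAP_SLOTS)]
        if len(record) < len(SNAP_SLOTS):
            return False  # record is truncated
        if not all(re.fullmatch(p, t) for p, t in zip(SNAP_SLOTS, record)):
            return False  # some slot holds the wrong kind of token
    return True


good = ["BOS", "SNAP", "PLAYER", "X3", "Y4", "DIR_U", "LEN3",
        "FOOD", "X7", "Y2", "SCORE", "V0", "EOS"]
bad = ["BOS", "SNAP", "PLAYER", "X3", "Y4",
       "FOOD", "X7", "Y2", "SCORE", "V0", "EOS"]  # DIR_ and LEN_ missing

print(snap_format_ok(good))  # True
print(snap_format_ok(bad))   # False
```

A sequence with no SNAP token at all passes this check trivially, since there is no record to violate the pattern.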

Example:

from game_grammar.validate import check_structural

# Valid sequence: BOS, one SNAP record, a TICK, EOS
tokens = [0, 3, 4, 15, 25, 7, 53, 5, 17, 27, 41, 42, 2, 1]  # BOS SNAP ... TICK EOS
result = check_structural(tokens)
print(result)
# {
#     "has_bos": True,
#     "has_eos": True,
#     "snap_format": True,
#     "has_tick": True,
#     "structural_pass": True
# }

# Invalid: missing BOS
tokens = [2, 11, 1]  # TICK INPUT_U EOS
result = check_structural(tokens)
print(result["has_bos"])  # False
print(result["structural_pass"])  # False

check_physical

def check_physical(tokens: list[int]) -> dict[str, bool]

Tier 2: Check physical constraints - positions within bounds and consecutive moves are adjacent.

tokens (list[int]): Token IDs to validate.

result (dict[str, bool]): Dictionary with the following keys:

positions_in_bounds: All X and Y coordinates are 0-9
moves_adjacent: Consecutive MOVE events differ by Manhattan distance of 1
physical_pass: True if all physical checks pass

Movement validation: Consecutive MOVE events must have positions that differ by exactly 1 in Manhattan distance:

dist = abs(x2 - x1) + abs(y2 - y1)
assert dist == 1  # Must move to adjacent cell
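To show how that distance test might be applied across a whole sequence, here is a hedged sketch that collects the position of each MOVE event and compares neighbouring pairs. It operates on string token names (an assumption; the real function takes integer IDs), and `parse_coord` and `moves_adjacent` are hypothetical helpers, not part of `game_grammar`.

```python
from typing import Optional


def parse_coord(name: str, axis: str) -> Optional[int]:
    """Parse a token such as "X3" into 3; return None if it does not match."""
    if name.startswith(axis) and name[len(axis):].isdigit():
        return int(name[len(axis):])
    return None


def moves_adjacent(names: list[str]) -> bool:
    """Return True if consecutive MOVE positions are exactly one cell apart."""
    positions = []
    for i, name in enumerate(names):
        if name == "MOVE" and i + 2 < len(names):
            x = parse_coord(names[i + 1], "X")
            y = parse_coord(names[i + 2], "Y")
            if x is not None and y is not None:
                positions.append((x, y))
    # Compare each position with the one that follows it.
    return all(
        abs(x2 - x1) + abs(y2 - y1) == 1
        for (x1, y1), (x2, y2) in zip(positions, positions[1:])
    )


print(moves_adjacent(["MOVE", "X3", "Y4", "MOVE", "X4", "Y4"]))  # True
print(moves_adjacent(["MOVE", "X3", "Y4", "MOVE", "X5", "Y4"]))  # False
```

Note that a diagonal step such as (3,4) to (4,5) has Manhattan distance 2 and is rejected, as is staying in place (distance 0).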

Example:

from game_grammar.validate import check_physical
from game_grammar.vocab import VOCAB

# Valid: moves from (3,4) to (4,4) to (4,5)
tokens = [
    VOCAB["MOVE"], VOCAB["X3"], VOCAB["Y4"],  # First move
    VOCAB["MOVE"], VOCAB["X4"], VOCAB["Y4"],  # Adjacent (right)
    VOCAB["MOVE"], VOCAB["X4"], VOCAB["Y5"],  # Adjacent (down)
]
result = check_physical(tokens)
print(result["moves_adjacent"])  # True
print(result["physical_pass"])  # True

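# Invalid: jump from (3,4) to (5,4) has Manhattan distance 2
tokens_invalid = [
    VOCAB["MOVE"], VOCAB["X3"], VOCAB["Y4"],
    VOCAB["MOVE"], VOCAB["X5"], VOCAB["Y4"],  # Not adjacent
]
result = check_physical(tokens_invalid)
print(result["moves_adjacent"])  # False
print(result["physical_pass"])  # False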

Position validation only checks tokens in the format X0-X9 and Y0-Y9. It skips malformed tokens.
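One way to read the note above is sketched below. It assumes a token counts as a coordinate only when it is `X` or `Y` followed by digits, and that anything else is skipped rather than treated as a failure. The token spellings, the 10x10 grid size, and the helper name `positions_in_bounds` are illustrative assumptions, not the module's code.

```python
import re

COORD = re.compile(r"[XY](\d+)")  # coordinate tokens such as "X3" or "Y9"
GRID_SIZE = 10                    # assumed 10x10 grid, so valid values are 0-9


def positions_in_bounds(names: list[str], grid_size: int = GRID_SIZE) -> bool:
    """Return True if every well-formed coordinate token lies on the grid."""
    for name in names:
        match = COORD.fullmatch(name)
        if match is None:
            continue  # not a coordinate token (or malformed): skipped
        if not 0 <= int(match.group(1)) < grid_size:
            return False
    return True


print(positions_in_bounds(["MOVE", "X9", "Y9"]))   # True
print(positions_in_bounds(["MOVE", "X12", "Y3"]))  # False
print(positions_in_bounds(["MOVE", "X?", "Y3"]))   # True ("X?" is skipped)
```

The last call illustrates the consequence of skipping: a malformed coordinate does not make the bounds check fail, so it has to be caught by another tier.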

check_rules

def check_rules(tokens: list[int]) -> dict[str, bool]

Tier 3: Check game rule compliance - event causality and terminal states.

tokens (list[int]): Token IDs to validate.

result (dict[str, bool]): Dictionary with the following keys:

eat_triggers_grow: Each EAT is followed by GROW and FOOD_SPAWN before next TICK
die_ends_game: After DIE_WALL or DIE_SELF, only SCORE and EOS remain
rule_pass: True if all rule checks pass

Rule checks:

EAT causality: EAT → GROW + FOOD_SPAWN
  - When the snake eats food, it must grow and food must respawn
  - These events must occur before the next TICK

Death is terminal: DIE_WALL | DIE_SELF → SCORE? → EOS
  - After death, the game ends
  - Only SCORE updates and EOS are allowed after death
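The two rules above can be sketched as simple scans over the sequence. This is a minimal illustration on string token names, not the module's implementation; in particular, treating the `V_` value token after SCORE as allowed is an assumption based on the SNAP pattern, and both function bodies are hypothetical.

```python
DEATHS = {"DIE_WALL", "DIE_SELF"}


def eat_triggers_grow(names: list[str]) -> bool:
    """Every EAT must be followed by GROW and FOOD_SPAWN before the next TICK."""
    for i, name in enumerate(names):
        if name != "EAT":
            continue
        window = []
        for later in names[i + 1:]:
            if later == "TICK":
                break  # the current tick ends here
            window.append(later)
        if "GROW" not in window or "FOOD_SPAWN" not in window:
            return False
    return True


def die_ends_game(names: list[str]) -> bool:
    """After the first death event, allow only SCORE, its value token, and EOS."""
    for i, name in enumerate(names):
        if name in DEATHS:
            return all(
                t in {"SCORE", "EOS"} or (t.startswith("V") and t[1:].isdigit())
                for t in names[i + 1:]
            )
    return True  # no death event, nothing to check


ok = ["BOS", "TICK", "EAT", "GROW", "LEN4", "FOOD_SPAWN", "X7", "Y3", "TICK", "EOS"]
print(eat_triggers_grow(ok))                                   # True
print(eat_triggers_grow(["TICK", "EAT", "TICK"]))              # False
print(die_ends_game(["TICK", "DIE_WALL", "SCORE", "V5", "EOS"]))  # True
print(die_ends_game(["TICK", "DIE_SELF", "TICK", "MOVE"]))     # False
```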

Example:

from game_grammar.validate import check_rules
from game_grammar.vocab import VOCAB

# Valid: EAT followed by GROW and FOOD_SPAWN
tokens = [
    VOCAB["BOS"],
    VOCAB["TICK"],
    VOCAB["EAT"],
    VOCAB["GROW"], VOCAB["LEN4"],
    VOCAB["FOOD_SPAWN"], VOCAB["X7"], VOCAB["Y3"],
    VOCAB["TICK"],  # Next tick
    VOCAB["EOS"],
]
result = check_rules(tokens)
print(result["eat_triggers_grow"])  # True

# Valid: Death ends game
tokens_death = [
    VOCAB["BOS"],
    VOCAB["TICK"],
    VOCAB["DIE_WALL"],
    VOCAB["SCORE"], VOCAB["V5"],
    VOCAB["EOS"],
]
result = check_rules(tokens_death)
print(result["die_ends_game"])  # True
print(result["rule_pass"])  # True

# Invalid: EAT without GROW
tokens_invalid = [
    VOCAB["TICK"],
    VOCAB["EAT"],
    VOCAB["TICK"],  # Missing GROW and FOOD_SPAWN
]
result = check_rules(tokens_invalid)
print(result["eat_triggers_grow"])  # False

validity_rate

def validity_rate(samples: list[list[int]]) -> dict[str, float]

Compute pass rates for all three tiers across a batch of samples.

samples (list[list[int]]): List of token sequences to validate.

rates (dict[str, float]): Dictionary with pass rates (0.0 to 1.0) for:

structural: Tier 1 pass rate
physical: Tier 2 pass rate
rules: Tier 3 pass rate
full: All tiers pass rate

Example:

from game_grammar.validate import validity_rate
from game_grammar.data import collect_episodes
from game_grammar.agents import RandomAgent

# Collect episodes from real gameplay
episodes = collect_episodes(
    n=100,
    agent_mix=[(RandomAgent(), 1.0)],
    seed=42
)

# Check validity
rates = validity_rate(episodes)
print(rates)
# {
#     "structural": 1.0,    # 100% pass
#     "physical": 1.0,      # 100% pass
#     "rules": 1.0,         # 100% pass
#     "full": 1.0           # 100% pass
# }

# Validate model-generated samples
# (your_model.generate_samples is a placeholder for your own sampling code)
from your_model import generate_samples
generated = generate_samples(n=1000)
rates = validity_rate(generated)
print(rates)
# Illustrative output:
# {
#     "structural": 0.92,   # 92% pass tier 1
#     "physical": 0.78,     # 78% pass tier 2
#     "rules": 0.45,        # 45% pass tier 3
#     "full": 0.45          # 45% pass all tiers
# }

The full validity rate is the most important metric: it is the fraction of sequences that pass all three tiers.
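As a rough sketch of how such rates could be aggregated, the function below takes one boolean check per tier and counts passes. It assumes each tier is scored independently per sample and that an empty batch yields 0.0 for every rate; both are assumptions, and the toy checks stand in for `check_structural`, `check_physical`, and `check_rules`.

```python
from typing import Callable

Check = Callable[[list[int]], bool]


def pass_rates(samples: list[list[int]], checks: dict[str, Check]) -> dict[str, float]:
    """Compute the per-tier pass rate plus a "full" rate for passing every tier."""
    names = list(checks)
    n = len(samples)
    if n == 0:
        # Assumed convention for an empty batch: report 0.0 everywhere.
        return {**{k: 0.0 for k in names}, "full": 0.0}

    counts = {k: 0 for k in names}
    full = 0
    for sample in samples:
        results = {k: checks[k](sample) for k in names}
        for k, passed in results.items():
            counts[k] += passed          # True counts as 1
        full += all(results.values())    # passes only if every tier passes

    rates = {k: counts[k] / n for k in names}
    rates["full"] = full / n
    return rates


# Toy stand-ins for the three tier checks (not the real validators).
toy_checks = {
    "structural": lambda s: len(s) >= 2 and s[0] == 0 and s[-1] == 1,
    "physical": lambda s: all(t < 100 for t in s),
    "rules": lambda s: len(s) >= 2,
}
samples = [[0, 5, 1], [0, 200, 1], [5, 1], [0, 1]]
rates = pass_rates(samples, toy_checks)
print(rates)
# {'structural': 0.75, 'physical': 0.75, 'rules': 1.0, 'full': 0.5}
```

Because a sample must pass every tier to count toward `full`, that rate can never exceed the lowest per-tier rate.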

Validation Hierarchy

The three tiers form a progressive hierarchy:

┌─────────────────────────────────────────┐
│ Tier 1: Structural                      │
│ - BOS/EOS markers                       │
│ - SNAP format (11 tokens)               │
│ - TICK boundaries present               │
└─────────────────────────────────────────┘
              ↓ (builds on)
┌─────────────────────────────────────────┐
│ Tier 2: Physical                        │
│ - Positions in bounds (0-9)             │
│ - Consecutive moves adjacent            │
│ - Manhattan distance = 1                │
└─────────────────────────────────────────┘
              ↓ (builds on)
┌─────────────────────────────────────────┐
│ Tier 3: Rules                           │
│ - EAT → GROW + FOOD_SPAWN               │
│ - DIE → game ends (only SCORE, EOS)     │
│ - Event causality preserved             │
└─────────────────────────────────────────┘

Typical pass rates for model-generated samples:

Tier 1 (Structural): 90-95% - Models learn syntax quickly
Tier 2 (Physical): 70-85% - Spatial coherence is harder
Tier 3 (Rules): 40-70% - Causal reasoning is most challenging

Usage Patterns

Evaluating model quality

from game_grammar.validate import validity_rate

# Compare model checkpoints
# (load_checkpoint and model.generate are placeholders for your own training code)
for epoch in [10, 20, 30, 40, 50]:
    model = load_checkpoint(f"model_epoch_{epoch}.pt")
    samples = model.generate(n=1000)
    rates = validity_rate(samples)

    print(f"Epoch {epoch}:")
    print(f"  Structural: {rates['structural']:.2%}")
    print(f"  Physical: {rates['physical']:.2%}")
    print(f"  Rules: {rates['rules']:.2%}")
    print(f"  Full: {rates['full']:.2%}")

Filtering valid samples

from game_grammar.validate import check_structural, check_physical, check_rules

def is_fully_valid(tokens: list[int]) -> bool:
    """Check if a sequence passes all three tiers."""
    return (
        check_structural(tokens)["structural_pass"]
        and check_physical(tokens)["physical_pass"]
        and check_rules(tokens)["rule_pass"]
    )

# Filter generated samples (model is a placeholder for your own trained model)
generated = model.generate(n=10000)
valid_samples = [s for s in generated if is_fully_valid(s)]
print(f"Valid: {len(valid_samples)} / {len(generated)}")

Diagnosing model failures

from game_grammar.validate import check_structural, check_physical, check_rules

def diagnose_failures(samples: list[list[int]]) -> None:
    """Count the first tier each sample fails, to find the biggest bottleneck."""
    if not samples:
        print("No samples to diagnose")
        return

    fail_tier1 = 0
    fail_tier2 = 0
    fail_tier3 = 0
    all_pass = 0

    for sample in samples:
        s = check_structural(sample)["structural_pass"]
        p = check_physical(sample)["physical_pass"]
        r = check_rules(sample)["rule_pass"]

        if s and p and r:
            all_pass += 1
        elif s and p:
            fail_tier3 += 1  # Passed tiers 1 and 2, failed tier 3
        elif s:
            fail_tier2 += 1  # Passed tier 1, failed tier 2
        else:
            fail_tier1 += 1  # Failed tier 1

    n = len(samples)
    print(f"Pass all: {all_pass/n:.2%}")
    print(f"Fail at tier 1: {fail_tier1/n:.2%}")
    print(f"Fail at tier 2: {fail_tier2/n:.2%}")
    print(f"Fail at tier 3: {fail_tier3/n:.2%}")

# model is a placeholder for your own trained model
generated = model.generate(n=1000)
diagnose_failures(generated)

Validation Best Practices

Training evaluation: Track validity_rate on held-out test samples every N epochs to monitor model quality.

Sample filtering: For downstream tasks, filter to only fully valid sequences to avoid corrupted data.

Passing all three tiers does NOT guarantee the sequence represents a valid gameplay trajectory - it only checks local constraints. Full simulation replay is needed for complete validation.
