Overview
The validation module provides a three-tier system for checking the validity of generated token sequences:
- Structural - Basic syntax and formatting
- Physical - Spatial consistency and movement rules
- Rules - Game logic and state transitions
Each tier builds on the previous one, creating a hierarchy of validity checks.
Functions
check_structural
def check_structural(tokens: list[int]) -> dict[str, bool]
Tier 1: Check basic structural validity - BOS/EOS markers, TICK boundaries, and SNAP format.
Dictionary with the following keys:
has_bos: Sequence starts with BOS
has_eos: Sequence ends with EOS
snap_format: All SNAP records follow correct 11-token format
has_tick: Sequence contains at least one TICK marker
structural_pass: True if all structural checks pass
SNAP format validation:
Each SNAP must follow the exact pattern:
SNAP PLAYER X_ Y_ DIR_ LEN_ FOOD X_ Y_ SCORE V_
Example:
from game_grammar.validate import check_structural
# Valid sequence
tokens = [0, 3, 4, 15, 25, 7, 53, 5, 17, 27, 41, 42, 1] # BOS SNAP ... EOS
result = check_structural(tokens)
print(result)
# {
# "has_bos": True,
# "has_eos": True,
# "snap_format": True,
# "has_tick": False,
# "structural_pass": True
# }
# Invalid: missing BOS
tokens = [2, 11, 1] # TICK INPUT_U EOS
result = check_structural(tokens)
print(result["has_bos"]) # False
print(result["structural_pass"]) # False
check_physical
def check_physical(tokens: list[int]) -> dict[str, bool]
Tier 2: Check physical constraints - positions within bounds and consecutive moves are adjacent.
Dictionary with the following keys:
positions_in_bounds: All X and Y coordinates are 0-9
moves_adjacent: Consecutive MOVE events differ by Manhattan distance of 1
physical_pass: True if all physical checks pass
Movement validation:
Consecutive MOVE events must have positions that differ by exactly 1 in Manhattan distance:
dist = abs(x2 - x1) + abs(y2 - y1)
assert dist == 1 # Must move to adjacent cell
Example:
from game_grammar.validate import check_physical
from game_grammar.vocab import VOCAB
# Valid: moves from (3,4) to (4,4) to (4,5)
tokens = [
VOCAB["MOVE"], VOCAB["X3"], VOCAB["Y4"], # First move
VOCAB["MOVE"], VOCAB["X4"], VOCAB["Y4"], # Adjacent (right)
VOCAB["MOVE"], VOCAB["X4"], VOCAB["Y5"], # Adjacent (down)
]
result = check_physical(tokens)
print(result["moves_adjacent"]) # True
print(result["physical_pass"]) # True
# Invalid: positions out of bounds
tokens_invalid = [
VOCAB["MOVE"], VOCAB["X9"], VOCAB["Y9"],
# Next move would be X10 (out of bounds for 10x10 grid)
]
Position validation only checks tokens in the format X0-X9 and Y0-Y9. It skips malformed tokens.
check_rules
def check_rules(tokens: list[int]) -> dict[str, bool]
Tier 3: Check game rule compliance - event causality and terminal states.
Dictionary with the following keys:
eat_triggers_grow: Each EAT is followed by GROW and FOOD_SPAWN before next TICK
die_ends_game: After DIE_WALL or DIE_SELF, only SCORE and EOS remain
rule_pass: True if all rule checks pass
Rule checks:
-
EAT causality: EAT → GROW + FOOD_SPAWN
- When the snake eats food, it must grow and food must respawn
- These events must occur before the next TICK
-
Death is terminal: DIE_WALL | DIE_SELF → SCORE? → EOS
- After death, the game ends
- Only SCORE updates and EOS are allowed after death
Example:
from game_grammar.validate import check_rules
from game_grammar.vocab import VOCAB
# Valid: EAT followed by GROW and FOOD_SPAWN
tokens = [
VOCAB["BOS"],
VOCAB["TICK"],
VOCAB["EAT"],
VOCAB["GROW"], VOCAB["LEN4"],
VOCAB["FOOD_SPAWN"], VOCAB["X7"], VOCAB["Y3"],
VOCAB["TICK"], # Next tick
VOCAB["EOS"],
]
result = check_rules(tokens)
print(result["eat_triggers_grow"]) # True
# Valid: Death ends game
tokens_death = [
VOCAB["BOS"],
VOCAB["TICK"],
VOCAB["DIE_WALL"],
VOCAB["SCORE"], VOCAB["V5"],
VOCAB["EOS"],
]
result = check_rules(tokens_death)
print(result["die_ends_game"]) # True
print(result["rule_pass"]) # True
# Invalid: EAT without GROW
tokens_invalid = [
VOCAB["TICK"],
VOCAB["EAT"],
VOCAB["TICK"], # Missing GROW and FOOD_SPAWN
]
result = check_rules(tokens_invalid)
print(result["eat_triggers_grow"]) # False
validity_rate
def validity_rate(samples: list[list[int]]) -> dict[str, float]
Compute pass rates for all three tiers across a batch of samples.
List of token sequences to validate.
Dictionary with pass rates (0.0 to 1.0) for:
structural: Tier 1 pass rate
physical: Tier 2 pass rate
rules: Tier 3 pass rate
full: All tiers pass rate
Example:
from game_grammar.validate import validity_rate
from game_grammar.data import collect_episodes
from game_grammar.agents import RandomAgent
# Collect episodes from real gameplay
episodes = collect_episodes(
n=100,
agent_mix=[(RandomAgent(), 1.0)],
seed=42
)
# Check validity
rates = validity_rate(episodes)
print(rates)
# {
# "structural": 1.0, # 100% pass
# "physical": 1.0, # 100% pass
# "rules": 1.0, # 100% pass
# "full": 1.0 # 100% pass
# }
# Validate model-generated samples
from your_model import generate_samples
generated = generate_samples(n=1000)
rates = validity_rate(generated)
print(rates)
# {
# "structural": 0.92, # 92% pass tier 1
# "physical": 0.78, # 78% pass tier 2
# "rules": 0.45, # 45% pass tier 3
# "full": 0.45 # 45% pass all tiers
# }
The full validity rate is the most important metric - it represents sequences that pass all three tiers.
Validation Hierarchy
The three tiers form a progressive hierarchy:
┌─────────────────────────────────────────┐
│ Tier 1: Structural │
│ - BOS/EOS markers │
│ - SNAP format (11 tokens) │
│ - TICK boundaries present │
└─────────────────────────────────────────┘
↓ (builds on)
┌─────────────────────────────────────────┐
│ Tier 2: Physical │
│ - Positions in bounds (0-9) │
│ - Consecutive moves adjacent │
│ - Manhattan distance = 1 │
└─────────────────────────────────────────┘
↓ (builds on)
┌─────────────────────────────────────────┐
│ Tier 3: Rules │
│ - EAT → GROW + FOOD_SPAWN │
│ - DIE → game ends (only SCORE, EOS) │
│ - Event causality preserved │
└─────────────────────────────────────────┘
Typical pass rates for model-generated samples:
- Tier 1 (Structural): 90-95% - Models learn syntax quickly
- Tier 2 (Physical): 70-85% - Spatial coherence is harder
- Tier 3 (Rules): 40-70% - Causal reasoning is most challenging
Usage Patterns
Evaluating model quality
from game_grammar.validate import validity_rate
# Compare model checkpoints
for epoch in [10, 20, 30, 40, 50]:
model = load_checkpoint(f"model_epoch_{epoch}.pt")
samples = model.generate(n=1000)
rates = validity_rate(samples)
print(f"Epoch {epoch}:")
print(f" Structural: {rates['structural']:.2%}")
print(f" Physical: {rates['physical']:.2%}")
print(f" Rules: {rates['rules']:.2%}")
print(f" Full: {rates['full']:.2%}")
Filtering valid samples
from game_grammar.validate import check_structural, check_physical, check_rules
def is_fully_valid(tokens: list[int]) -> bool:
"""Check if a sequence passes all three tiers."""
return (
check_structural(tokens)["structural_pass"]
and check_physical(tokens)["physical_pass"]
and check_rules(tokens)["rule_pass"]
)
# Filter generated samples
generated = model.generate(n=10000)
valid_samples = [s for s in generated if is_fully_valid(s)]
print(f"Valid: {len(valid_samples)} / {len(generated)}")
Diagnosing model failures
from game_grammar.validate import check_structural, check_physical, check_rules
def diagnose_failures(samples: list[list[int]]):
"""Identify which tier is the biggest bottleneck."""
tier1_only = 0
tier2_only = 0
tier3_only = 0
all_pass = 0
for sample in samples:
s = check_structural(sample)["structural_pass"]
p = check_physical(sample)["physical_pass"]
r = check_rules(sample)["rule_pass"]
if s and p and r:
all_pass += 1
elif s and p:
tier3_only += 1 # Failed tier 3
elif s:
tier2_only += 1 # Failed tier 2
else:
tier1_only += 1 # Failed tier 1
n = len(samples)
print(f"Pass all: {all_pass/n:.2%}")
print(f"Fail at tier 1: {tier1_only/n:.2%}")
print(f"Fail at tier 2: {tier2_only/n:.2%}")
print(f"Fail at tier 3: {tier3_only/n:.2%}")
generated = model.generate(n=1000)
diagnose_failures(generated)
Validation Best Practices
Training evaluation: Track validity_rate on held-out test samples every N epochs to monitor model quality.
Sample filtering: For downstream tasks, filter to only fully valid sequences to avoid corrupted data.
Passing all three tiers does NOT guarantee the sequence represents a valid gameplay trajectory - it only checks local constraints. Full simulation replay is needed for complete validation.