This demo uses a deliberately flawed deterministic policy:
```javascript
function badPolicy(pos) {
  const key = `${pos[0]},${pos[1]}`;
  const policy = {
    '0,0': 1, // At (0,0), go right
    '0,1': 1, // At (0,1), go right
    '0,2': 2, // At (0,2), go down
    '1,2': 3, // At (1,2), go left ← BUG: should be 2 (down)
    '1,0': 0, // At (1,0), go up
  };
  return policy[key] ?? 1; // Default to right if state not in policy
}
```
From position (1,2), the agent should go down (action 2) to reach the goal at (2,2). Instead, the policy tells it to go left (action 3), which tries to move into the wall at (1,1). Because the wall blocks movement, the agent stays at (1,2) and tries the same action again on the next step, creating an infinite loop.
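The loop is easy to reproduce with a short self-contained trace. This sketch assumes `[row, col]` positions, actions 0=up/1=right/2=down/3=left, and a 3×3 grid with the wall at (1,1) and goal at (2,2); the `step` helper is an illustrative stand-in for the demo's environment:

```javascript
function badPolicy(pos) {
  const policy = { '0,0': 1, '0,1': 1, '0,2': 2, '1,2': 3, '1,0': 0 };
  return policy[`${pos[0]},${pos[1]}`] ?? 1; // default: go right
}

// Illustrative transition function: moves off the grid or into the
// wall at (1,1) leave the agent where it is.
function step(pos, action) {
  const d = [[-1, 0], [0, 1], [1, 0], [0, -1]][action]; // up, right, down, left
  const next = [pos[0] + d[0], pos[1] + d[1]];
  const blocked = next.some((c) => c < 0 || c > 2) ||
    (next[0] === 1 && next[1] === 1);
  return blocked ? pos : next;
}

let pos = [0, 0];
for (let t = 0; t < 20 && !(pos[0] === 2 && pos[1] === 2); t++) {
  pos = step(pos, badPolicy(pos));
}
// Even after 20 steps the agent is parked at (1,2), not the goal (2,2)
```

The trace reaches (1,2) in three steps and then never moves again: the chosen action is blocked, the state is unchanged, and the deterministic policy chooses the same action forever.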
The policy is purely deterministic with no exploration:
```javascript
return policy[key] ?? 1; // Always returns the same action for a given state
```
Once the agent reaches position (1,2), it will always choose action 3 (left). There’s no randomness or exploration to help it discover the correct action.
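One standard mitigation (not part of the original demo) is to wrap the policy in epsilon-greedy exploration, so a stuck agent eventually samples a different action. The wrapper name and signature here are illustrative:

```javascript
// Wrap a deterministic policy so that with probability `epsilon`
// a uniformly random action (0–3) is taken instead.
function epsilonGreedy(policy, epsilon = 0.1) {
  return function (pos) {
    if (Math.random() < epsilon) {
      return Math.floor(Math.random() * 4); // explore: random action 0–3
    }
    return policy(pos); // exploit: follow the base policy
  };
}
```

With `epsilon = 0.1`, the agent at (1,2) would still pick the buggy "left" 90% of the time, but would break out of the loop within a few dozen steps in expectation. Exploration masks the bug rather than fixing it, which is why the policy table itself should still be corrected.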
Check if an action will cause progress before taking it:
```javascript
function validateAction(pos, action) {
  const nextPos = simulateMove(pos, action);
  // Don't take actions that keep you in the same place
  if (nextPos[0] === pos[0] && nextPos[1] === pos[1]) {
    return false;
  }
  return true;
}
```
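`validateAction` depends on a `simulateMove` helper that predicts the next position without actually moving. A minimal sketch, assuming the same 3×3 grid with a wall at (1,1) and the 0=up/1=right/2=down/3=left action encoding:

```javascript
const SIZE = 3;                 // assumed 3×3 grid
const WALL = '1,1';             // wall cell from the demo
const DELTAS = [[-1, 0], [0, 1], [1, 0], [0, -1]]; // up, right, down, left

// Returns the position the agent would end up in, without side effects.
// Moves off the grid or into the wall leave the agent in place.
function simulateMove(pos, action) {
  const [dr, dc] = DELTAS[action];
  const next = [pos[0] + dr, pos[1] + dc];
  const offGrid = next[0] < 0 || next[0] >= SIZE ||
                  next[1] < 0 || next[1] >= SIZE;
  if (offGrid || `${next[0]},${next[1]}` === WALL) return pos;
  return next;
}
```

With this helper, `validateAction([1, 2], 3)` returns `false` (left is blocked by the wall) while `validateAction([1, 2], 2)` returns `true`, so the no-progress check catches exactly the buggy transition.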
```javascript
function goodPolicy(pos) {
  const key = `${pos[0]},${pos[1]}`;
  const policy = {
    '0,0': 1, // At (0,0), go right
    '0,1': 1, // At (0,1), go right
    '0,2': 2, // At (0,2), go down
    '1,2': 2, // At (1,2), go down ← FIXED
    '1,0': 0, // At (1,0), go up
    '2,0': 1, // At (2,0), go right
    '2,1': 1, // At (2,1), go right
  };
  return policy[key] ?? 1;
}
```
The only change is the `'1,2'` entry: `'1,2': 2` instead of `'1,2': 3`. This single fix allows the agent to reach the goal.
In real RL systems, policies are learned from data and may contain subtle bugs that aren’t obvious from inspection. Always test policies thoroughly and include safeguards.
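One such safeguard is a runtime check that aborts an episode instead of spinning forever. The sketch below (names are illustrative, not from the demo) caps the step count and, because the policy is deterministic, treats any revisited state as proof of a cycle:

```javascript
// Run one episode with two safeguards: a hard step limit, and cycle
// detection (for a deterministic policy, revisiting a state means the
// exact same trajectory will repeat forever).
function runEpisode(policy, step, start, goalKey, maxSteps = 100) {
  const seen = new Set();
  let pos = start;
  for (let t = 0; t < maxSteps; t++) {
    const key = `${pos[0]},${pos[1]}`;
    if (key === goalKey) return { reached: true, steps: t };
    if (seen.has(key)) return { reached: false, reason: 'loop', steps: t };
    seen.add(key);
    pos = step(pos, policy(pos));
  }
  return { reached: false, reason: 'timeout', steps: maxSteps };
}
```

Wrapping the buggy policy this way turns a silent hang into an explicit `{ reached: false, reason: 'loop' }` result that tests can assert on. Note the cycle check is only sound for deterministic policies; with exploration, revisiting a state is normal and only the step cap applies.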