The envStep Function
The envStep function simulates the environment’s response to an agent action. It implements the core transition dynamics of the grid world. Here’s the complete implementation:
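The original listing is not shown; the sketch below is reconstructed from the steps described in this section (SIZE and WALL match names used later in the text; GOAL is an assumed constant name):

```javascript
const SIZE = 3;      // 3×3 grid
const WALL = [1, 1]; // immovable wall cell
const GOAL = [2, 2]; // goal cell; GOAL is an assumed name

function envStep(pos, action) {
  let [r, c] = pos;
  // Step 1: compute the intended move
  if (action === 0) r -= 1;      // up
  else if (action === 1) c += 1; // right
  else if (action === 2) r += 1; // down
  else if (action === 3) c -= 1; // left
  // Step 2: clamp to the grid boundaries
  r = Math.max(0, Math.min(SIZE - 1, r));
  c = Math.max(0, Math.min(SIZE - 1, c));
  // Step 3: wall collision: stay in place
  if (r === WALL[0] && c === WALL[1]) [r, c] = pos;
  // Step 4: sparse reward, +10 at the goal, -0.1 otherwise
  const done = r === GOAL[0] && c === GOAL[1];
  const reward = done ? 10 : -0.1;
  // Step 5: return the new state
  return { pos: [r, c], reward, done };
}
```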
Function Signature
Parameters:
- pos: Current position as a [row, col] array
- action: Integer action (0=up, 1=right, 2=down, 3=left)
Returns:
- pos: New position as a [row, col] array
- reward: Numerical reward for this transition
- done: Boolean indicating episode termination
Step 1: Movement Calculation
The function first calculates the intended new position based on the action: it destructures [row, col] into r and c, then modifies them according to the action.
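A sketch of this calculation, using the action encoding from the signature above (move is a hypothetical helper name):

```javascript
// Step 1 sketch: compute the intended new position from the action.
function move(pos, action) {
  let [r, c] = pos;              // destructure [row, col]
  if (action === 0) r -= 1;      // up
  else if (action === 1) c += 1; // right
  else if (action === 2) r += 1; // down
  else if (action === 3) c -= 1; // left
  return [r, c];
}
```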
Step 2: Boundary Checking
The grid has hard boundaries at [0, SIZE-1]. The function clamps positions to stay within bounds:
- Math.min(SIZE - 1, r): Ensures r ≤ 2 (prevents going off the bottom/right)
- Math.max(0, ...): Ensures r ≥ 0 (prevents going off the top/left)
For example, if the agent is at (0,1) and tries to go up (action 0), r would become -1, but Math.max(0, -1) clamps it back to 0.
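The clamping can be sketched as follows, with SIZE = 3 for this grid (clamp is a hypothetical helper name):

```javascript
const SIZE = 3; // 3×3 grid
// Hypothetical helper: keep (r, c) inside the grid boundaries.
function clamp(r, c) {
  return [
    Math.max(0, Math.min(SIZE - 1, r)),
    Math.max(0, Math.min(SIZE - 1, c)),
  ];
}
```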
Step 3: Wall Collision Detection
The grid has an immovable wall at position WALL = [1,1]. If the agent tries to enter the wall, it stays in place:
The function checks whether the new position (r, c) equals the wall position (1, 1). If so, it reverts to the original position pos.
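A sketch of this check, with WALL from the text (checkWall is a hypothetical helper name):

```javascript
const WALL = [1, 1]; // immovable wall position
// Hypothetical helper: revert to the original position on a wall hit.
function checkWall(pos, r, c) {
  if (r === WALL[0] && c === WALL[1]) return [...pos]; // collision: stay put
  return [r, c];
}
```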
Examples:
- Agent at (1,0), action right (1) → tries to go to (1,1) → wall collision → stays at (1,0)
- Agent at (1,2), action left (3) → tries to go to (1,1) → wall collision → stays at (1,2) ⚠️ This is where the cycle occurs
Step 4: Reward Structure
The reward function is sparse with a large goal reward:
- Reaching the goal (2,2): Reward = +10, episode ends (done = true)
- Any other step: Reward = -0.1 (small negative reward to encourage shorter paths)
This structure:
- Incentivizes reaching the goal quickly
- Penalizes wandering or getting stuck
- Creates a shortest-path optimal policy
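The reward logic above can be sketched as follows (GOAL and rewardFor are assumed names; the values are from this section):

```javascript
const GOAL = [2, 2]; // goal cell; GOAL is an assumed name
// Hypothetical helper: sparse reward, +10 at the goal, -0.1 otherwise.
function rewardFor(r, c) {
  const done = r === GOAL[0] && c === GOAL[1];
  return { reward: done ? 10 : -0.1, done };
}
```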
Step 5: Return Format
The function returns an object with three fields:
- pos: The new position after applying the action and checking collisions
- reward: The immediate reward for this transition
- done: Boolean flag (true only when the goal is reached)
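A caller can destructure this object directly; the literal below simply mirrors the shape described:

```javascript
// Example result object matching the shape described in this section.
const step = { pos: [2, 2], reward: 10, done: true };
const { pos, reward, done } = step; // pull out the three fields
```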
Example Transitions
Here are some example calls to envStep:
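The original calls are not shown; a plausible set, using a compact envStep reconstructed from this section so the examples run standalone (GOAL is an assumed name):

```javascript
const SIZE = 3, WALL = [1, 1], GOAL = [2, 2]; // GOAL is an assumed name
function envStep(pos, action) {
  let [r, c] = pos;
  if (action === 0) r -= 1; else if (action === 1) c += 1;
  else if (action === 2) r += 1; else if (action === 3) c -= 1;
  r = Math.max(0, Math.min(SIZE - 1, r));
  c = Math.max(0, Math.min(SIZE - 1, c));
  if (r === WALL[0] && c === WALL[1]) [r, c] = pos;
  const done = r === GOAL[0] && c === GOAL[1];
  return { pos: [r, c], reward: done ? 10 : -0.1, done };
}

console.log(envStep([0, 0], 1)); // normal move: pos [0,1], reward -0.1, done false
console.log(envStep([0, 1], 0)); // boundary clamp: stays at [0,1]
console.log(envStep([1, 0], 1)); // wall collision: stays at [1,0]
console.log(envStep([2, 1], 1)); // goal reached: pos [2,2], reward 10, done true
```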
Grid World Visualization
Here’s the 3×3 grid with coordinates, with the wall at (1,1) and the goal at (2,2):

    (0,0)  (0,1)  (0,2)
    (1,0)  WALL   (1,2)
    (2,0)  (2,1)  GOAL

Deterministic vs Stochastic Environments
This implementation is deterministic: the same action in the same state always produces the same result. In a stochastic environment, the same action could instead lead to different outcomes, for example if moves occasionally slip in a random direction.

The envStep function is pure: it doesn’t modify the input pos array and has no side effects. This makes it easy to test and reason about. You can call it repeatedly with the same inputs and always get the same output.

Extending the Environment
To create more complex environments, you could:
- Add multiple walls: Check against an array of wall positions
- Variable rewards: Different cells give different rewards
- Moving obstacles: Walls that change position over time
- Larger grids: Increase the SIZE constant
- Diagonal movement: Add actions 4-7 for diagonal moves
- Terminal states: Multiple goal positions with different rewards
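As a sketch of the first extension, the wall collision check could iterate over an array of wall positions (WALLS and hitsWall are hypothetical names; the layout is an example, not from the original):

```javascript
// Hypothetical extension: multiple walls instead of a single WALL.
const WALLS = [[1, 1], [0, 2]]; // example wall layout
function hitsWall(r, c) {
  // true if (r, c) matches any wall position
  return WALLS.some(([wr, wc]) => wr === r && wc === c);
}
```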