These techniques don’t directly prevent cycles the way a max-steps cap does; instead, they discourage cycles through the reward structure, making loops less attractive to the agent during learning.
They are training-time techniques that shape the agent’s learning process, unlike max steps (termination) or cycle detection (forced exploration), which actively intervene during execution.
The RL Cycle Demo uses a step penalty of -0.1 per action:
```javascript
// From index.html:668-684
function envStep(pos, action) {
  let [r, c] = pos;
  if (action === 0) r--;      // up
  else if (action === 1) c++; // right
  else if (action === 2) r++; // down
  else if (action === 3) c--; // left

  // Clip to grid bounds
  r = Math.max(0, Math.min(SIZE - 1, r));
  c = Math.max(0, Math.min(SIZE - 1, c));

  // Wall collision
  if (r === WALL[0] && c === WALL[1]) {
    r = pos[0];
    c = pos[1];
  }

  const done = (r === GOAL[0] && c === GOAL[1]);
  const reward = done ? 10 : -0.1; // ← Step penalty here

  return { pos: [r, c], reward, done };
}
```
The demo’s reward structure:
Reach goal: +10.0
Each step: -0.1
Net reward for optimal 5-step path: +10.0 - (5 × 0.1) = +9.5
Net reward for 30-step loop: +0.0 - (30 × 0.1) = -3.0
In the demo’s left panel, the agent accumulates negative reward as it loops. After 30 steps stuck at (1,2), the total reward is approximately -3.0 — a clear signal that this behavior is bad.
Curiosity-driven exploration gives the agent an intrinsic reward for visiting novel or rare states, encouraging systematic exploration and discouraging repetitive behavior.
```javascript
class CuriosityModule {
  constructor(noveltyBonus = 1.0) {
    this.stateVisitCounts = new Map();
    this.noveltyBonus = noveltyBonus;
  }

  getIntrinsicReward(state) {
    const stateKey = JSON.stringify(state);
    const visitCount = this.stateVisitCounts.get(stateKey) || 0;

    // More visits → less intrinsic reward
    const intrinsicReward = this.noveltyBonus / Math.sqrt(visitCount + 1);

    // Record visit
    this.stateVisitCounts.set(stateKey, visitCount + 1);

    return intrinsicReward;
  }

  getCombinedReward(extrinsicReward, state) {
    const intrinsicReward = this.getIntrinsicReward(state);
    return extrinsicReward + intrinsicReward;
  }
}

// Usage
const curiosity = new CuriosityModule(0.5);
const result = env.step(action);
const totalReward = curiosity.getCombinedReward(result.reward, result.state);

// First visit to a state: intrinsicReward = 0.5 / sqrt(1) ≈ 0.50
// Second visit:           intrinsicReward = 0.5 / sqrt(2) ≈ 0.35
// Third visit:            intrinsicReward = 0.5 / sqrt(3) ≈ 0.29
// ...
```
First time agent visits state (1,2): High intrinsic reward
Second time at (1,2): Lower intrinsic reward
Third time at (1,2): Even lower
Tenth time at (1,2): Negligible intrinsic reward
The diminishing returns discourage the agent from revisiting the same states repeatedly, naturally pushing it away from cycles.
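The decay above follows directly from the count-based formula. Here is a standalone sketch (independent of the CuriosityModule class) showing the bonus shrinking over ten visits to the same state:

```javascript
// Standalone demo of the count-based novelty bonus.
const visitCounts = new Map();

function intrinsicReward(state, noveltyBonus = 0.5) {
  const key = JSON.stringify(state);
  const count = visitCounts.get(key) || 0;
  visitCounts.set(key, count + 1);
  return noveltyBonus / Math.sqrt(count + 1);
}

// Ten visits to the same state (1,2): the bonus decays toward zero.
const rewards = [];
for (let i = 0; i < 10; i++) rewards.push(intrinsicReward([1, 2]));
console.log(rewards.map((r) => r.toFixed(2)).join(", "));
// "0.50, 0.35, 0.29, 0.25, 0.22, 0.20, 0.19, 0.18, 0.17, 0.16"
```

By the tenth visit the bonus is under a third of its initial value, so a looping agent steadily loses its exploration incentive for those states.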
Curiosity-driven exploration is related to cycle detection, but works through reward shaping during training rather than forced exploration during execution.
State-of-the-art curiosity methods, such as the Intrinsic Curiosity Module (ICM), use neural networks to predict state transitions. The agent receives intrinsic reward based on prediction error — unpredictable states are considered novel:
```javascript
// Conceptual implementation (a real version uses neural networks)
class ICM {
  constructor(forwardModel) {
    this.forwardModel = forwardModel; // Neural net that predicts the next state
  }

  getIntrinsicReward(state, action, nextState) {
    // Predict what the next state should be
    const predictedNextState = this.forwardModel.predict(state, action);

    // Calculate prediction error
    const predictionError = this.computeError(predictedNextState, nextState);

    // High error = novel/surprising = high intrinsic reward
    return predictionError;
  }

  computeError(predicted, actual) {
    // Simplified: would use the neural net's loss in practice
    return Math.abs(predicted - actual);
  }

  train(state, action, nextState) {
    // Update the forward model to better predict transitions
    this.forwardModel.fit(state, action, nextState);
  }
}
```
A discount factor doesn’t prevent loops; it only makes distant rewards less valuable. The agent can still get stuck in an infinite loop if its policy is bad.
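To see why, here is a small sketch (not part of the demo) computing the discounted return of an agent that loops forever under the demo’s -0.1 step penalty. The return converges to a finite value, but nothing here ever terminates the loop:

```javascript
// Discounted return of looping forever with a -0.1 step penalty:
// G = Σ γ^t · (-0.1) = -0.1 / (1 - γ)  (geometric series)
function loopingReturn(gamma, stepPenalty = -0.1, steps = 1000) {
  let g = 0;
  for (let t = 0; t < steps; t++) {
    g += Math.pow(gamma, t) * stepPenalty;
  }
  return g;
}

console.log(loopingReturn(0.9).toFixed(2));  // ≈ -1.00 (bounded: -0.1 / 0.1)
console.log(loopingReturn(0.99).toFixed(2)); // ≈ -10.00
// The return is finite, but nothing here stops the loop itself.
```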
The RL Cycle Demo uses max steps (termination) + cycle detection (active escape) + step penalty (reward shaping). This combination provides both guaranteed termination and active loop escape.
Creates an effective planning horizon (~44 steps for γ = 0.9, the point where γ^n drops below 1%)
A standard RL technique; it doesn’t specifically target cycles
A simple parameter, but its effect on loops is indirect
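The ~44-step figure comes from asking when γ^n falls below a small threshold (1% here); a quick sketch of that calculation:

```javascript
// Effective planning horizon: the step count n at which gamma^n
// drops below a threshold (rewards beyond n are essentially invisible).
function effectiveHorizon(gamma, threshold = 0.01) {
  return Math.ceil(Math.log(threshold) / Math.log(gamma));
}

console.log(effectiveHorizon(0.9));  // 44
console.log(effectiveHorizon(0.99)); // 459
```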
Use max steps + step penalty as your baseline. Add cycle detection for active escape in deployment. Consider curiosity for training in complex environments.
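A minimal sketch of that recommended baseline, assuming a hypothetical runEpisode driver (the names maxSteps, cycleWindow, and chooseGreedy are illustrative, not from the demo):

```javascript
// Illustrative episode loop combining the three safeguards:
// 1. max steps (guaranteed termination), 2. cycle detection (random
// escape action), 3. step penalty (comes from the environment's reward).
function runEpisode(envStep, chooseGreedy, start, { maxSteps = 30, cycleWindow = 8 } = {}) {
  let pos = start;
  let totalReward = 0;
  const recent = [];

  for (let step = 0; step < maxSteps; step++) { // 1. Max steps caps the episode
    const key = JSON.stringify(pos);
    const looping = recent.filter((k) => k === key).length >= 3;

    // 2. Cycle detection: if this state appeared 3+ times recently,
    //    take a random action to escape instead of the greedy one.
    const action = looping ? Math.floor(Math.random() * 4) : chooseGreedy(pos);

    // 3. The environment's reward already includes the -0.1 step penalty.
    const { pos: next, reward, done } = envStep(pos, action);
    totalReward += reward;

    recent.push(key);
    if (recent.length > cycleWindow) recent.shift();

    pos = next;
    if (done) break;
  }
  return totalReward;
}
```

Each safeguard covers a different failure mode: the cap bounds worst-case runtime, the escape action breaks loops early, and the penalty teaches the policy to avoid them in the first place.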
View Full Demo Source
See the complete implementation including step penalty (-0.1) and max steps (30) at index.html:668-684