What is ε-Greedy?
ε-Greedy (epsilon-greedy) is a simple exploration strategy where the agent takes a random action with probability ε (epsilon) and follows its policy with probability (1 - ε). This introduces controlled randomness that prevents deterministic cycles and helps the agent explore alternative paths.

How It Works
The algorithm is dead simple:

1. Generate Random Number

On each step, roll a random number between 0 and 1.

2. Compare Against Epsilon

If the roll is less than ε, explore: take a uniformly random action.

3. Otherwise, Exploit Policy

If the roll is ≥ ε, follow your learned policy.

The name “ε-greedy” comes from being greedy (following the policy) most of the time, with ε probability of exploration.
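The three steps above reduce to a few lines of Python. A minimal sketch (`policy_action` and `actions` are placeholders, not names from any specific library):

```python
import random

def epsilon_greedy(policy_action, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the policy's action."""
    roll = random.random()                 # 1. random number in [0, 1)
    if roll < epsilon:                     # 2. compare against epsilon
        return random.choice(actions)      #    explore: uniform random action
    return policy_action                   # 3. exploit the learned policy
```

With the default ε=0.1, this returns the policy's action roughly 90% of the time and a uniformly random action otherwise.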
Complete Implementation
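A full example can be sketched as a tiny grid-world episode loop. The grid, `ACTIONS`, and `run_episode` names here are my own illustration, not taken from the demo:

```python
import random

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def select_action(policy, state, epsilon):
    """ε-greedy selection over a state -> action policy table."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return policy[state]

def run_episode(policy, start, goal, size=3, epsilon=0.1, max_steps=100):
    """Follow the policy with ε-greedy exploration on a size x size grid.

    The policy table should cover every cell, since exploration can
    move the agent anywhere. Returns the step count on success, None otherwise.
    """
    state = start
    for step in range(max_steps):
        if state == goal:
            return step
        action = select_action(policy, state, epsilon)
        dr, dc = MOVES[action]
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < size and 0 <= c < size:   # walls: off-grid moves do nothing
            state = (r, c)
    return None
```

With ε=0 and a bad policy the agent can get stuck forever; with ε>0 the random rolls eventually push it off the deterministic path.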
A full working example needs only a random roll, a comparison against ε, and a policy lookup.

ε-Greedy vs Cycle Detection
These techniques both introduce exploration, but work differently:

| Aspect | ε-Greedy | Cycle Detection |
|---|---|---|
| Trigger | Random (every step) | Conditional (when cycle detected) |
| Frequency | Constant (ε × steps) | Variable (only when stuck) |
| Randomness | Fully random action | Random alternative action |
| Memory | None | Requires state history |
| Phase | Training only | Training or deployment |
| Purpose | General exploration | Escape specific cycles |
Key Difference
ε-greedy explores proactively whether needed or not, while cycle detection reacts to detected loops.

When ε-Greedy Would Help the Demo
If the demo used ε-greedy with ε=0.1 instead of cycle detection:
- Agent reaches position (1,2)
- Policy says: move left
- 90% chance: Follows policy → moves left → hits wall → stuck (repeats)
- 10% chance: Explores randomly → might try down → reaches goal!
Why Cycle Detection is Better for the Demo
Cycle detection is deterministic and targeted:
- Agent reaches (1,2) and tries left → stuck
- Agent returns to (1,2) second time → cycle detected
- Forces exploration excluding left → guaranteed to try something else
- Reaches goal in same episode
Choosing the Right Epsilon
The value of ε controls the exploration-exploitation trade-off.

Common Epsilon Values
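Typical values, as rough rules of thumb (my own summary, not from a specific benchmark):

```python
# Common ε choices and when they make sense (rules of thumb, not hard rules)
EPSILON_PRESETS = {
    0.01: "near-greedy: policy is trusted, keep a trickle of exploration",
    0.05: "light exploration: fine-tuning an already-decent policy",
    0.10: "balanced default: the most common starting point",
    0.30: "heavy exploration: early training or deceptive environments",
    1.00: "pure random: no useful policy yet, uniform exploration",
}
```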
Epsilon Decay
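A common schedule multiplies ε by a constant factor after each episode, with a floor so exploration never stops entirely. A sketch (the constants here are illustrative, not tuned values):

```python
def decay_epsilon(epsilon, decay_rate=0.995, min_epsilon=0.01):
    """Multiplicative decay with a floor so some exploration always remains."""
    return max(min_epsilon, epsilon * decay_rate)

# Start fully exploratory, end near-greedy:
epsilon = 1.0
for episode in range(1000):
    # ... run one training episode using the current epsilon ...
    epsilon = decay_epsilon(epsilon)
```

After 1000 episodes with these constants, ε has decayed to the 0.01 floor.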
In practice, you often want high exploration early (when the policy is bad) and low exploration later (when the policy is good). This is called epsilon decay.

Pros and Cons
✅ Advantages
- Dead simple: 3-line implementation
- No memory overhead: Stateless randomness
- Universal: Works with any policy, any environment
- Tunable: Adjust ε to control exploration amount
- Prevents deterministic cycles: Adds stochasticity
- Proven: Used in countless RL algorithms
❌ Disadvantages
- Blind exploration: Random actions may be terrible
- Wastes actions: Explores even when policy is good
- No escape guarantee: Might not randomly find solution
- Training only: Usually disabled in deployment
- Inefficient: Explores states that are well-understood
When to Use ε-Greedy
✅ Good Fit
- Training phase: When learning a policy from scratch
- Small action spaces: Random actions are reasonable
- Simple exploration: Don’t need sophisticated strategies
- Continuous learning: Agent keeps improving over time
- Benchmark algorithms: DQN, SARSA, Q-learning all use it
❌ Not Ideal
- Deployment: Usually want deterministic behavior
- Large action spaces: Random actions very unlikely to help
- Safety-critical: Can’t afford random actions
- Sparse rewards: Random exploration too inefficient
- Cycle escape: Cycle detection is more targeted
Combining with Other Techniques
You can use ε-greedy alongside other safeguards. This combination gives you targeted escape (cycle detection) plus general exploration (ε-greedy), with guaranteed termination (max steps) as your final safeguard.
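A sketch of how the layers could compose in one action-selection step. The cycle-detection logic here is simplified to a visit count, and all names are my own:

```python
import random

def step_with_safeguards(policy, state, history, actions, epsilon=0.1):
    """Layered action choice: ε-greedy first, then cycle detection."""
    if random.random() < epsilon:              # layer 1: general exploration
        return random.choice(actions)
    action = policy[state]
    if history.count(state) >= 2:              # layer 2: cycle detected
        alternatives = [a for a in actions if a != action]
        return random.choice(alternatives)     # force a different action
    return action                              # normal exploitation

# layer 3: guaranteed termination lives in the caller's loop:
# for step in range(MAX_STEPS): ...
```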
Enhanced Implementation with Logging
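One way to instrument the selection step with Python's standard `logging` module. This is a sketch; the message format is my own:

```python
import logging
import random

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("epsilon_greedy")

def epsilon_greedy_logged(policy_action, actions, epsilon=0.1):
    """ε-greedy selection that logs every explore/exploit decision."""
    roll = random.random()
    if roll < epsilon:
        action = random.choice(actions)
        log.info("EXPLORE roll=%.3f < eps=%.2f -> random action %r",
                 roll, epsilon, action)
    else:
        action = policy_action
        log.info("EXPLOIT roll=%.3f >= eps=%.2f -> policy action %r",
                 roll, epsilon, action)
    return action
```

Logging each decision makes it easy to verify that exploration happens at roughly the expected rate.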
Real-World Example: DQN
Deep Q-Network (DQN), the famous algorithm that learned to play Atari games, uses ε-greedy exploration. Its success demonstrated that ε-greedy, despite its simplicity, is sufficient for learning complex behaviors when combined with function approximation (neural networks).
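The original DQN setup annealed ε linearly from 1.0 to 0.1 over the first million frames and held it fixed afterwards. That schedule is a few lines to reproduce:

```python
def dqn_epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linear annealing of ε over the first anneal_frames, then constant."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * (frame / anneal_frames)
```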
Interactive Demo Comparison
The RL Cycle Demo doesn’t use ε-greedy — it uses cycle detection instead. But you can imagine how ε-greedy would perform:

With ε-Greedy
- Agent might escape loop randomly
- Requires multiple attempts/episodes
- Success depends on ε value
- No memory overhead
With Cycle Detection
- Agent escapes loop deterministically
- Succeeds in single episode
- Guaranteed to try alternative
- Requires state history
Key Takeaways
- ε-greedy adds controlled randomness to prevent deterministic cycles
- Takes random action with probability ε, follows policy with probability 1-ε
- Common ε values: 0.1 (10%) for balanced exploration
- Use epsilon decay to shift from exploration to exploitation over time
- Best for training phase, often disabled in deployment
- Simpler than cycle detection but less targeted
- Core component of classic RL algorithms (DQN, SARSA, Q-learning)
Explore the Demo
See how targeted cycle detection compares to what random ε-greedy exploration would achieve