What is ε-Greedy?
ε-Greedy (epsilon-greedy) is a simple exploration strategy where the agent takes a random action with probability ε (epsilon) and follows its policy with probability (1 - ε). This introduces controlled randomness that prevents deterministic cycles and helps the agent explore alternative paths.

How It Works
The algorithm is dead simple:

1. Generate Random Number

On each step, roll a random number between 0 and 1.

2. Compare Against Epsilon

If the roll is less than ε, explore: take a uniformly random action.

3. Otherwise, Exploit Policy

If the roll is ≥ ε, follow your learned policy.

The name “ε-greedy” comes from being greedy (following the policy) most of the time, with ε probability of exploration.
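The three steps above reduce to a few lines of Python. A minimal sketch (`policy_action` and `actions` are placeholders, not names from any specific library):

```python
import random

def epsilon_greedy(policy_action, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the policy's action."""
    roll = random.random()                 # 1. random number in [0, 1)
    if roll < epsilon:                     # 2. compare against epsilon
        return random.choice(actions)      #    explore: uniform random action
    return policy_action                   # 3. exploit the learned policy
```

With the default ε=0.1, this returns the policy's action roughly 90% of the time and a uniformly random action otherwise.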
Complete Implementation
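A full example can be sketched as a tiny grid-world episode loop. The grid, `ACTIONS`, and `run_episode` names here are my own illustration, not taken from the demo:

```python
import random

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def select_action(policy, state, epsilon):
    """ε-greedy selection over a state -> action policy table."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return policy[state]

def run_episode(policy, start, goal, size=3, epsilon=0.1, max_steps=100):
    """Follow the policy with ε-greedy exploration on a size x size grid.

    The policy table should cover every cell, since exploration can
    move the agent anywhere. Returns the step count on success, None otherwise.
    """
    state = start
    for step in range(max_steps):
        if state == goal:
            return step
        action = select_action(policy, state, epsilon)
        dr, dc = MOVES[action]
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < size and 0 <= c < size:   # walls: off-grid moves do nothing
            state = (r, c)
    return None
```

With ε=0 and a bad policy the agent can get stuck forever; with ε>0 the random rolls eventually push it off the deterministic path.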
A full working example needs only a random roll, a comparison against ε, and a policy lookup.

ε-Greedy vs Cycle Detection
These techniques both introduce exploration, but work differently:

| Aspect | ε-Greedy | Cycle Detection |
|---|---|---|
| Trigger | Random (every step) | Conditional (when cycle detected) |
| Frequency | Constant (ε × steps) | Variable (only when stuck) |
| Randomness | Fully random action | Random alternative action |
| Memory | None | Requires state history |
| Phase | Training only | Training or deployment |
| Purpose | General exploration | Escape specific cycles |
Key Difference
ε-greedy explores proactively whether needed or not, while cycle detection reacts to detected loops.

When ε-Greedy Would Help the Demo
If the demo used ε-greedy with ε=0.1 instead of cycle detection:
- Agent reaches position (1,2)
- Policy says: move left
- 90% chance: Follows policy → moves left → hits wall → stuck (repeats)
- 10% chance: Explores randomly → might try down → reaches goal!
Why Cycle Detection is Better for the Demo
Cycle detection is deterministic and targeted:
- Agent reaches (1,2) and tries left → stuck
- Agent returns to (1,2) second time → cycle detected
- Forces exploration excluding left → guaranteed to try something else
- Reaches goal in same episode
Choosing the Right Epsilon
The value of ε controls the exploration-exploitation trade-off.

Common Epsilon Values
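Typical values, as rough rules of thumb (my own summary, not from a specific benchmark):

```python
# Common ε choices and when they make sense (rules of thumb, not hard rules)
EPSILON_PRESETS = {
    0.01: "near-greedy: policy is trusted, keep a trickle of exploration",
    0.05: "light exploration: fine-tuning an already-decent policy",
    0.10: "balanced default: the most common starting point",
    0.30: "heavy exploration: early training or deceptive environments",
    1.00: "pure random: no useful policy yet, uniform exploration",
}
```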
Epsilon Decay
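A common schedule multiplies ε by a constant factor after each episode, with a floor so exploration never stops entirely. A sketch (the constants here are illustrative, not tuned values):

```python
def decay_epsilon(epsilon, decay_rate=0.995, min_epsilon=0.01):
    """Multiplicative decay with a floor so some exploration always remains."""
    return max(min_epsilon, epsilon * decay_rate)

# Start fully exploratory, end near-greedy:
epsilon = 1.0
for episode in range(1000):
    # ... run one training episode using the current epsilon ...
    epsilon = decay_epsilon(epsilon)
```

After 1000 episodes with these constants, ε has decayed to the 0.01 floor.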
In practice, you often want high exploration early (when the policy is bad) and low exploration later (when the policy is good). This is called epsilon decay.

Pros and Cons
✅ Advantages
- Dead simple: 3-line implementation
- No memory overhead: Stateless randomness
- Universal: Works with any policy, any environment
- Tunable: Adjust ε to control exploration amount
- Prevents deterministic cycles: Adds stochasticity
- Proven: Used in countless RL algorithms
❌ Disadvantages
- Blind exploration: Random actions may be terrible
- Wastes actions: Explores even when policy is good
- No escape guarantee: Might not randomly find solution
- Training only: Usually disabled in deployment
- Inefficient: Explores states that are well-understood
When to Use ε-Greedy
✅ Good Fit
- Training phase: When learning a policy from scratch
- Small action spaces: Random actions are reasonable
- Simple exploration: Don’t need sophisticated strategies
- Continuous learning: Agent keeps improving over time
- Benchmark algorithms: DQN, SARSA, Q-learning all use it
❌ Not Ideal
- Deployment: Usually want deterministic behavior
- Large action spaces: Random actions very unlikely to help
- Safety-critical: Can’t afford random actions
- Sparse rewards: Random exploration too inefficient
- Cycle escape: Cycle detection is more targeted
Combining with Other Techniques
You can use ε-greedy alongside other safeguards. This combination gives you targeted escape (cycle detection) plus general exploration (ε-greedy), with guaranteed termination (max steps) as your final safeguard.
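A sketch of how the layers could compose in one action-selection step. The cycle-detection logic here is simplified to a visit count, and all names are my own:

```python
import random

def step_with_safeguards(policy, state, history, actions, epsilon=0.1):
    """Layered action choice: ε-greedy first, then cycle detection."""
    if random.random() < epsilon:              # layer 1: general exploration
        return random.choice(actions)
    action = policy[state]
    if history.count(state) >= 2:              # layer 2: cycle detected
        alternatives = [a for a in actions if a != action]
        return random.choice(alternatives)     # force a different action
    return action                              # normal exploitation

# layer 3: guaranteed termination lives in the caller's loop:
# for step in range(MAX_STEPS): ...
```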
Enhanced Implementation with Logging
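One way to instrument the selection step with Python's standard `logging` module. This is a sketch; the message format is my own:

```python
import logging
import random

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("epsilon_greedy")

def epsilon_greedy_logged(policy_action, actions, epsilon=0.1):
    """ε-greedy selection that logs every explore/exploit decision."""
    roll = random.random()
    if roll < epsilon:
        action = random.choice(actions)
        log.info("EXPLORE roll=%.3f < eps=%.2f -> random action %r",
                 roll, epsilon, action)
    else:
        action = policy_action
        log.info("EXPLOIT roll=%.3f >= eps=%.2f -> policy action %r",
                 roll, epsilon, action)
    return action
```

Logging each decision makes it easy to verify that exploration happens at roughly the expected rate.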
Real-World Example: DQN
Deep Q-Network (DQN), the famous algorithm that learned to play Atari games, uses ε-greedy exploration. Its success demonstrated that ε-greedy, despite its simplicity, is sufficient for learning complex behaviors when combined with function approximation (neural networks).
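The original DQN setup annealed ε linearly from 1.0 to 0.1 over the first million frames and held it fixed afterwards. That schedule is a few lines to reproduce:

```python
def dqn_epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linear annealing of ε over the first anneal_frames, then constant."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * (frame / anneal_frames)
```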
Interactive Demo Comparison
The RL Cycle Demo doesn’t use ε-greedy — it uses cycle detection instead. But you can imagine how ε-greedy would perform:

With ε-Greedy
- Agent might escape loop randomly
- Requires multiple attempts/episodes
- Success depends on ε value
- No memory overhead
With Cycle Detection
- Agent escapes loop deterministically
- Succeeds in single episode
- Guaranteed to try alternative
- Requires state history
Key Takeaways
- ε-greedy adds controlled randomness to prevent deterministic cycles
- Takes random action with probability ε, follows policy with probability 1-ε
- Common ε values: 0.1 (10%) for balanced exploration
- Use epsilon decay to shift from exploration to exploitation over time
- Best for training phase, often disabled in deployment
- Simpler than cycle detection but less targeted
- Core component of classic RL algorithms (DQN, SARSA, Q-learning)
Explore the Demo
See how targeted cycle detection compares to what random ε-greedy exploration would achieve