Overview
In RL, a policy learns to select actions that maximize expected future rewards through repeated interaction with the environment. This approach is particularly valuable when:

- Expert demonstrations are difficult or expensive to collect
- The optimal solution is unknown
- The task requires exploration and discovery
- You want policies that can adapt and improve beyond human performance
How It Works
The RL Loop
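The loop is simple: the policy observes the environment, selects an action, receives a reward, and the cycle repeats. Below is a minimal self-contained sketch in plain Python; the environment and policy are toy stand-ins for illustration, not LeRobot classes.

```python
import random

class ToyEnv:
    """Toy 1-D environment: the agent moves its position toward a goal."""
    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        self.pos += action                   # apply the action
        reward = -abs(self.goal - self.pos)  # reward: negative distance to goal
        done = self.pos == self.goal
        return self.pos, reward, done        # next observation, reward, done flag

def random_policy(obs):
    """Placeholder policy: a trained policy would map observations to actions."""
    return random.choice([-1, 1])

# The core RL loop: observe, act, receive reward, repeat.
env = ToyEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(100):
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

A learning algorithm would additionally store each `(obs, action, reward, next_obs)` transition and use it to improve the policy.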
Key Components
- Policy: Neural network that maps observations to actions
- Reward Function: Scalar signal indicating action quality
- Replay Buffer: Stores past experiences for learning
- Value Function: Estimates expected future rewards

Supported Algorithms
SAC (Soft Actor-Critic)
SAC is an off-policy actor-critic algorithm that maximizes both reward and entropy, encouraging exploration:

- Stable training through soft updates
- Maximum entropy objective for exploration
- Off-policy learning from replay buffer
- Works well with continuous action spaces
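The "soft updates" above refer to Polyak averaging: target-network weights track the online network via an exponential moving average rather than a hard copy. A minimal sketch with plain floats standing in for network parameters (the tau here is exaggerated for illustration; SAC typically uses a much smaller value such as 0.005):

```python
def soft_update(online_params, target_params, tau=0.005):
    # Interpolate each target parameter toward its online counterpart.
    return [tau * o + (1 - tau) * t for o, t in zip(online_params, target_params)]

online = [1.0, 2.0, 3.0]   # stand-ins for online-network weights
target = [0.0, 0.0, 0.0]   # stand-ins for target-network weights

for _ in range(3):  # each update nudges the target toward the online weights
    target = soft_update(online, target, tau=0.5)
```

Because the target network changes slowly, the value targets used for bootstrapping stay stable during training.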
TDMPC (Temporal Difference Model Predictive Control)
TDMPC combines model-based RL with model predictive control:

- Learns world model for planning
- Sample efficient compared to model-free RL
- Uses trajectory optimization
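The planning step can be sketched as trajectory optimization over a model: score candidate action sequences by their imagined return, then execute only the first action (the MPC pattern). In this toy sketch the dynamics are a known function and the tiny candidate set is enumerated exhaustively; TDMPC instead learns the world model and samples candidate sequences.

```python
import itertools

def model(state, action):
    # Toy deterministic dynamics standing in for a learned world model.
    next_state = state + action
    reward = -abs(next_state)  # objective: drive the state toward 0
    return next_state, reward

def plan(state, horizon=4, actions=(-1, 0, 1)):
    # Score each candidate action sequence by its imagined return under the
    # model, and return only the first action of the best sequence (MPC).
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        s, ret = state, 0.0
        for a in seq:            # imagined rollout through the model
            s, r = model(s, a)
            ret += r
        if ret > best_return:
            best_return, best_first = ret, seq[0]
    return best_first

action = plan(state=3)  # from state 3, planning steps toward 0
```

Replanning at every step lets the controller correct for model errors, which is part of why model-based methods can be sample efficient.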
HIL-SERL (Human-in-the-Loop SERL)
HIL-SERL combines RL with human interventions for safe, efficient real-world learning:

- Human interventions guide safe exploration
- Combines offline demos with online RL
- Reduces training time by 10x
- Safe for real robots
Reward Design
The reward function is critical for RL success. LeRobot supports several approaches.

Hand-Crafted Rewards
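A minimal hand-crafted reward sketch for a reaching task, combining a dense distance term with a sparse success bonus. Function and parameter names are illustrative, not a LeRobot API.

```python
import math

def reaching_reward(ee_pos, target_pos, success_radius=0.02):
    """Hypothetical reward: negative distance to target, plus a success bonus."""
    dist = math.dist(ee_pos, target_pos)  # Euclidean distance
    reward = -dist                        # dense shaping term guides learning
    if dist < success_radius:
        reward += 1.0                     # sparse bonus when the target is reached
    return reward
```

The dense term gives the agent gradient-like feedback on every step, while the sparse bonus marks actual task success.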
Learned Reward Models
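A hypothetical sketch of the idea: a tiny logistic-regression classifier is trained on toy 1-D features with human-provided success labels, and its predicted success probability is then used as the reward. Real reward classifiers operate on camera images with learned encoders; everything below is illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

features = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]  # toy observation features
labels   = [0,   0,   0,   1,   1,   1  ]  # 1 = observation shows task success

# Train a one-parameter logistic classifier with plain stochastic gradient descent.
w, b, lr = 0.0, 0.0, 1.0
for _ in range(500):
    for x, y in zip(features, labels):
        p = sigmoid(w * x + b)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

def learned_reward(x):
    # The classifier's success probability serves as the reward signal.
    return sigmoid(w * x + b)
```

Once trained, the classifier replaces a hand-written reward function: observations that look like success get high reward, others get low reward.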
Train a classifier to predict task success from observations and use its output as the reward signal.

Human Feedback
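One simplified way to turn interventions into reward labels: penalize the steps where a human had to take over, since those steps suggest the policy was misbehaving. This is only a sketch of the idea, not how HIL-SERL is implemented in LeRobot.

```python
def label_transitions(transitions):
    """Assign implicit rewards from human-intervention flags (illustrative)."""
    labeled = []
    for t in transitions:
        # Penalize the step that triggered an intervention; other steps keep
        # a placeholder task reward of 0.
        reward = -1.0 if t["intervened"] else 0.0
        labeled.append({**t, "reward": reward})
    return labeled

episode = [
    {"action": "a0", "intervened": False},
    {"action": "a1", "intervened": True},   # human took over here
    {"action": "a2", "intervened": False},
]
labeled = label_transitions(episode)
```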
Use human interventions as implicit rewards in HIL-SERL.

Key Concepts
Exploration vs Exploitation
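A common illustration of this trade-off is epsilon-greedy action selection: with probability epsilon take a random action (explore), otherwise take the best-known action (exploit). SAC itself explores through its entropy objective rather than epsilon-greedy, but the tension is the same.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

q = [0.1, 0.9, 0.3]                               # estimated action values
greedy_action = epsilon_greedy(q, epsilon=0.0)    # epsilon=0 always exploits
```

In practice epsilon is often decayed over training: explore heavily early on, then exploit as value estimates improve.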
RL agents must balance exploring new behaviors with exploiting known good actions.

Replay Buffer
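A minimal replay buffer sketch: a bounded FIFO of transitions with uniform random sampling for minibatches. This is illustrative only, not LeRobot's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random minibatch; sampling breaks temporal correlations
        # between consecutive steps, which stabilizes learning.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for step in range(50):
    buffer.add({"obs": step, "action": step % 2, "reward": 0.0})
batch = buffer.sample(8)
```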
Store and reuse past experiences for stable learning.

Off-Policy vs On-Policy
Off-policy (SAC, TDMPC): Learn from any past experience
- More sample efficient
- Can reuse old data
- Requires replay buffer

On-policy: Learn only from experience collected by the current policy
- More stable
- Simpler implementation
- Cannot reuse old data
Combining RL with Imitation Learning
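A sketch of the bootstrapping idea: pre-fill the replay buffer with demonstration transitions so that early off-policy updates already learn from expert data, then mix in online experience as it arrives. All names below are illustrative.

```python
import random

# Hypothetical transitions; real ones would come from a recorded dataset
# and from live environment interaction.
demo_transitions = [{"obs": i, "action": 1, "reward": 1.0, "source": "demo"}
                    for i in range(100)]
online_transitions = [{"obs": i, "action": 0, "reward": 0.0, "source": "online"}
                      for i in range(20)]

replay_buffer = list(demo_transitions)     # seed the buffer with expert data
replay_buffer.extend(online_transitions)   # then add online experience

# Off-policy updates sample from both sources in the same minibatch.
batch = random.sample(replay_buffer, 16)
demo_fraction = sum(t["source"] == "demo" for t in batch) / len(batch)
```

Because off-policy algorithms can learn from any transition, the demonstrations keep providing useful gradient signal long after online training begins.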
Bootstrap RL training with demonstrations for faster learning.

Advantages
- No Expert Required: Learns from environment feedback
- Discovers Solutions: Can find strategies humans might not consider
- Adaptive: Continues improving with more experience
- Optimal: Can exceed human performance
Limitations
- Sample Inefficient: Requires many environment interactions
- Reward Engineering: Designing good reward functions is challenging
- Unstable: Training can be sensitive to hyperparameters
- Safety: Random exploration can be dangerous on real robots
- Sim-to-Real Gap: Policies trained in simulation may not transfer
Best Practices
Pre-train on demonstrations, then fine-tune with online RL:

```shell
# Pre-train on demos
lerobot-train --policy.type=sac --dataset.repo_id=demos

# Fine-tune with RL
lerobot-train \
  --policy.type=sac \
  --policy.pretrained_path=outputs/checkpoint \
  --use_online_training=true
```
Next Steps
- HIL-SERL Guide - Learn human-in-the-loop RL
- TDMPC Guide - Model-based RL
- Train Your First Policy - Hands-on training
- Imitation Learning - Compare with IL approaches