# HIL-SERL
HIL-SERL (Human-in-the-Loop Sample-Efficient Reinforcement Learning) is a state-of-the-art reinforcement learning algorithm designed for training robot policies on real hardware with minimal human demonstrations and interventions.

## Overview
HIL-SERL represents a breakthrough in real-world robot learning by combining the best of imitation learning and reinforcement learning. Unlike traditional RL methods that require thousands of episodes, HIL-SERL achieves near-perfect task success within a few hours of training on real robots.

## Key Features
- Sample Efficient: Train policies with as few as 15 human demonstrations
- Human-in-the-Loop: Humans can intervene during training to guide exploration and correct unsafe behaviors
- Real Robot Training: Designed specifically for real-world hardware, not just simulation
- Actor-Learner Architecture: Distributed training separates robot interaction from policy updates
- Safety-First: Built-in workspace bounds, joint limits, and human oversight
## Architecture
HIL-SERL combines three key components:

### 1. Offline Demonstrations & Reward Classifier
The system starts with a small set of human teleoperation demonstrations and trains a vision-based reward classifier. This gives the policy a shaped starting point and enables automated success detection.

### 2. Distributed Actor-Learner with SAC
HIL-SERL uses a distributed Soft Actor-Critic (SAC) architecture:

- Learner Process: Runs on the GPU, performs gradient updates on the policy
- Actor Process: Runs on the robot, executes policy and collects experience
- Communication: gRPC protocol for efficient policy parameter updates
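The division of labor above can be sketched in miniature. The real system exchanges transitions and weights over gRPC; in the toy sketch below, two in-process queues stand in for those RPC streams, and all names and update rules are illustrative, not LeRobot's API:

```python
import queue
import threading

# Toy stand-in for the HIL-SERL actor-learner split. The real
# implementation exchanges data over gRPC; here, in-process queues
# play the role of the two RPC streams.

transition_q = queue.Queue()   # actor -> learner: collected experience
params_q = queue.Queue()       # learner -> actor: updated policy weights

def learner(num_updates):
    weights = 0.0  # stand-in for policy parameters
    for _ in range(num_updates):
        transition = transition_q.get()        # consume experience
        weights += 0.1 * transition["reward"]  # stand-in for a gradient step
        params_q.put(weights)                  # push fresh weights to the actor

def actor(num_steps):
    weights = 0.0
    for step in range(num_steps):
        # "Execute the policy" and send the transition to the learner.
        transition_q.put({"obs": step, "action": weights, "reward": 1.0})
        weights = params_q.get()               # receive updated parameters
    return weights

t = threading.Thread(target=learner, args=(5,))
t.start()
final_weights = actor(5)
t.join()
print(round(final_weights, 1))  # → 0.5
```

The key property this preserves is that the actor never blocks on gradient computation: it only trades transitions for the freshest available weights.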
### 3. Human Interventions
During training, humans can take control at any time using a gamepad or leader arm. These interventions:

- Correct dangerous or unproductive behaviors
- Guide exploration towards promising regions
- Provide implicit reward signals
- Improve sample efficiency dramatically
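Conceptually, intervention handling is an override of the policy's action, with the transition flagged so the learner knows a human produced it. A minimal sketch (a hypothetical helper, not LeRobot's actual intervention processor):

```python
# Sketch of how a human intervention can override the policy action while
# still being recorded in the transition. Names here are illustrative,
# not LeRobot's actual API.

def select_action(policy_action, teleop_action, intervention_active):
    """Return the action to execute plus a flag stored with the transition."""
    if intervention_active:
        # Human takes over: execute the teleop action and mark the
        # transition so the learner can treat it as expert data.
        return teleop_action, True
    return policy_action, False

action, is_intervention = select_action([0.1, 0.0], [0.5, -0.2], True)
print(action, is_intervention)  # the teleop action wins during intervention
```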
## Installation
To use HIL-SERL, install LeRobot with the HIL-SERL extras (from a source checkout, typically `pip install -e ".[hilserl]"`).

## Workflow Overview
The complete HIL-SERL workflow consists of several stages:

- Find Workspace Bounds: Use `lerobot-find-joint-limits` to determine safe operational bounds
- Collect Demonstrations: Record 10-20 human demonstrations of the task
- Process Dataset: Crop images to relevant regions of interest
- Train Reward Classifier (Optional): Train a vision-based success detector
- RL Training: Run distributed actor-learner training with human interventions
## Configuration
HIL-SERL uses nested configuration classes to organize environment and training settings:

- Control Mode: `gamepad`, `leader`, or `keyboard` for human control
- Inverse Kinematics: End-effector control with workspace bounds
- Image Processing: Crop and resize parameters for efficient visual learning
- Reward Classifier: Pretrained model path and success threshold
- Reset Configuration: Episode duration, reset positions, and timing
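To make the nesting concrete, here is a hedged sketch of such a configuration built from plain dataclasses. Field names follow the parameters described on this page; the actual LeRobot config classes differ in structure and detail:

```python
from dataclasses import dataclass, field

# Illustrative nested configuration, loosely mirroring the structure
# described above. Field names follow the parameters listed on this
# page; LeRobot's real config classes are not reproduced here.

@dataclass
class ImageProcessingConfig:
    crop_params_dict: dict = field(default_factory=dict)  # per-camera crops
    resize_size: tuple = (128, 128)                       # target image size

@dataclass
class EnvConfig:
    control_mode: str = "gamepad"   # "gamepad", "leader", or "keyboard"
    fps: int = 10                   # control frequency in Hz
    control_time_s: float = 20.0    # maximum episode duration
    reset_time_s: float = 5.0       # time to wait during reset
    image: ImageProcessingConfig = field(default_factory=ImageProcessingConfig)

cfg = EnvConfig()
print(cfg.fps, cfg.image.resize_size)  # → 10 (128, 128)
```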
## Training a Policy
### Step 1: Start the Learner
The learner process handles all gradient computation and policy updates:

- Initializes the SAC policy network
- Prepares replay buffers with offline demonstrations
- Opens a gRPC server to communicate with actors
- Performs policy updates based on collected transitions
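A key detail is how the offline demonstrations are mixed into updates. One common recipe (RLPD-style symmetric sampling, assumed here rather than taken from LeRobot's code) keeps two buffers and draws half of each batch from each:

```python
import random

# Sketch of the learner's two replay buffers: one seeded with offline
# demonstrations, one filled online by the actor. Drawing half of each
# batch from each buffer is one common choice; treat the exact ratio
# as an assumption, not LeRobot's fixed behavior.

demo_buffer = [{"reward": 1.0, "source": "demo"} for _ in range(20)]
online_buffer = [{"reward": 0.0, "source": "online"} for _ in range(100)]

def sample_batch(batch_size, rng):
    half = batch_size // 2
    batch = rng.sample(demo_buffer, half) + rng.sample(online_buffer, half)
    rng.shuffle(batch)  # mix demo and online transitions within the batch
    return batch

rng = random.Random(0)
batch = sample_batch(8, rng)
print(sum(t["source"] == "demo" for t in batch))  # → 4
```

Seeding the replay buffer with demonstrations is what lets the critic see successful outcomes from the very first gradient step.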
### Step 2: Start the Actor
In a separate terminal, start the actor process:

- Connects to the learner via gRPC
- Initializes the robot environment
- Executes policy rollouts to collect experience
- Sends transitions to the learner
- Receives updated policy parameters
### Step 3: Provide Human Interventions
During training:

- Use the gamepad’s upper right trigger (or spacebar) to take control
- Guide the robot when it’s stuck or behaving incorrectly
- Allow the policy to explore on its own most of the time
- Gradually reduce interventions as the policy improves
## Processor Pipeline
HIL-SERL uses a modular processor pipeline to handle observations and actions.

### Environment Processor Steps
- `VanillaObservationProcessorStep`: Standardizes robot observations
- `JointVelocityProcessorStep`: Adds joint velocity information (optional)
- `MotorCurrentProcessorStep`: Adds motor current readings (optional)
- `ForwardKinematicsJointsToEE`: Computes end-effector pose (optional)
- `ImageCropResizeProcessorStep`: Crops and resizes camera images
- `TimeLimitProcessorStep`: Enforces episode time limits
- `GripperPenaltyProcessorStep`: Applies gripper usage penalties (optional)
- `RewardClassifierProcessorStep`: Automated reward detection (optional)
- `AddBatchDimensionProcessorStep`: Prepares data for neural networks
- `DeviceProcessorStep`: Moves data to GPU/CPU
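The pipeline idea itself is simple: each step is a callable that maps an observation dict to a new observation dict, applied in order. A toy sketch (the step bodies are stand-ins, not LeRobot's implementations):

```python
# Minimal sketch of a processor pipeline: each step transforms an
# observation dict and the pipeline applies the steps in order.
# Step names echo the list above; the bodies are toy stand-ins.

def crop_resize(obs):
    obs = dict(obs)
    obs["image"] = obs["image"][:2]  # stand-in for a real crop/resize
    return obs

def add_batch_dimension(obs):
    # Wrap every value in a length-1 list, mimicking a batch axis.
    return {k: [v] for k, v in obs.items()}

class ProcessorPipeline:
    def __init__(self, steps):
        self.steps = steps

    def __call__(self, obs):
        for step in self.steps:  # apply each processor step in order
            obs = step(obs)
        return obs

pipeline = ProcessorPipeline([crop_resize, add_batch_dimension])
out = pipeline({"image": [1, 2, 3, 4], "joints": [0.0, 0.5]})
print(out["image"])  # → [[1, 2]]
```

Because every step shares the same dict-in, dict-out contract, optional steps (velocities, motor currents, reward classification) can be inserted or removed without touching the rest of the pipeline.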
### Action Processor Steps
- `AddTeleopActionAsComplimentaryDataStep`: Logs teleoperator actions
- `AddTeleopEventsAsInfoStep`: Records intervention events
- `InterventionActionProcessorStep`: Handles human interventions
- Inverse Kinematics Pipeline: Converts end-effector commands to joint targets
## Training the Reward Classifier
The reward classifier is an optional but powerful component that automates success detection:

- Uses a pretrained vision model (e.g., ResNet-10)
- Predicts binary success/failure from camera images
- Provides automated rewards during RL training
- Enables automatic episode termination on success
### Reward Classifier Configuration
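In essence, the classifier configuration needs a pretrained model path and a success threshold, and inference reduces to thresholding the predicted success probability. A hedged sketch with illustrative field names (not LeRobot's exact config schema):

```python
from dataclasses import dataclass

# Illustrative reward-classifier configuration and inference logic.
# Field names and the threshold value are assumptions for this sketch.

@dataclass
class RewardClassifierConfig:
    pretrained_path: str = "path/to/reward_classifier"  # placeholder path
    success_threshold: float = 0.7  # probability above which reward = 1

def classify(success_prob, cfg):
    """Turn the classifier's predicted success probability into a sparse reward."""
    return 1.0 if success_prob >= cfg.success_threshold else 0.0

cfg = RewardClassifierConfig()
print(classify(0.9, cfg), classify(0.3, cfg))  # → 1.0 0.0
```

The same thresholded signal that produces the reward can also trigger automatic episode termination on success.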
## Key Hyperparameters
Critical parameters that significantly impact training:

### Policy Parameters (SAC)
- `temperature_init` (default: `1e-2`): Controls exploration. Higher values encourage more exploration; lower values make the policy more deterministic
- `storage_device` (default: `"cpu"`): Set to `"cuda"` if you have spare GPU memory for faster training
- `policy_parameters_push_frequency` (default: 4 s): How often to send updated weights from learner to actor. Decrease to 1-2 s for fresher weights
### Environment Parameters
- `fps` (default: `10`): Control frequency in Hz
- `control_time_s` (default: `20.0`): Maximum episode duration in seconds
- `reset_time_s` (default: `5.0`): Time to wait during reset
### Image Processing
- `crop_params_dict`: Crop parameters for each camera (determined using `crop_dataset_roi.py`)
- `resize_size`: Target image size, typically `[128, 128]` or `[64, 64]`
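To illustrate what these parameters do, here is a toy crop-then-resize on a nested list, assuming a `(top, left, height, width)` crop convention and nearest-neighbor resizing (both assumptions; LeRobot's actual processing uses proper image operations):

```python
# Toy illustration of the crop-then-resize preprocessing configured by
# crop_params_dict and resize_size. Crop params follow an assumed
# (top, left, height, width) convention; resizing is nearest-neighbor.

def crop(image, top, left, height, width):
    return [row[left:left + width] for row in image[top:top + height]]

def resize_nearest(image, out_h, out_w):
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

image = [[r * 10 + c for c in range(6)] for r in range(6)]  # 6x6 "camera image"
patch = crop(image, top=1, left=2, height=4, width=4)       # 4x4 region of interest
small = resize_nearest(patch, 2, 2)                         # downsample to 2x2
print(small)  # → [[12, 14], [32, 34]]
```

Cropping first means the network never spends capacity on irrelevant background, which is why tight crop parameters matter for sample efficiency.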
## Tips for Successful Training
### Intervention Strategy
- Early Training: Allow the policy to explore for the first few episodes
- Middle Training: Intervene to correct dangerous or unproductive behaviors
- Late Training: Provide minimal interventions, only for critical corrections
- Goal: Intervention rate should decrease over time as the policy improves
### Data Quality
- Collect demonstrations that cover the full workspace
- Ensure demonstrations are relatively consistent
- Crop images to exclude irrelevant background
- Use workspace bounds to limit exploration space
### Monitoring Training
Enable Weights & Biases to monitor:

- Intervention rate: Should decrease over time
- Episode reward: Should increase over time
- Success rate: Should approach 100%
- Q-values: Should stabilize after initial volatility
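The intervention rate is simply the fraction of recent steps on which the human had control. An illustrative helper (not LeRobot code) for computing the scalar you would log:

```python
from collections import deque

# Running intervention-rate monitor: the fraction of recent steps where
# the human had control. This is the kind of scalar logged to Weights &
# Biases each episode; the class itself is an illustrative helper.

class InterventionRateMonitor:
    def __init__(self, window=100):
        self.flags = deque(maxlen=window)  # recent is_intervention flags

    def update(self, is_intervention):
        self.flags.append(bool(is_intervention))

    def rate(self):
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

monitor = InterventionRateMonitor(window=10)
for step in range(10):
    monitor.update(step < 3)  # human intervened on the first 3 steps
print(monitor.rate())  # → 0.3
```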
## Example Configuration Files
Complete example configurations are available in the LeRobot repository.

## Supported Robots
HIL-SERL has been successfully used with:

- SO100/SO101: Compact robotic arms for desktop manipulation
- Bimanual setups: Two-arm systems for complex tasks
- Custom robots: Any robot with a follower arm and cameras can be used
## Performance Results
HIL-SERL achieves:

- Near-perfect success rates (>95%) on manipulation tasks
- Faster learning than imitation-only baselines
- Sample efficiency: 2-4 hours of real robot time for complex tasks
- Improved cycle times: Faster task execution than human demonstrations
## Citation
## See Also
- Imitation Learning - Learn about IL approaches
- Processor Concepts - Understanding data processing
- Reinforcement Learning Tutorial - RL guide