Overview
Diffusion Policy is a state-of-the-art visuomotor policy that formulates robot action generation as a conditional diffusion process. It learns to denoise random action sequences into coherent behaviors, enabling it to capture multimodal action distributions and generate smooth, temporally consistent trajectories. The policy was introduced in Diffusion Policy: Visuomotor Policy Learning via Action Diffusion and has shown excellent performance across various manipulation tasks.
Key Features
- Diffusion-based Action Generation: Uses iterative denoising to generate high-quality action sequences
- Multimodal Learning: Naturally handles multiple valid solutions for a given observation
- Temporal Consistency: Predicts action horizons with receding horizon control
- Vision Backbone: ResNet with group normalization and spatial softmax
- Flexible Architecture: Configurable U-Net for diffusion modeling
- Multiple Schedulers: Support for DDPM and DDIM sampling
Architecture
The Diffusion Policy consists of:
- Vision Encoder: ResNet backbone with group normalization and spatial softmax for extracting visual features
- Observation Encoder: Processes both visual and proprioceptive state information
- U-Net Architecture: Conditional diffusion model with FiLM conditioning
  - Temporal downsampling stages (default: 512, 1024, 2048)
  - Diffusion timestep embedding
  - FiLM-based conditioning on observations
- Noise Scheduler: DDPM or DDIM scheduler for forward/reverse diffusion
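The FiLM conditioning mentioned above can be illustrated with a small NumPy sketch. All shapes and weight matrices here are toy placeholders; the real implementation lives in `modeling_diffusion.py`:

```python
import numpy as np

def film(features, cond_embedding, w_scale, w_bias):
    """FiLM: modulate conv features with a per-channel scale and bias
    predicted from the conditioning vector (observations + timestep)."""
    scale = cond_embedding @ w_scale  # (batch, channels)
    bias = cond_embedding @ w_bias    # (batch, channels)
    # Broadcast over the temporal dimension of (batch, channels, time).
    return features * scale[:, :, None] + bias[:, :, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 8, 16))  # (batch, channels, time)
cond = rng.normal(size=(2, 32))      # conditioning embedding
out = film(feats, cond, rng.normal(size=(32, 8)), rng.normal(size=(32, 8)))
print(out.shape)  # (2, 8, 16)
```

In the actual U-Net, each conv block applies this modulation, so the denoising of the action sequence is steered by the encoded observations at every stage.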
Training
Basic Training Command
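A minimal invocation might look like the following. This is a sketch: the flag names follow LeRobot's draccus-style CLI, and `lerobot/pusht` is an illustrative dataset repo id — substitute your own.

```shell
lerobot-train \
  --policy.type=diffusion \
  --dataset.repo_id=lerobot/pusht \
  --output_dir=outputs/train/diffusion_pusht
```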
Training with Custom Configuration
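Configuration parameters (documented below) can be overridden from the command line. A sketch, assuming the same draccus-style CLI; the specific values are illustrative:

```shell
lerobot-train \
  --policy.type=diffusion \
  --policy.horizon=16 \
  --policy.n_action_steps=8 \
  --policy.noise_scheduler_type=DDPM \
  --dataset.repo_id=lerobot/pusht \
  --output_dir=outputs/train/diffusion_pusht_custom
```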
Python API Training Example
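A training loop can also be written directly against the Python API. This is a hedged sketch: the import paths match the file locations listed at the end of this page, but `dataset_stats`, `dataloader`, and the exact `forward` return signature may differ across LeRobot versions.

```python
import torch
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

config = DiffusionConfig(n_obs_steps=2, horizon=16, n_action_steps=8)
# dataset_stats are normalization statistics computed from your dataset (placeholder).
policy = DiffusionPolicy(config, dataset_stats=dataset_stats)
policy.train()
optimizer = torch.optim.Adam(policy.parameters(), lr=config.optimizer_lr)

for batch in dataloader:  # batches of observations + action chunks (placeholder)
    loss, _ = policy.forward(batch)  # diffusion denoising loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```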
Configuration Parameters
Input/Output Structure
Number of observation steps to pass to the policy (current + historical observations).
Diffusion model action prediction horizon. Must be divisible by 2^(number of downsampling stages).
Number of action steps to execute per policy invocation (receding horizon control).
Number of last frames to skip during training to avoid excessive padding.
Vision Backbone
ResNet variant to use for image encoding.
(H, W) shape to resize images to. If None, uses original resolution.
Ratio for deriving crop size from resize_shape. Set to 1.0 to disable cropping.
(H, W) shape to crop images to. Computed automatically when resize_shape and crop_ratio are set.
Whether to use random crops during training (always center crop during eval).
Pretrained weights from torchvision. None means random initialization.
Replace batch normalization with group normalization in the backbone.
Number of keypoints for spatial softmax operation.
Whether to use separate RGB encoders for each camera view.
U-Net Architecture
Feature dimensions for each temporal downsampling stage in the U-Net.
Convolutional kernel size in the U-Net.
Number of groups for group normalization in U-Net conv blocks.
Embedding dimension for diffusion timestep conditioning.
Whether to use both scale and bias in FiLM conditioning (bias only if false).
Noise Scheduler
Type of noise scheduler to use. Options: "DDPM", "DDIM".
Number of diffusion steps for forward diffusion during training.
Beta schedule for diffusion process.
Starting beta value for diffusion schedule.
Ending beta value for diffusion schedule.
Type of prediction the U-Net makes. Options: "epsilon" (predict the noise), "sample" (predict the denoised sample directly).
Whether to clip samples during denoising.
Range for clipping samples: [-clip_sample_range, +clip_sample_range].
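The role of the beta schedule can be sketched without any library: forward diffusion mixes the clean action sequence with Gaussian noise according to the cumulative product of alphas. A toy sketch with a linear schedule (real training uses the configured scheduler):

```python
import numpy as np

num_train_timesteps = 100
beta = np.linspace(1e-4, 0.02, num_train_timesteps)  # linear beta schedule
alpha_bar = np.cumprod(1.0 - beta)                   # cumulative product of alphas

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 2))  # a toy "clean" action sequence (horizon x action_dim)
t = 50                         # a sampled diffusion timestep
eps = rng.normal(size=x0.shape)
# Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
# The U-Net is trained to recover eps (or x0, per prediction_type) from (xt, t, observations).
print(xt.shape)  # (16, 2)
```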
Inference
Number of denoising steps during inference. Defaults to num_train_timesteps if not set.
Optimization
Learning rate for the optimizer.
Beta parameters for Adam optimizer.
Epsilon value for Adam optimizer.
Weight decay for the optimizer.
Learning rate scheduler type.
Number of warmup steps for learning rate scheduler.
Whether to compile the model with torch.compile for faster training.
Normalization
Normalization mode for each feature type. Default:
{"VISUAL": "MEAN_STD", "STATE": "MIN_MAX", "ACTION": "MIN_MAX"}
Usage Example
Loading a Pretrained Model
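A sketch of loading a published checkpoint, assuming the import path listed under File Locations and the `lerobot/diffusion_pusht` checkpoint on the Hugging Face Hub:

```python
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

# Download weights and config from the Hub and build the policy.
policy = DiffusionPolicy.from_pretrained("lerobot/diffusion_pusht")
policy.eval()
```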
Fast Inference with DDIM
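A DDPM-trained model can be sampled with DDIM using far fewer denoising steps. A hedged sketch — the exact mechanism for overriding a loaded config may differ by LeRobot version:

```python
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

# Switch the scheduler to DDIM and use far fewer steps than num_train_timesteps.
config = DiffusionConfig(
    noise_scheduler_type="DDIM",
    num_inference_steps=10,
)
policy = DiffusionPolicy.from_pretrained("lerobot/diffusion_pusht", config=config)
policy.eval()
```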
Inference Loop with Observation Queue
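A sketch of such a loop. `env` is a placeholder for your robot or simulator interface; `select_action` internally maintains queues of the last `n_obs_steps` observations and the remaining actions of the current chunk, re-running the diffusion model only when the action queue empties:

```python
import torch
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("lerobot/diffusion_pusht")
policy.eval()
policy.reset()  # clear internal observation/action queues between episodes

obs = env.reset()  # placeholder env API
done = False
while not done:
    batch = {
        "observation.image": obs["image"].unsqueeze(0),  # add batch dim
        "observation.state": obs["state"].unsqueeze(0),
    }
    with torch.no_grad():
        action = policy.select_action(batch)
    obs, done = env.step(action.squeeze(0))
```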
Understanding Diffusion Policy
How It Works
- Training: The model learns to denoise Gaussian noise into valid action sequences, conditioned on observations
- Inference: Starting from random noise, the model iteratively denoises to generate smooth action trajectories
- Receding Horizon: Only the first `n_action_steps` of the predicted `horizon` are executed, then a new prediction is made
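The receding-horizon scheme above can be sketched in plain Python, with a toy `predict` function standing in for the diffusion model (both names are hypothetical helpers for illustration):

```python
from collections import deque

def run_receding_horizon(predict, total_steps, horizon=16, n_action_steps=8):
    """Execute only the first n_action_steps of each predicted chunk, then
    re-plan. `predict` maps a timestep to a list of `horizon` actions."""
    executed = []
    queue = deque()
    t = 0
    while len(executed) < total_steps:
        if not queue:
            chunk = predict(t)                    # predict `horizon` actions
            queue.extend(chunk[:n_action_steps])  # keep only the first n_action_steps
        executed.append(queue.popleft())
        t += 1
    return executed

# Toy predictor: action i of the chunk planned at time t is t + i.
actions = run_receding_horizon(lambda t: [t + i for i in range(16)], total_steps=20)
print(actions)  # [0, 1, 2, ..., 19]
```

Because each re-plan starts from the current timestep, the executed trajectory stays consistent even though only half of each predicted chunk is ever used.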
Key Advantages
- Multimodal: Can represent multiple valid action distributions
- Smooth Trajectories: Diffusion process naturally produces temporally coherent actions
- Flexible: Can be adapted to different action spaces and observation modalities
Training Tips
- Start with DDPM for training, then switch to DDIM for faster inference
- Increase `horizon` for tasks requiring longer-term planning
- Adjust `num_inference_steps` to trade off speed vs. quality
- Use `compile_model=true` with PyTorch 2.0+ for significant speedup
File Locations
Source files in the LeRobot repository:
- Configuration: src/lerobot/policies/diffusion/configuration_diffusion.py
- Model: src/lerobot/policies/diffusion/modeling_diffusion.py
- Processor: src/lerobot/policies/diffusion/processor_diffusion.py
- Examples: examples/tutorial/diffusion/