Overview
VQ-BeT (Vector-Quantized Behavior Transformer) is a policy architecture that learns to generate robot behaviors by predicting sequences of discrete action tokens in a learned latent space. It uses a two-stage training process: first learning a residual VQ-VAE to encode action sequences, then training a GPT-style transformer to predict these discrete codes. The policy was introduced in Behavior Generation with Latent Actions and achieves strong performance on manipulation tasks while being more sample-efficient than comparable approaches.

Key Features
- Two-Stage Training: Separate VQ-VAE and GPT training phases
- Residual Vector Quantization: Multi-layer VQ for expressive action encoding
- Action Chunking with Tokens: Predicts multiple action tokens, each representing a chunk of actions
- Discrete Latent Space: Enables stable training and efficient exploration
- Offset Prediction: Fine-tunes discrete predictions with continuous offsets
- GPT Architecture: Autoregressive transformer for action token prediction
Architecture
VQ-BeT consists of two main components:

1. Residual VQ-VAE (Stage 1)
- Encoder: Compresses action sequences into latent representations
- Residual VQ Layer: Multi-layer vector quantization with two codebooks (primary and secondary)
- Decoder: Reconstructs action sequences from quantized codes
- Purpose: Learn a discrete latent action space
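The residual quantization step can be illustrated with a toy sketch in plain Python (the codebooks and values below are illustrative, not the actual LeRobot implementation): the primary codebook quantizes the encoder output, and the secondary codebook quantizes whatever residual the first layer left over.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2)."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, vec)) for code in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def residual_quantize(primary, secondary, latent):
    """Two-layer residual VQ: the secondary codebook encodes the
    residual left after the primary quantization."""
    i = nearest(primary, latent)
    residual = [l - c for l, c in zip(latent, primary[i])]
    j = nearest(secondary, residual)
    quantized = [a + b for a, b in zip(primary[i], secondary[j])]
    return (i, j), quantized

# Toy 2-D codebooks (hypothetical values for illustration).
primary = [[0.0, 0.0], [1.0, 1.0]]
secondary = [[0.0, 0.0], [0.1, -0.1]]

codes, q = residual_quantize(primary, secondary, [1.1, 0.9])
# codes is (1, 1): primary entry 1 plus secondary entry 1
# reconstructs [1.1, 0.9] closely in this toy case.
```

Because the second layer only has to cover small residuals, the two codebooks together represent the latent far more precisely than a single codebook of the same size.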
2. VQ-BeT Policy (Stage 2)
- Vision Encoder: ResNet backbone with spatial softmax for image features
- State Projector: Projects proprioceptive state to GPT input dimension
- GPT Model: Autoregressive transformer that predicts action tokens
- Prediction Heads:
- Primary code prediction (first VQ layer)
- Secondary code prediction (second VQ layer)
- Offset prediction (continuous refinement)
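At inference the three heads are combined: the two code embeddings are summed, decoded, and refined by the continuous offset. A toy sketch with the VQ-VAE decoder stubbed as the identity (codebooks and values are illustrative, not real model weights):

```python
def decode_action(primary_book, secondary_book, i, j, offset):
    """Combine the three head outputs into a continuous action.
    The VQ-VAE decoder is stubbed as the identity for brevity; in
    the real model the summed embeddings pass through the learned
    decoder before the offset is added."""
    latent = [a + b for a, b in zip(primary_book[i], secondary_book[j])]
    return [l + o for l, o in zip(latent, offset)]

# Toy codebooks (hypothetical values for illustration).
primary_book = [[0.0, 0.0], [1.0, 1.0]]
secondary_book = [[0.0, 0.0], [0.1, -0.1]]

action = decode_action(primary_book, secondary_book, 1, 0, [0.05, -0.05])
```

The offset lets the policy make fine corrections that the finite codebooks alone could not express.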
Training
VQ-BeT requires two-stage training:

Stage 1: VQ-VAE Training
The VQ-VAE is trained for the first n_vqvae_training_steps optimization steps (default: 20,000).
Stage 2: Policy Training
After VQ-VAE training, the GPT model is trained to predict action codes.

Basic Training Command
Training with Custom Configuration
Python API Training Example
Configuration Parameters
Input/Output Structure
Number of observation steps to pass to the policy.
Total number of action tokens to predict (current + future).
Number of action steps represented by each action token.
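Together, the last two values determine the policy's effective prediction horizon: the number of action tokens times the actions per token. A quick sketch with hypothetical values:

```python
# Hypothetical configuration values for illustration.
n_action_pred_token = 3   # action tokens predicted (current + future)
action_chunk_size = 5     # continuous actions per token

# One forward pass covers this many continuous actions:
horizon = n_action_pred_token * action_chunk_size  # 3 * 5 = 15
```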
Vision Backbone
ResNet variant to use for image encoding.
(H, W) shape to crop images to. Must fit within image size.
Whether to use random crops during training (always center crop during eval).
Pretrained weights from torchvision. None means random initialization.
Replace batch normalization with group normalization in the backbone.
Number of keypoints for spatial softmax operation.
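The spatial softmax turns each feature-map channel into an expected 2-D keypoint location: a softmax over all spatial positions gives a distribution, and the keypoint is its mean coordinate. A minimal single-channel sketch in plain Python (the normalized [-1, 1] coordinate convention is an assumption of this sketch):

```python
import math

def spatial_softmax(feature_map):
    """Expected (x, y) coordinate of one feature channel.
    feature_map: 2-D list (H x W) of activations."""
    h, w = len(feature_map), len(feature_map[0])
    flat = [v for row in feature_map for v in row]
    m = max(flat)                                 # stabilize the softmax
    exps = [math.exp(v - m) for v in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    xs = [(2 * (i % w) / (w - 1)) - 1 for i in range(h * w)]
    ys = [(2 * (i // w) / (h - 1)) - 1 for i in range(h * w)]
    x = sum(p * c for p, c in zip(probs, xs))     # distribution mean
    y = sum(p * c for p, c in zip(probs, ys))
    return x, y

# A strong activation in the top-left corner pulls the keypoint
# toward (-1, -1).
fmap = [[9.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
kx, ky = spatial_softmax(fmap)
```

Each keypoint summarizes where a channel "looks", giving a compact, low-dimensional image feature for the GPT input.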
VQ-VAE Configuration
Number of optimization steps for training the Residual VQ-VAE (Stage 1).
Number of embedding vectors in each RVQ codebook layer.
Dimension of each embedding vector in the RVQ dictionary.
Hidden dimension size for VQ-VAE encoder/decoder.
Learning rate for VQ-VAE training (Stage 1).
Weight decay for VQ-VAE optimizer.
GPT Configuration
Maximum block size (context length) for the GPT model.
Input dimension for GPT (also used as observation feature dimension).
Output dimension for GPT (input to prediction heads).
Number of transformer layers in the GPT model.
Number of attention heads in the GPT model.
Hidden dimension size for GPT feed-forward layers.
Dropout rate for GPT model.
Loss Weights
Weight multiplier for the continuous offset prediction loss.
Weight multiplier for the primary VQ code prediction loss.
Weight multiplier for the secondary VQ code prediction loss.
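The total policy loss is the weighted sum of the three terms above. A minimal sketch (the default weight values shown are illustrative placeholders, not verified against the configuration file):

```python
def vqbet_policy_loss(primary_ce, secondary_ce, offset_l1,
                      primary_code_loss_weight=5.0,     # placeholder values,
                      secondary_code_loss_weight=0.5,   # not confirmed
                      offset_loss_weight=10000.0):      # defaults
    """Weighted sum of the two code-classification losses and the
    continuous offset regression loss."""
    return (primary_code_loss_weight * primary_ce
            + secondary_code_loss_weight * secondary_ce
            + offset_loss_weight * offset_l1)
```

A large offset weight compensates for the offset error typically being orders of magnitude smaller than the cross-entropy terms.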
Inference
Sampling temperature for code selection during inference. Lower = more deterministic.
Whether to select primary code first, then secondary (true), or sample both jointly (false).
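The effect of the sampling temperature can be sketched with a generic temperature-scaled softmax sampler (plain Python, not the LeRobot implementation):

```python
import math, random

def sample_with_temperature(logits, temperature=0.1, rng=None):
    """Sample a code index from logits scaled by a temperature.
    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more exploratory)."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # stabilize the softmax
    exps = [math.exp(s - m) for s in scaled]
    r, acc = rng.random() * sum(exps), 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1

logits = [2.0, 1.0, 0.1]
# At a very low temperature the argmax is chosen almost surely.
picks = {sample_with_temperature(logits, 0.01, random.Random(s))
         for s in range(20)}
```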
Optimization
Learning rate for policy training (Stage 2).
Beta parameters for Adam optimizer.
Epsilon value for Adam optimizer.
Weight decay for policy optimizer.
Number of warmup steps for learning rate scheduler.
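The warmup can be sketched as a linear ramp of the learning rate (holding it constant after warmup is an assumption of this sketch; the actual scheduler may decay it):

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
    """Linear learning-rate warmup: ramp from ~0 up to base_lr over
    warmup_steps optimization steps, then hold (assumed behavior).
    base_lr and warmup_steps are illustrative values."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Warmup avoids large, destabilizing updates while the freshly initialized GPT produces noisy gradients.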
Normalization
Normalization mode for each feature type. Note: VQ-BeT uses IDENTITY for visual features. Default:
{"VISUAL": "IDENTITY", "STATE": "MIN_MAX", "ACTION": "MIN_MAX"}

Usage Example
Loading a Pretrained Model
Inference with Temperature Control
Inference Loop with Observation Queue
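A framework-agnostic sketch of the rolling observation window: a fixed-length deque keeps only the most recent observations, and the policy acts once the window is full. The policy call is a stub standing in for VQ-BeT's action selection, and the n_obs_steps value is hypothetical:

```python
from collections import deque

n_obs_steps = 5  # hypothetical value; would come from the policy config

def select_action(obs_window):
    """Stand-in for the policy: a real VQ-BeT policy would consume
    the stacked observation window and return an action chunk."""
    return [sum(obs_window) / len(obs_window)]

obs_queue = deque(maxlen=n_obs_steps)   # stale observations fall off
for t in range(8):
    obs_queue.append(float(t))          # new observation from the env
    if len(obs_queue) == n_obs_steps:   # act once the window is full
        action = select_action(list(obs_queue))
```

Using `maxlen` means no manual eviction logic: appending the newest observation automatically discards the oldest once the window is full.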
Understanding VQ-BeT
Two-Stage Training Process
Stage 1 (Steps 0-20,000): Train VQ-VAE
- Learn to encode action sequences into discrete codes
- Reconstruct actions from quantized representations
- Build a discrete latent action space

Stage 2 (Steps 20,000+): Train Policy
- Freeze VQ-VAE weights
- Train GPT to predict action codes from observations
- Learn offset predictions for fine-grained control
Key Concepts
Action Tokens: Each token represents a chunk of action_chunk_size continuous actions encoded as two discrete codes (primary + secondary) plus a continuous offset.
Residual VQ: Uses multiple VQ layers where each layer encodes the residual from previous layers, enabling more expressive representations.
Autoregressive Prediction: The GPT model predicts action tokens one at a time, conditioned on previous tokens and observations.
Advantages
- Sample Efficiency: Discrete latent space enables more stable learning
- Multimodal: Can represent multiple valid behaviors
- Hierarchical: Two-level VQ captures both coarse and fine-grained actions
- Interpretable: Discrete codes provide insight into learned behaviors
Training Tips
- Ensure VQ-VAE is well-trained before policy training begins (monitor reconstruction loss)
- Adjust n_vqvae_training_steps if the VQ-VAE hasn't converged by step 20,000
- Use a larger vqvae_n_embed for more complex tasks
- Tune bet_softmax_temperature for the right exploration/exploitation balance
- Balance the three loss weights based on task requirements
File Locations
Source files in the LeRobot repository:
- Configuration: src/lerobot/policies/vqbet/configuration_vqbet.py
- Model: src/lerobot/policies/vqbet/modeling_vqbet.py
- Processor: src/lerobot/policies/vqbet/processor_vqbet.py
- Utilities: src/lerobot/policies/vqbet/vqbet_utils.py