π₀-FAST (Pi0-FAST)
π₀-FAST is a Vision-Language-Action model for general robot control that uses autoregressive next-token prediction to model continuous robot actions.
Model Overview
π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called FAST (Frequency-space Action Sequence Tokenization). This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training up to 5x faster than diffusion-based approaches like π₀.
Why FAST?
Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control. FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively, just like language tokens.
How FAST Tokenization Works
The FAST tokenizer compresses action sequences through the following steps:
- Normalize: Take a continuous action chunk of shape (H, D), where H is the horizon and D is the action dimension. Normalize using one of the supported normalization methods (Quantiles recommended to handle outliers).
- Discrete Cosine Transform (DCT): Apply DCT (via scipy) to each action dimension separately. The DCT is a transform widely used for compression in image and audio codecs (JPEG, MP3).
- Quantization: Round and remove insignificant coefficients for each action dimension, producing a sparse frequency matrix.
- Flatten: Flatten the matrix into a 1D vector, with low-frequency components first.
- Byte Pair Encoding (BPE): Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving 10x compression over prior tokenization approaches.
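The DCT, quantization, and flatten steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LeRobot or OpenPI tokenizer. The names dct_ortho and fast_like_encode are hypothetical, the DCT is written in pure Python (rather than with scipy) to keep the sketch dependency-free, the frequency-major flattening order is an assumption, and the normalization and BPE stages are omitted. The scale default of 10.0 mirrors the tokenizer's documented DCT scaling factor.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a 1D sequence (pure-Python stand-in for a library DCT)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def fast_like_encode(chunk, scale=10.0):
    """Turn a normalized (H, D) action chunk into a flat list of integers.

    chunk: list of H rows, each a list of D normalized action values.
    """
    horizon, dims = len(chunk), len(chunk[0])
    # DCT: transform each action dimension (column) separately.
    coeffs = [dct_ortho([chunk[t][d] for t in range(horizon)]) for d in range(dims)]
    # Quantization: scale and round, so small coefficients collapse to 0.
    quantized = [[round(c * scale) for c in col] for col in coeffs]
    # Flatten (assumed order): lowest frequency of every dimension first.
    return [quantized[d][k] for k in range(horizon) for d in range(dims)]


# A constant chunk has all of its energy in the lowest-frequency coefficient.
tokens = fast_like_encode([[0.5, -0.5]] * 4)
print(tokens)  # [10, -10, 0, 0, 0, 0, 0, 0]
```

Smooth trajectories produce mostly zeros after rounding, and those long zero runs are what the BPE stage can then compress into a few tokens.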
Installation Requirements
- Install LeRobot by following our Installation Guide.
- Install π₀-FAST dependencies by running:
Training a Custom FAST Tokenizer
You have two options for the FAST tokenizer:
- Use the pre-trained tokenizer: The lerobot/fast-action-tokenizer tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
- Train your own tokenizer: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.
Training Your Own Tokenizer
Key Tokenizer Parameters
| Parameter | Description | Default |
|---|---|---|
| --repo_id | LeRobot dataset repository ID | Required |
| --action_horizon | Number of future actions in each chunk | 10 |
| --encoded_dims | Comma-separated dimension ranges to encode (e.g., "0:6,7:23") | "0:6,7:23" |
| --vocab_size | BPE vocabulary size | 1024 |
| --scale | DCT scaling factor for quantization | 10.0 |
| --normalization_mode | Normalization mode (MEAN_STD, MIN_MAX, QUANTILES, QUANTILE10, IDENTITY) | QUANTILES |
| --sample_fraction | Fraction of chunks to sample per episode | 0.1 |
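To make the --encoded_dims format concrete, the sketch below parses a spec such as "0:6,7:23" and selects the matching entries of an action vector. The helper names parse_encoded_dims and select_dims are hypothetical, and the end-exclusive (Python slice) interpretation of each range is an assumption; the source does not state whether the end index is included.

```python
def parse_encoded_dims(spec):
    """Parse a spec such as "0:6,7:23" into a list of (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split(":")
        ranges.append((int(start), int(end)))
    return ranges


def select_dims(action, ranges):
    """Keep only the action values covered by the ranges.

    ASSUMPTION: each range is end-exclusive, like a Python slice.
    """
    return [value for start, end in ranges for value in action[start:end]]


ranges = parse_encoded_dims("0:6,7:23")
selected = select_dims(list(range(24)), ranges)
print(ranges)         # [(0, 6), (7, 23)]
print(len(selected))  # 22
```

Under this reading, the default spec skips index 6 and anything from index 23 onward, so those dimensions would not be passed to the tokenizer.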
Usage
To use π₀-FAST in LeRobot, specify the policy type as:
Training
For training π₀-FAST, you can use the LeRobot training script:
Key Training Parameters
| Parameter | Description | Default |
|---|---|---|
| --policy.gradient_checkpointing=true | Reduces memory usage significantly during training | false |
| --policy.dtype=bfloat16 | Use mixed precision training for efficiency | float32 |
| --policy.chunk_size | Number of action steps to predict (action horizon) | 50 |
| --policy.n_action_steps | Number of action steps to execute | 50 |
| --policy.max_action_tokens | Maximum number of FAST tokens per action chunk | 256 |
| --policy.action_tokenizer_name | FAST tokenizer to use | lerobot/fast-action-tokenizer |
| --policy.compile_model=true | Enable torch.compile for faster training | false |
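The relationship between chunk_size (actions predicted per policy call) and n_action_steps (actions executed before predicting again) can be illustrated with a small control loop. This is a sketch of the general chunked-execution pattern, not LeRobot's implementation: run_chunked and predict_chunk are hypothetical names, and the requirement that n_action_steps not exceed chunk_size is an assumption of the sketch.

```python
def run_chunked(predict_chunk, total_steps, chunk_size=50, n_action_steps=50):
    """Execute total_steps actions, replanning after every n_action_steps.

    predict_chunk(step, chunk_size) is a hypothetical stand-in for the policy
    and must return a list of chunk_size actions.
    """
    if n_action_steps > chunk_size:
        raise ValueError("n_action_steps cannot exceed chunk_size")
    executed, policy_calls, step = [], 0, 0
    while step < total_steps:
        chunk = predict_chunk(step, chunk_size)
        policy_calls += 1
        # Execute only the first n_action_steps of the chunk, then replan.
        for action in chunk[:n_action_steps]:
            if step >= total_steps:
                break
            executed.append(action)
            step += 1
    return executed, policy_calls


def dummy_policy(step, size):
    # Placeholder policy: the "action" is just the timestep index.
    return [step + i for i in range(size)]


_, calls_full = run_chunked(dummy_policy, 100, chunk_size=50, n_action_steps=50)
_, calls_short = run_chunked(dummy_policy, 100, chunk_size=50, n_action_steps=10)
print(calls_full, calls_short)  # 2 10
```

A smaller n_action_steps replans more often, at the cost of more policy calls per episode.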
Inference
KV-Caching for Fast Inference
π₀-FAST supports KV-caching, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.
Inference Example
Model Architecture
π₀-FAST uses a PaliGemma-based architecture:
- Vision Encoder: SigLIP vision tower for image understanding
- Language Model: Gemma 2B for processing language instructions and predicting action tokens
Configuration Options
| Parameter | Description | Default |
|---|---|---|
| paligemma_variant | VLM backbone variant (gemma_300m, gemma_2b) | gemma_2b |
| max_state_dim | Maximum state vector dimension (padded) | 32 |
| max_action_dim | Maximum action vector dimension (padded) | 32 |
| temperature | Sampling temperature (0.0 for greedy) | 0.0 |
| max_decoding_steps | Maximum decoding steps | 256 |
| use_kv_cache | Enable KV caching for faster inference | true |
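The temperature option controls how the next action token is chosen from the model's logits. The sketch below shows the usual convention that a temperature of 0.0 means greedy (argmax) decoding, while positive values sample from a softmax over scaled logits. The function sample_token is hypothetical and only illustrates the idea; it is not the LeRobot decoding code.

```python
import math
import random


def sample_token(logits, temperature=0.0, rng=random):
    """Pick the next token id from raw logits."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(sample_token(logits))  # 1 (greedy picks the largest logit)
sampled = sample_token(logits, temperature=1.0, rng=random.Random(0))
```

Greedy decoding is deterministic, which is usually what you want for reproducible evaluation; a positive temperature adds diversity to the generated action tokens.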
Comparison with π₀
| Feature | π₀ | π₀-FAST |
|---|---|---|
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed | 1x (baseline) | Up to 5x faster |
| Dexterity | High | High |
| Inference Method | Iterative Denoising | Autoregressive Decoding |
| KV-Caching | N/A | Supported |
Reproducing π₀-FAST Results
We reproduce the results of π₀-FAST on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀-FAST base model lerobot/pi0fast-base and finetune it for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs, using the HuggingFace LIBERO dataset. The finetuned model can be found here:
- π₀-FAST LIBERO: lerobot/pi0fast-libero
We set n_action_steps=10, similar to the original OpenPI implementation.
Results
We obtain the following results on the LIBERO benchmark:
| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
|---|---|---|---|---|---|
| π₀-FAST | 70.0 | 100.0 | 100.0 | 60.0 | 82.5 |
License
This model follows the Apache 2.0 License, consistent with the original OpenPI repository.
References
- FAST: Efficient Robot Action Tokenization - Physical Intelligence Blog
- OpenPI Repository - Original implementation
- FAST Tokenizer on Hugging Face - Pre-trained tokenizer