π₀-FAST (Pi0-FAST)

π₀-FAST is a Vision-Language-Action model for general robot control that uses autoregressive next-token prediction to model continuous robot actions.

Model Overview

π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called FAST (Frequency-space Action Sequence Tokenization). This enables training autoregressive VLAs on highly dexterous tasks that are out of reach for standard binning-based discretization, while training up to 5x faster than diffusion-based approaches like π₀.

Why FAST?

Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control. FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively—just like language tokens.

How FAST Tokenization Works

The FAST tokenizer compresses action sequences through the following steps:
  1. Normalize: Take a continuous action chunk of shape (H, D) where H is the horizon and D is the action dimension. Normalize using one of the supported normalization methods (Quantiles recommended to handle outliers).
  2. Discrete Cosine Transform (DCT): Apply the DCT (via scipy) to each action dimension separately. The DCT is a frequency transform widely used in lossy compression codecs such as JPEG and MP3.
  3. Quantization: Round and remove insignificant coefficients for each action dimension, producing a sparse frequency matrix.
  4. Flatten: Flatten the matrix into a 1D vector, with low-frequency components first.
  5. Byte Pair Encoding (BPE): Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving 10x compression over prior tokenization approaches.
This approach can transform any existing VLM into a VLA by training it to predict these FAST tokens.
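The encode path described in steps 1-4 can be sketched in a few lines of NumPy and SciPy. This is an illustrative sketch, not the actual LeRobot tokenizer: the real implementation adds a trained BPE stage (step 5), and the exact quantile bounds used for normalization are an assumption here. `scale` plays the role of the `--scale` parameter.

```python
# Minimal sketch of FAST encoding (steps 1-4): normalize, DCT, quantize, flatten.
import numpy as np
from scipy.fft import dct

def fast_encode(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Compress an action chunk of shape (H, D) into a flat coefficient vector."""
    # 1. Normalize per dimension (quantile normalization, sketched with 1st/99th pct).
    lo = np.quantile(chunk, 0.01, axis=0)
    hi = np.quantile(chunk, 0.99, axis=0)
    norm = 2.0 * (chunk - lo) / np.maximum(hi - lo, 1e-8) - 1.0
    # 2. DCT along the time axis, separately for each action dimension.
    coeffs = dct(norm, axis=0, norm="ortho")
    # 3. Quantize: scale and round, zeroing out insignificant coefficients.
    quant = np.round(coeffs * scale).astype(np.int64)
    # 4. Flatten row-major so low-frequency components come first.
    return quant.flatten()

chunk = np.random.default_rng(0).normal(size=(10, 6))  # H=10, D=6
tokens = fast_encode(chunk)
print(tokens.shape)  # (60,)
```

In the full pipeline, this integer sequence is what the BPE tokenizer compresses into dense action tokens.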

Installation Requirements

  1. Install LeRobot by following our Installation Guide.
  2. Install π₀-FAST dependencies by running:
    pip install -e ".[pi]"
    

Training a Custom FAST Tokenizer

You have two options for the FAST tokenizer:
  1. Use the pre-trained tokenizer: The lerobot/fast-action-tokenizer tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
  2. Train your own tokenizer: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.

Training Your Own Tokenizer

lerobot-train-tokenizer \
    --repo_id "user/my-lerobot-dataset" \
    --action_horizon 10 \
    --encoded_dims "0:6" \
    --vocab_size 1024 \
    --scale 10.0 \
    --normalization_mode QUANTILES \
    --output_dir "./my_fast_tokenizer" \
    --push_to_hub \
    --hub_repo_id "username/my-action-tokenizer"

Key Tokenizer Parameters

| Parameter | Description | Default |
|---|---|---|
| --repo_id | LeRobot dataset repository ID | Required |
| --action_horizon | Number of future actions in each chunk | 10 |
| --encoded_dims | Comma-separated dimension ranges to encode (e.g., "0:6,7:23") | "0:6,7:23" |
| --vocab_size | BPE vocabulary size | 1024 |
| --scale | DCT scaling factor for quantization | 10.0 |
| --normalization_mode | Normalization mode (MEAN_STD, MIN_MAX, QUANTILES, QUANTILE10, IDENTITY) | QUANTILES |
| --sample_fraction | Fraction of chunks to sample per episode | 0.1 |
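To make the `--encoded_dims` format concrete, here is a hypothetical helper showing how a string such as "0:6,7:23" maps to index ranges; the actual CLI parsing code may differ.

```python
# Hypothetical parser for an --encoded_dims spec like "0:6,7:23".
def parse_encoded_dims(spec: str) -> list[range]:
    """Turn 'start:stop' segments into half-open index ranges."""
    ranges = []
    for part in spec.split(","):
        start, stop = part.split(":")
        ranges.append(range(int(start), int(stop)))
    return ranges

dims = parse_encoded_dims("0:6,7:23")
print(dims)  # [range(0, 6), range(7, 23)]
```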

Usage

To use π₀-FAST in LeRobot, specify the policy type as:
policy.type=pi0_fast

Training

For training π₀-FAST, you can use the LeRobot training script:
lerobot-train \
    --dataset.repo_id=your_dataset \
    --policy.type=pi0_fast \
    --output_dir=./outputs/pi0fast_training \
    --job_name=pi0fast_training \
    --policy.pretrained_path=lerobot/pi0_fast_base \
    --policy.dtype=bfloat16 \
    --policy.gradient_checkpointing=true \
    --policy.chunk_size=10 \
    --policy.n_action_steps=10 \
    --policy.max_action_tokens=256 \
    --steps=100000 \
    --batch_size=4 \
    --policy.device=cuda

Key Training Parameters

| Parameter | Description | Default |
|---|---|---|
| --policy.gradient_checkpointing=true | Reduces memory usage significantly during training | false |
| --policy.dtype=bfloat16 | Use mixed precision training for efficiency | float32 |
| --policy.chunk_size | Number of action steps to predict (action horizon) | 50 |
| --policy.n_action_steps | Number of action steps to execute | 50 |
| --policy.max_action_tokens | Maximum number of FAST tokens per action chunk | 256 |
| --policy.action_tokenizer_name | FAST tokenizer to use | lerobot/fast-action-tokenizer |
| --policy.compile_model=true | Enable torch.compile for faster training | false |

Inference

KV-Caching for Fast Inference

π₀-FAST supports KV-caching, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.
# KV-caching is enabled by default
policy.use_kv_cache=true
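To illustrate why KV-caching speeds up autoregressive decoding, here is a conceptual single-head attention sketch (not the LeRobot internals): each step computes only the new token's key and value and appends them to the cache, so the prefix is never re-projected.

```python
# Conceptual KV-cache sketch: one attention head, keys/values appended per step.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x: np.ndarray) -> np.ndarray:
    """One decoding step for embedding x of shape (d,), reusing cached K/V."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # compute and cache only this step's key
    v_cache.append(x @ Wv)  # ... and value
    K = np.stack(k_cache)   # (t, d): keys for all steps so far, no recompute
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V      # attention output for the current position

for _ in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (8,)
```

Without the cache, every step would recompute K and V for the entire prefix, making decoding quadratic in sequence length per token.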

Inference Example

from lerobot.policies.pi0_fast import PI0FastPolicy, PI0FastConfig

# Load the policy
policy = PI0FastPolicy.from_pretrained("your-model-path")

# During inference
actions = policy.predict_action_chunk(batch)

Model Architecture

π₀-FAST uses a PaliGemma-based architecture:
  • Vision Encoder: SigLIP vision tower for image understanding
  • Language Model: Gemma 2B for processing language instructions and predicting action tokens
The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back to continuous actions.

Configuration Options

| Parameter | Description | Default |
|---|---|---|
| paligemma_variant | VLM backbone variant (gemma_300m, gemma_2b) | gemma_2b |
| max_state_dim | Maximum state vector dimension (padded) | 32 |
| max_action_dim | Maximum action vector dimension (padded) | 32 |
| temperature | Sampling temperature (0.0 for greedy) | 0.0 |
| max_decoding_steps | Maximum decoding steps | 256 |
| use_kv_cache | Enable KV caching for faster inference | true |
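The temperature option above follows the usual LLM convention, sketched below for a single token (this illustrates the convention, not LeRobot's actual decoding code): 0.0 selects the argmax token deterministically, while higher values sample from a softened distribution.

```python
# Greedy vs. temperature sampling for a single next-token choice.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy: always the highest-logit token
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([0.1, 2.5, 0.3])
print(sample_token(logits, temperature=0.0))  # 1
```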

Comparison with π₀

| Feature | π₀ | π₀-FAST |
|---|---|---|
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed | 1x | 5x faster |
| Dexterity | High | High |
| Inference Method | Iterative Denoising | Autoregressive Decoding |
| KV-Caching | N/A | Supported |

Reproducing π₀-FAST results

We reproduce the results of π₀-FAST on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀-FAST base model lerobot/pi0fast-base and finetune it for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs, using the HuggingFace LIBERO dataset. The finetuned model can be found here. We used the following training command:
lerobot-train \
  --dataset.repo_id=lerobot/libero \
  --output_dir=outputs/libero_pi0fast \
  --job_name=libero_pi0fast \
  --policy.path=lerobot/pi0fast_base \
  --policy.dtype=bfloat16 \
  --steps=100000 \
  --save_freq=20000 \
  --batch_size=4 \
  --policy.device=cuda \
  --policy.scheduler_warmup_steps=4000 \
  --policy.scheduler_decay_steps=100000 \
  --policy.scheduler_decay_lr=1e-5 \
  --policy.gradient_checkpointing=true \
  --policy.chunk_size=10 \
  --policy.n_action_steps=10 \
  --policy.max_action_tokens=256 \
  --policy.empty_cameras=1
We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
tasks="libero_object,libero_spatial,libero_goal,libero_10"
lerobot-eval \
  --policy.path=lerobot/pi0fast-libero \
  --policy.max_action_tokens=256 \
  --env.type=libero \
  --policy.gradient_checkpointing=false \
  --env.task=${tasks} \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
Note: We set n_action_steps=10, similar to the original OpenPI implementation.

Results

We obtain the following results on the LIBERO benchmark:
| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
|---|---|---|---|---|---|
| π₀-FAST | 70.0 | 100.0 | 100.0 | 60.0 | 82.5 |
The full evaluation output folder, including videos, is available here.

License

This model follows the Apache 2.0 License, consistent with the original OpenPI repository.
