Overview
LeRobotDataset is LeRobot’s standardized format for storing, sharing, and loading robot learning data. It provides:
- Efficient storage: Chunked parquet files and compressed videos
- Hub integration: Built-in support for Hugging Face Hub
- Flexible access: Load full datasets or individual episodes
- Rich metadata: Statistics, task labels, and feature descriptions
Dataset Structure
A LeRobotDataset has the following directory structure:
my_dataset/
├── data/                              # Robot observations and actions
│   ├── chunk-000/
│   │   ├── file-000.parquet
│   │   └── file-001.parquet
│   └── chunk-001/
│       └── ...
├── meta/                              # Dataset metadata
│   ├── episodes/
│   │   └── chunk-000/
│   │       └── file-000.parquet
│   ├── info.json                      # Dataset configuration
│   ├── stats.json                     # Normalization statistics
│   └── tasks.parquet                  # Task descriptions
└── videos/                            # Camera recordings
    ├── observation.images.laptop/
    │   └── chunk-000/
    │       ├── file-000.mp4
    │       └── file-001.mp4
    └── observation.images.phone/
        └── ...
Source: src/lerobot/datasets/lerobot_dataset.py:617
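Because the layout is plain files on disk, it can be inspected without LeRobot at all. The sketch below builds a throwaway copy of the `data/` tree and enumerates its parquet files with `pathlib` (illustrative only; LeRobot resolves these paths internally):

```python
from pathlib import Path
import tempfile

def list_data_files(root: Path) -> list[str]:
    """Return relative paths of all parquet data files, sorted by chunk then file."""
    return sorted(p.relative_to(root).as_posix()
                  for p in root.glob("data/chunk-*/file-*.parquet"))

# Build a throwaway copy of the layout to demonstrate.
root = Path(tempfile.mkdtemp())
for chunk in range(2):
    d = root / "data" / f"chunk-{chunk:03d}"
    d.mkdir(parents=True)
    for f in range(2):
        (d / f"file-{f:03d}.parquet").touch()

print(list_data_files(root))
# ['data/chunk-000/file-000.parquet', 'data/chunk-000/file-001.parquet',
#  'data/chunk-001/file-000.parquet', 'data/chunk-001/file-001.parquet']
```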
Creating a Dataset
Use the LeRobotDataset.create() classmethod to initialize a new dataset:
from lerobot.datasets import LeRobotDataset
dataset = LeRobotDataset.create(
    repo_id="username/my_robot_data",
    fps=30,
    robot_type="so_follower",
    features={
        "observation.state": {
            "dtype": "float32",
            "shape": (6,),
            "names": ["shoulder_pan", "shoulder_lift", "elbow_flex",
                      "wrist_flex", "wrist_roll", "gripper"],
        },
        "observation.images.top": {
            "dtype": "video",
            "shape": (480, 640, 3),
            "names": ["height", "width", "channel"],
        },
        "action": {
            "dtype": "float32",
            "shape": (6,),
            "names": ["shoulder_pan", "shoulder_lift", "elbow_flex",
                      "wrist_flex", "wrist_roll", "gripper"],
        },
    },
    use_videos=True,
)
Source: src/lerobot/datasets/lerobot_dataset.py:499
Recording Episodes
Adding Frames
Record data frame-by-frame:
for timestep in range(episode_length):
    frame = {
        "observation.state": robot.get_observation()["state"],
        "observation.images.top": robot.get_observation()["camera_top"],
        "action": action,
        "task": "pick_and_place",
    }
    dataset.add_frame(frame)
Source: src/lerobot/datasets/lerobot_dataset.py:1171
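When assembling frames by hand, it helps to check each array against the feature spec declared in create() before calling add_frame. The helper below is hypothetical, not part of the LeRobot API:

```python
import numpy as np

# Numeric feature spec as declared at dataset creation (video features omitted here).
FEATURES = {
    "observation.state": {"dtype": "float32", "shape": (6,)},
    "action": {"dtype": "float32", "shape": (6,)},
}

def check_frame(frame: dict) -> None:
    """Raise ValueError if a numeric feature's dtype or shape disagrees with the spec."""
    for key, spec in FEATURES.items():
        arr = np.asarray(frame[key])
        if arr.shape != spec["shape"]:
            raise ValueError(f"{key}: expected shape {spec['shape']}, got {arr.shape}")
        if arr.dtype != np.dtype(spec["dtype"]):
            raise ValueError(f"{key}: expected dtype {spec['dtype']}, got {arr.dtype}")

frame = {
    "observation.state": np.zeros(6, dtype=np.float32),
    "action": np.zeros(6, dtype=np.float32),
}
check_frame(frame)  # passes silently
```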
Saving Episodes
After collecting all frames, save the episode. The episode index is tracked internally, and the task label is taken from the "task" key supplied with each frame, so no arguments are needed:
dataset.save_episode()
This method:
- Encodes video frames (if using videos)
- Writes data to parquet files
- Computes and stores episode statistics
- Updates dataset metadata
Source: src/lerobot/datasets/lerobot_dataset.py:1200
Finalizing
Always call finalize() when done recording:
dataset.finalize()
This ensures all parquet files are properly closed and metadata is written.
Source: src/lerobot/datasets/lerobot_dataset.py:1131
Loading a Dataset
From Local Disk
dataset = LeRobotDataset(
    repo_id="username/my_robot_data",
    root="/path/to/dataset",
)
From Hugging Face Hub
dataset = LeRobotDataset(
    repo_id="lerobot/pusht",
    download_videos=True,
)
The dataset will be automatically downloaded to ~/.cache/huggingface/lerobot/.
Source: src/lerobot/datasets/lerobot_dataset.py:566
Loading Specific Episodes
dataset = LeRobotDataset(
    repo_id="lerobot/pusht",
    episodes=[0, 1, 2, 5, 10],  # Only load these episodes
)
Source: src/lerobot/datasets/lerobot_dataset.py:571
Accessing Data
LeRobotDataset inherits from torch.utils.data.Dataset, so it works with PyTorch DataLoaders:
from torch.utils.data import DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)

for batch in dataloader:
    observations = batch["observation.state"]
    actions = batch["action"]
    # Train your policy...
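Conceptually, the DataLoader builds each batch by stacking the per-frame dicts key by key. A plain-numpy sketch of that collation (an illustrative stand-in for PyTorch's default collate, not LeRobot code):

```python
import numpy as np

def collate(frames: list[dict]) -> dict:
    """Stack a list of per-frame dicts into one batch dict, key by key
    (roughly what PyTorch's default collate does for dict samples)."""
    keys = frames[0].keys()
    return {k: np.stack([f[k] for f in frames]) for k in keys}

frames = [
    {"observation.state": np.full(6, i, dtype=np.float32),
     "action": np.zeros(6, dtype=np.float32)}
    for i in range(32)
]
batch = collate(frames)
print(batch["observation.state"].shape)  # (32, 6)
```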
Single Frame Access
frame = dataset[0]
print(frame.keys())
# dict_keys(['observation.state', 'observation.images.top', 'action',
# 'episode_index', 'frame_index', 'timestamp', 'task', ...])
Source: src/lerobot/datasets/lerobot_dataset.py:1082
Dataset Properties
print(dataset.fps) # Frames per second (e.g., 30)
print(dataset.num_episodes) # Total number of episodes
print(dataset.num_frames) # Total number of frames
print(dataset.features) # Feature definitions
print(dataset.meta.robot_type) # Robot type used for recording
Source: src/lerobot/datasets/lerobot_dataset.py:939
Statistics
Datasets include normalization statistics computed from the data:
stats = dataset.meta.stats
print(stats["action"]["mean"]) # Mean action values
print(stats["action"]["std"]) # Standard deviation
print(stats["action"]["min"]) # Minimum values
print(stats["action"]["max"]) # Maximum values
Source: src/lerobot/datasets/lerobot_dataset.py:169
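A typical use of these statistics is standard-score normalization of inputs before training. A minimal sketch with synthetic stats (the dict layout mirrors the stats structure shown above; the helper itself is hypothetical):

```python
import numpy as np

# Synthetic per-dimension stats in the same layout as dataset.meta.stats["action"].
stats = {
    "mean": np.array([0.0, 1.0], dtype=np.float32),
    "std": np.array([2.0, 4.0], dtype=np.float32),
}

def normalize(action: np.ndarray, stats: dict, eps: float = 1e-8) -> np.ndarray:
    """Standard-score normalization using per-dimension dataset statistics."""
    return (action - stats["mean"]) / (stats["std"] + eps)

a = np.array([4.0, 9.0], dtype=np.float32)
print(normalize(a, stats))  # ~[2.0, 2.0]
```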
Tasks
View all tasks in the dataset:
print(dataset.meta.tasks)
#        task_index
# task
# pick            0
# place           1
Source: src/lerobot/datasets/lerobot_dataset.py:166
Delta Timestamps
Load temporal sequences of observations/actions:
dataset = LeRobotDataset(
    repo_id="lerobot/pusht",
    delta_timestamps={
        "observation.state": [-0.1, 0.0],     # 100 ms ago and now
        "action": [0.0, 0.033, 0.066, 0.1],   # Current and next 3 actions
    },
)

frame = dataset[100]
print(frame["observation.state"].shape)  # (2, 6) - 2 timesteps
print(frame["action"].shape)             # (4, 6) - 4 action timesteps
Source: src/lerobot/datasets/lerobot_dataset.py:676
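Each delta timestamp maps to a frame offset via the dataset's fps. A simplified sketch of that conversion (the real loader additionally checks retrieved timestamps against a tolerance and pads at episode boundaries):

```python
def delta_to_offsets(deltas: list[float], fps: int) -> list[int]:
    """Convert delta timestamps (seconds) to integer frame offsets."""
    return [round(d * fps) for d in deltas]

print(delta_to_offsets([-0.1, 0.0], fps=30))            # [-3, 0]
print(delta_to_offsets([0.0, 0.033, 0.066, 0.1], 30))   # [0, 1, 2, 3]
```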
Image Transforms
Apply transforms to visual modalities:
from torchvision.transforms import v2 as transforms
transform = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2),
])

dataset = LeRobotDataset(
    repo_id="lerobot/pusht",
    image_transforms=transform,
)
Source: src/lerobot/datasets/lerobot_dataset.py:674
Pushing to Hub
Share your dataset on Hugging Face Hub:
dataset.push_to_hub(
    branch="main",
    tags=["robotics", "manipulation"],
    license="apache-2.0",
    push_videos=True,
)
This will:
- Create a dataset repository (if it doesn’t exist)
- Upload all parquet files and videos
- Generate a dataset card with metadata
- Tag the release with the codebase version
Source: src/lerobot/datasets/lerobot_dataset.py:796
Video Encoding
LeRobotDataset supports multiple video encoding options:
Codec Selection
dataset = LeRobotDataset.create(
    repo_id="username/my_data",
    fps=30,
    features=features,
    vcodec="libsvtav1",  # Options: 'h264', 'hevc', 'libsvtav1', 'auto'
)
Source: src/lerobot/datasets/lerobot_dataset.py:696
Streaming Encoding
Encode videos in real-time during recording for faster save_episode():
dataset = LeRobotDataset.create(
    repo_id="username/my_data",
    fps=30,
    features=features,
    streaming_encoding=True,     # Encode frames as they arrive
    encoder_queue_maxsize=30,    # Buffer size per camera
)
Source: src/lerobot/datasets/lerobot_dataset.py:699
Dataset Info
The info.json file contains essential dataset configuration:
{
  "codebase_version": "v3.0",
  "fps": 30,
  "robot_type": "so_follower",
  "total_episodes": 100,
  "total_frames": 50000,
  "total_tasks": 3,
  "features": {...},
  "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
  "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"
}
Source: src/lerobot/datasets/lerobot_dataset.py:163
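The data_path and video_path entries are standard Python format templates, so concrete file locations follow directly from a chunk index and file index. For example, using the templates from the info.json above:

```python
info = {
    "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
    "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
}

# Resolve concrete paths by filling in the template fields.
print(info["data_path"].format(chunk_index=0, file_index=1))
# data/chunk-000/file-001.parquet
print(info["video_path"].format(video_key="observation.images.top",
                                chunk_index=2, file_index=0))
# videos/observation.images.top/chunk-002/file-000.mp4
```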
Best Practices
Batch Encoding: Set batch_encoding_size > 1 to encode videos for several episodes at once rather than after every episode, reducing total recording time.
Always finalize: Failing to call dataset.finalize() will result in corrupted parquet files that cannot be loaded.
Version compatibility: Datasets created with v3.0 are not compatible with older versions of LeRobot. Use the conversion script if needed.
Next Steps