The LeRobotDataset class is a PyTorch dataset for working with robot learning data in LeRobot. It supports loading existing datasets and recording new ones.

Class Definition

from lerobot.datasets import LeRobotDataset
Location: src/lerobot/datasets/lerobot_dataset.py:566

Overview

LeRobotDataset provides:
  • Loading datasets from Hugging Face Hub or local storage
  • Recording new datasets from robot interactions
  • Video encoding/decoding for efficient storage
  • Episode-based data organization
  • Delta timestamps for temporal queries
  • Push/pull from Hugging Face Hub

Constructor

def __init__(
    self,
    repo_id: str,
    root: str | Path | None = None,
    episodes: list[int] | None = None,
    image_transforms: Callable | None = None,
    delta_timestamps: dict[str, list[float]] | None = None,
    tolerance_s: float = 1e-4,
    revision: str | None = None,
    force_cache_sync: bool = False,
    download_videos: bool = True,
    video_backend: str | None = None,
    batch_encoding_size: int = 1,
    vcodec: str = "libsvtav1",
    streaming_encoding: bool = False,
    encoder_queue_maxsize: int = 30,
    encoder_threads: int | None = None,
)

Parameters

repo_id
str
required
Repository identifier in format {username}/{dataset_name} (e.g., lerobot/pusht).
root
str | Path | None
Local directory for dataset storage. Defaults to $HF_LEROBOT_HOME/repo_id.
episodes
list[int] | None
List of episode indices to load. If None, loads all episodes.
image_transforms
Callable | None
Torchvision transforms to apply to image modalities.
delta_timestamps
dict[str, list[float]] | None
Dictionary mapping feature keys to lists of time offsets (in seconds) for temporal queries. Example:
delta_timestamps = {
    "observation.images.laptop": [0.0, -1/30],  # Current and previous frame
    "action": [0.0, 1/30, 2/30],  # Current and next 2 actions
}
tolerance_s
float
default:"1e-4"
Tolerance in seconds for timestamp validation.
revision
str | None
Git revision (branch, tag, or commit hash) for Hugging Face Hub.
force_cache_sync
bool
default:"False"
If True, refresh local files from Hub even if already cached.
download_videos
bool
default:"True"
Whether to download video files.
video_backend
str | None
Video decoding backend: "torchcodec", "pyav", or "video_reader". Auto-detects if None.
batch_encoding_size
int
default:"1"
Number of episodes to accumulate before encoding videos. Set to 1 for immediate encoding.
vcodec
str
default:"libsvtav1"
Video codec: "h264", "hevc", "libsvtav1", "auto", or hardware-specific codecs.
streaming_encoding
bool
default:"False"
If True, encode video frames in real-time during capture instead of writing PNGs first.
encoder_queue_maxsize
int
default:"30"
Maximum frames to buffer per camera when using streaming encoding.
encoder_threads
int | None
Number of threads per encoder. None uses codec default.
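
Two of the timing parameters interact: delta_timestamps offsets are typically multiples of the frame period, and tolerance_s decides whether a requested offset lands close enough to an actual frame timestamp. A self-contained sketch of that relationship (the within_tolerance helper is illustrative, not part of the API):

```python
fps = 30
dt = 1 / fps  # frame period in seconds

# Offsets expressed as multiples of the frame period,
# as in the delta_timestamps example above.
delta_timestamps = {
    "observation.images.laptop": [-dt, 0.0],  # previous and current frame
    "action": [0.0, dt, 2 * dt],              # current and next two actions
}

def within_tolerance(requested_s: float, actual_s: float, tolerance_s: float = 1e-4) -> bool:
    """Illustrative version of the timestamp check: a requested offset is
    valid only if it lies within tolerance_s of a real frame timestamp."""
    return abs(requested_s - actual_s) <= tolerance_s

print(within_tolerance(0.0333, dt))  # True: within 1e-4 s of 1/30 s
print(within_tolerance(0.04, dt))    # False: ~6.7 ms off, far beyond tolerance
```

At 30 fps, frames are ~33.3 ms apart, so the default 1e-4 s tolerance only accepts offsets that land essentially on a frame boundary.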

Properties

fps

@property
def fps(self) -> int
fps
int
Frames per second used during data collection.

num_frames

@property
def num_frames(self) -> int
num_frames
int
Number of frames in selected episodes.

num_episodes

@property
def num_episodes(self) -> int
num_episodes
int
Number of episodes selected.

features

@property
def features(self) -> dict[str, dict]
features
dict[str, dict]
All features contained in the dataset with their metadata (dtype, shape, names).
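
The returned mapping pairs each feature key with its metadata. A plain-Python sketch of what that mapping resembles (shapes and names here are illustrative placeholders, not taken from a real dataset):

```python
# Illustrative structure of dataset.features
features = {
    "observation.images.top": {
        "dtype": "video",
        "shape": (96, 96, 3),
        "names": ["height", "width", "channels"],
    },
    "observation.state": {"dtype": "float32", "shape": (2,), "names": ["x", "y"]},
    "action": {"dtype": "float32", "shape": (2,), "names": ["x", "y"]},
}

# A common pattern: select camera streams by dtype.
camera_keys = [k for k, v in features.items() if v["dtype"] in ("video", "image")]
print(camera_keys)  # ['observation.images.top']
```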

Methods

__getitem__

def __getitem__(self, idx: int) -> dict
Get a single frame from the dataset.
idx
int
required
Frame index.
frame
dict
Dictionary containing:
  • All observation keys (e.g., images, state)
  • action: Action taken at this timestep
  • episode_index: Episode this frame belongs to
  • frame_index: Index within the episode
  • timestamp: Time in seconds
  • task: Task description string
  • Delta timestamp queries if configured
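
As a rough sketch, a returned frame resembles the dictionary below (plain-Python placeholder values; the real entries are tensors, and image keys are included as well):

```python
fps = 30
frame_index = 42

# Illustrative shape of the dict returned by dataset[idx]
frame = {
    "observation.state": [0.1, -0.2],
    "action": [0.0, 0.5],
    "episode_index": 3,
    "frame_index": frame_index,
    "timestamp": frame_index / fps,  # seconds since episode start
    "task": "pick and place",
}
print(frame["timestamp"])  # 1.4
```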

push_to_hub

def push_to_hub(
    self,
    branch: str | None = None,
    tags: list | None = None,
    license: str | None = "apache-2.0",
    tag_version: bool = True,
    push_videos: bool = True,
    private: bool = False,
    allow_patterns: list[str] | str | None = None,
    upload_large_folder: bool = False,
    **card_kwargs,
) -> None
Upload dataset to Hugging Face Hub.
branch
str | None
Git branch name. If None, pushes to main.
tags
list | None
Tags to add to the dataset card.
license
str | None
default:"apache-2.0"
Dataset license.
tag_version
bool
default:"True"
Whether to create a version tag.
push_videos
bool
default:"True"
Whether to upload video files.
private
bool
default:"False"
Whether to create a private repository.

add_frame

def add_frame(self, frame: dict) -> None
Add a frame to the current episode buffer during recording.
frame
dict
required
Dictionary containing observation and action data. Must include:
  • All keys from features
  • task: Task description string
  • Optional timestamp: Time in seconds (auto-generated if not provided)
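
The auto-generated timestamp can be thought of as frame_index / fps. A hypothetical sketch of that fallback behavior (fill_timestamp is not part of the API):

```python
fps = 30

def fill_timestamp(frame: dict, frame_index: int, fps: int) -> dict:
    """Hypothetical: mimic the fallback used when 'timestamp' is omitted."""
    out = dict(frame)
    out.setdefault("timestamp", frame_index / fps)
    return out

f = fill_timestamp({"action": [0.0], "task": "pick and place"}, frame_index=3, fps=fps)
print(f["timestamp"])  # 0.1

# An explicitly provided timestamp is left untouched.
g = fill_timestamp({"timestamp": 0.5, "task": "pick and place"}, frame_index=3, fps=fps)
print(g["timestamp"])  # 0.5
```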

finalize

def finalize(self) -> None
Close parquet writers and finalize the dataset after recording. Must be called after data collection.

Creating a New Dataset

Use the create class method to initialize a new dataset:
@classmethod
def create(
    cls,
    repo_id: str,
    fps: int,
    features: dict,
    robot_type: str | None = None,
    root: str | Path | None = None,
    use_videos: bool = True,
    metadata_buffer_size: int = 10,
    chunks_size: int | None = None,
    data_files_size_in_mb: int | None = None,
    video_files_size_in_mb: int | None = None,
) -> LeRobotDataset
repo_id
str
required
Dataset identifier.
fps
int
required
Frames per second.
features
dict
required
Dictionary defining dataset features.
robot_type
str | None
Type of robot used for recording.
use_videos
bool
default:"True"
Whether to encode images as videos.

Usage Examples

Loading an Existing Dataset

from lerobot.datasets import LeRobotDataset

# Load full dataset
dataset = LeRobotDataset("lerobot/pusht")

# Load specific episodes
dataset = LeRobotDataset(
    "lerobot/pusht",
    episodes=[0, 1, 2]
)

# Load with delta timestamps
dataset = LeRobotDataset(
    "lerobot/pusht",
    delta_timestamps={
        "observation.images.top": [-1/30, 0.0],
        "action": [0.0, 1/30],
    }
)

Using with PyTorch DataLoader

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)

for batch in dataloader:
    images = batch["observation.images.top"]
    actions = batch["action"]
    # Training code here

Recording a New Dataset

from lerobot.datasets import LeRobotDataset

# Create dataset
dataset = LeRobotDataset.create(
    repo_id="myuser/my_dataset",
    fps=30,
    features={
        "observation.images.camera": {
            "dtype": "video",
            "shape": (480, 640, 3),
        },
        "observation.state": {
            "dtype": "float32",
            "shape": (6,),
            "names": ["x", "y", "z", "roll", "pitch", "yaw"],
        },
        "action": {
            "dtype": "float32",
            "shape": (6,),
            "names": ["x", "y", "z", "roll", "pitch", "yaw"],
        },
    },
    robot_type="my_robot",
)

# Record episode
for step in range(100):
    frame = {
        "observation.images.camera": camera.read(),
        "observation.state": robot.get_state(),
        "action": policy.get_action(),
        "task": "pick and place",
    }
    dataset.add_frame(frame)

# Save episode and push to hub
dataset.save_episode()
dataset.finalize()
dataset.push_to_hub()

Dataset Structure

LeRobotDataset uses a chunked file structure:
.
├── data/
│   ├── chunk-000/
│   │   ├── file-000.parquet
│   │   └── file-001.parquet
│   └── chunk-001/
│       └── file-000.parquet
├── meta/
│   ├── episodes/
│   │   └── chunk-000/
│   │       └── file-000.parquet
│   ├── info.json
│   ├── stats.json
│   └── tasks.parquet
└── videos/
    └── observation.images.camera/
        ├── chunk-000/
        │   ├── file-000.mp4
        │   └── file-001.mp4
        └── chunk-001/
            └── file-000.mp4
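
The chunk-NNN/file-NNN names are zero-padded to three digits, so relative paths can be composed directly from indices. A hypothetical helper (not part of the API) illustrating the naming convention:

```python
def data_file_path(chunk_index: int, file_index: int) -> str:
    """Hypothetical: compose a parquet path following the chunked layout."""
    return f"data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"

def video_file_path(camera_key: str, chunk_index: int, file_index: int) -> str:
    """Hypothetical: compose a video path for a given camera feature."""
    return f"videos/{camera_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

print(data_file_path(0, 1))
# data/chunk-000/file-001.parquet
print(video_file_path("observation.images.camera", 1, 0))
# videos/observation.images.camera/chunk-001/file-000.mp4
```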
