The lerobot-eval command evaluates trained policies by running rollouts in environments and computing success metrics.

Command

lerobot-eval [OPTIONS]
Location: src/lerobot/scripts/lerobot_eval.py

Overview

The evaluation script:
  • Loads pretrained policies from Hugging Face Hub or local paths
  • Runs policy rollouts in simulation environments
  • Computes success rates and rewards
  • Records evaluation videos
  • Saves metrics to JSON files
  • Supports parallel evaluation across multiple tasks

Key Options

Policy Options

  • --policy.path (str, required): Path or Hub ID to pretrained policy (e.g., lerobot/diffusion_pusht or outputs/train/my_run/checkpoints/005000/pretrained_model).
  • --policy.device (str, default: cuda): Device for inference: cpu, cuda, cuda:0, etc.
  • --policy.use_amp (bool, default: False): Use automatic mixed precision for inference.

Environment Options

  • --env.type (str, required): Environment type: pusht, xarm, aloha, libero, etc.
  • --env.task (str): Specific task within environment suite (for LIBERO, SIMPLER, etc.).
  • --env.max_parallel_tasks (int, default: 1): Maximum number of tasks to evaluate in parallel (for multi-task environments).

Evaluation Options

  • --eval.n_episodes (int, default: 50): Number of episodes to evaluate.
  • --eval.batch_size (int, default: 10): Number of parallel environments to run.
  • --eval.use_async_envs (bool, default: False): Use asynchronous vectorized environments for faster evaluation.

Output Options

  • --output_dir (str, default: outputs/eval): Directory for saving evaluation results and videos.
  • --seed (int, default: 1000): Random seed for environment initialization.

Usage Examples

Basic Evaluation

lerobot-eval \
  --policy.path=lerobot/diffusion_pusht \
  --env.type=pusht \
  --eval.n_episodes=50 \
  --eval.batch_size=10

Evaluate Local Checkpoint

lerobot-eval \
  --policy.path=outputs/train/my_run/checkpoints/005000/pretrained_model \
  --env.type=pusht \
  --eval.n_episodes=50

Evaluate with Video Recording

Videos are automatically saved to {output_dir}/videos/:
lerobot-eval \
  --policy.path=lerobot/act_aloha_sim_insertion_human \
  --env.type=aloha \
  --env.task=insertion \
  --eval.n_episodes=100 \
  --output_dir=./eval_results

Multi-task Evaluation (LIBERO)

lerobot-eval \
  --policy.path=lerobot/pi0_libero \
  --env.type=libero \
  --env.suite=libero_90 \
  --eval.n_episodes=20 \
  --env.max_parallel_tasks=4

CPU Evaluation

lerobot-eval \
  --policy.path=lerobot/diffusion_pusht \
  --policy.device=cpu \
  --env.type=pusht \
  --eval.batch_size=1

Custom Output Directory and Seed

lerobot-eval \
  --policy.path=lerobot/diffusion_pusht \
  --env.type=pusht \
  --output_dir=./custom_eval \
  --seed=42
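To compare runs across seeds (see the --seed option above), the CLI is easy to script. A minimal sketch that only builds the command lines from the documented flags; pass each list to subprocess.run to actually execute them (requires lerobot installed):

```python
# Build lerobot-eval command lines for a small seed sweep.
# This only constructs the commands from the flags documented above;
# it does not run them.
def build_eval_commands(policy_path, env_type, seeds, output_root="outputs/eval"):
    commands = []
    for seed in seeds:
        commands.append([
            "lerobot-eval",
            f"--policy.path={policy_path}",
            f"--env.type={env_type}",
            f"--seed={seed}",
            f"--output_dir={output_root}/seed_{seed}",
        ])
    return commands

cmds = build_eval_commands("lerobot/diffusion_pusht", "pusht", seeds=[42, 43, 44])
for cmd in cmds:
    print(" ".join(cmd))
```

Each run gets its own output directory here so the eval_info.json files don't overwrite each other.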

Output Structure

Evaluation results are saved in the following structure:
outputs/eval/
├── videos/
│   ├── eval_episode_0.mp4
│   ├── eval_episode_1.mp4
│   └── ...
└── eval_info.json

eval_info.json Format

{
  "per_episode": [
    {
      "episode_ix": 0,
      "sum_reward": 0.89,
      "max_reward": 1.0,
      "success": true,
      "seed": 1000
    },
    ...
  ],
  "aggregated": {
    "avg_sum_reward": 0.76,
    "avg_max_reward": 0.95,
    "pc_success": 68.0,
    "eval_s": 125.4,
    "eval_ep_s": 2.51
  },
  "video_paths": [
    "outputs/eval/videos/eval_episode_0.mp4",
    ...
  ]
}
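The aggregated metrics can be recomputed from the per_episode records, which is useful as a sanity check or for custom reporting. A minimal sketch using only the keys shown above (the inline JSON stands in for a real eval_info.json, which you would read with json.load):

```python
import json

def aggregate(per_episode):
    """Recompute aggregate metrics from per-episode records."""
    n = len(per_episode)
    return {
        "avg_sum_reward": sum(ep["sum_reward"] for ep in per_episode) / n,
        "avg_max_reward": sum(ep["max_reward"] for ep in per_episode) / n,
        # success is a bool per episode; the mean times 100 gives the percentage
        "pc_success": 100.0 * sum(ep["success"] for ep in per_episode) / n,
    }

eval_info = json.loads("""
{
  "per_episode": [
    {"episode_ix": 0, "sum_reward": 0.8, "max_reward": 1.0, "success": true, "seed": 1000},
    {"episode_ix": 1, "sum_reward": 0.4, "max_reward": 0.9, "success": false, "seed": 1001}
  ]
}
""")
print(aggregate(eval_info["per_episode"]))
```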

Metrics

The evaluation script reports:
  • avg_sum_reward (float): Average cumulative reward across all episodes.
  • avg_max_reward (float): Average of the maximum single-step reward per episode.
  • pc_success (float): Success rate as a percentage (0-100).
  • eval_s (float): Total evaluation time in seconds.
  • eval_ep_s (float): Average time per episode in seconds.

Programmatic Usage

from lerobot.scripts.lerobot_eval import eval_main
from lerobot.configs.eval import EvalPipelineConfig, EvalConfig
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.env import EnvConfig

config = EvalPipelineConfig(
    policy=PreTrainedConfig(
        path="lerobot/diffusion_pusht",
        device="cuda",
    ),
    env=EnvConfig(type="pusht"),
    eval=EvalConfig(
        n_episodes=50,
        batch_size=10,
    ),
    output_dir="./eval_results",
)

eval_main(config)

Advanced Usage

Evaluate Specific Environment Seeds

from pathlib import Path

from lerobot.scripts.lerobot_eval import eval_policy
from lerobot.policies import make_policy
from lerobot.envs import make_env

policy = make_policy(
    cfg=None,
    pretrained_path="lerobot/diffusion_pusht"
)

env = make_env("pusht", n_envs=10)

info = eval_policy(
    env=env,
    policy=policy,
    n_episodes=50,
    start_seed=42,  # Specific seed
    max_episodes_rendered=10,
    videos_dir=Path("./videos"),
)

print(f"Success rate: {info['aggregated']['pc_success']:.1f}%")

Custom Evaluation Loop

from lerobot.scripts.lerobot_eval import rollout
from lerobot.policies import make_policy
from lerobot.envs import make_env

policy = make_policy(cfg=None, pretrained_path="lerobot/diffusion_pusht")
env = make_env("pusht", n_envs=1)

rollout_data = rollout(
    env=env,
    policy=policy,
    seeds=[42],
    return_observations=True,  # Include full observations
)

print(f"Actions shape: {rollout_data['action'].shape}")
print(f"Rewards: {rollout_data['reward']}")
print(f"Success: {rollout_data['success'][-1]}")

Parallel Multi-Task Evaluation

from lerobot.scripts.lerobot_eval import eval_policy_all
from lerobot.policies import make_policy
from lerobot.envs import make_env
from pathlib import Path

policy = make_policy(cfg=None, pretrained_path="lerobot/pi0_libero")

# Create dict of environments for multiple tasks
envs = {
    "libero_spatial": {
        0: make_env("libero", task_id=0, n_envs=10),
        1: make_env("libero", task_id=1, n_envs=10),
    },
    "libero_object": {
        0: make_env("libero", task_id=10, n_envs=10),
        1: make_env("libero", task_id=11, n_envs=10),
    },
}

results = eval_policy_all(
    envs=envs,
    policy=policy,
    n_episodes=20,
    max_parallel_tasks=4,
    videos_dir=Path("./videos"),
)

print("Overall:", results["overall"])
for suite, metrics in results["per_group"].items():
    print(f"{suite}: {metrics['pc_success']:.1f}% success")
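Conceptually, max_parallel_tasks caps how many task evaluations are in flight at once. A generic sketch of that scheduling idea (this is not lerobot's internal implementation; eval_one_task and its return value are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def eval_one_task(task_id):
    # Hypothetical stand-in for evaluating one task; returns a fake metric.
    return task_id, 100.0 - task_id

task_ids = list(range(8))
# At most 4 task evaluations run concurrently, mirroring max_parallel_tasks=4.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(eval_one_task, task_ids))

print(results)
```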

Supported Environments

  • pusht: 2D pushing task
  • xarm: X-Arm manipulation
  • aloha: Bimanual manipulation tasks
  • libero: LIBERO benchmark suite
  • simpler: SIMPLER benchmark
  • Custom environments via Gymnasium interface

Tips

  1. Batch Size: A larger --eval.batch_size runs more environments in parallel and speeds up evaluation.
  2. Async Envs: Enable --eval.use_async_envs for better parallelization across environment processes.
  3. Video Storage: Recording videos uses disk space; reduce the number of rendered episodes if needed (max_episodes_rendered in eval_policy).
  4. Seeds: Use a fixed --seed for reproducible comparisons between policies.
  5. Multi-GPU: Evaluation runs on a single GPU; for multi-task suites, use --env.max_parallel_tasks to parallelize across tasks.
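On seeds: each per_episode record in eval_info.json carries a seed field, and eval_policy takes a start_seed, which suggests sequential per-episode seeding. A sketch under that assumption (verify against the seed fields in your own eval_info.json):

```python
def episode_seeds(start_seed, n_episodes):
    # Assumed scheme: episode i gets start_seed + i. This is an assumption
    # inferred from the start_seed argument, not confirmed lerobot behavior.
    return [start_seed + i for i in range(n_episodes)]

print(episode_seeds(1000, 5))  # [1000, 1001, 1002, 1003, 1004]
```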
