Dataset v3 #1412

michel-aractingi · 2025-06-30T14:34:02Z

LeRobotDataset v3.0

LeRobotDataset v3.0 is an upgrade to the dataset infrastructure that significantly improves performance and scalability.

Key idea: Move from a one-file-per-episode layout to files that each contain multiple episodes.

Key Improvements

Faster loading: Reduced dataset initialization time
Better performance: More efficient memory usage and data access
Scalable format: The new format is designed to support large-scale datasets such as Droid

File Organization

Updated file structure: Transitioned from episode-based to file-based organization
- <=v2.1:
  - Files: data/chunk-000/episode_000000.parquet
  - Videos: videos/chunk-000/image_key/episode_000000.mp4
- v3.0:
  - Files: data/chunk-000/file-000.parquet
  - Videos: videos/image_key/chunk-000/file-000.mp4
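The difference between the two layouts can be sketched as two path builders that follow the directory trees shown in this description. This is only an illustration: the function names and the `chunk_size` default of 1000 episodes per chunk are assumptions, not values taken from this PR.

```python
def v21_paths(episode_index: int, video_key: str, chunk_size: int = 1000) -> dict:
    """Build v2.1-style paths: one parquet file and one video per episode.

    chunk_size (episodes per chunk directory) is a hypothetical default.
    """
    chunk = episode_index // chunk_size
    return {
        "data": f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet",
        "video": f"videos/chunk-{chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    }


def v30_paths(chunk_index: int, file_index: int, video_key: str) -> dict:
    """Build v3.0-style paths: files are numbered independently of episodes,
    and each file may hold many episodes."""
    return {
        "data": f"data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
        "video": f"videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
    }


old = v21_paths(episode_index=1, video_key="camera_key")
new = v30_paths(chunk_index=0, file_index=0, video_key="camera_key")
```

In v2.1 the path is a function of the episode index, whereas in v3.0 the episode-to-file mapping has to come from the episode metadata.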

Metadata evolution

Unified metadata structure: All episode metadata now stored in structured parquet files
- Before: JSON Lines format (episodes.jsonl, tasks.jsonl, episodes_stats.jsonl)
- After: parquet format (meta/episodes/chunk-000/file-000.parquet)
Per-episode statistics: Enhanced statistics tracking at the episode level
Simplified episode access:
- Before: dataset.episode_data_index["from"][0].item()
- After: dataset.meta.episodes["dataset_from_index"][0]
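To illustrate how per-episode index columns give access to an episode's frames, here is a minimal sketch that uses a plain dictionary as a stand-in for `dataset.meta.episodes`. Only `dataset_from_index` appears in this description; the `dataset_to_index` and `episode_index` columns, the sample values, and the helper name are assumptions made for the example.

```python
# Stand-in for dataset.meta.episodes: one entry per episode in each column.
# "dataset_to_index" and "episode_index" are assumed column names.
episodes = {
    "episode_index": [0, 1, 2],
    "dataset_from_index": [0, 120, 250],
    "dataset_to_index": [120, 250, 400],
}


def episode_frame_range(episodes: dict, episode_index: int) -> range:
    """Return the global frame indices belonging to one episode."""
    row = episodes["episode_index"].index(episode_index)
    start = episodes["dataset_from_index"][row]
    stop = episodes["dataset_to_index"][row]  # treated as exclusive here
    return range(start, stop)


frames = episode_frame_range(episodes, 1)
```

With the real dataset object, the equivalent lookup would read the same columns from `dataset.meta.episodes` instead of the dictionary above.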

v2.1

dataset/
├── meta/
│   ├── episodes.jsonl
│   ├── tasks.jsonl
│   ├── episodes_stats.jsonl
│   └── info.json
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── episode_000001.parquet
└── videos/
    └── chunk-000/
        └── camera_key/
            ├── episode_000000.mp4
            └── episode_000001.mp4

v3.0

dataset/
├── meta/
│   ├── episodes/
│   │   └── chunk-000/
│   │       └── file-000.parquet
│   ├── tasks.parquet
│   ├── stats.json
│   └── info.json
├── data/
│   └── chunk-000/
│       └── file-000.parquet
└── videos/
    └── camera_key/
        └── chunk-000/
            └── file-000.mp4

New scripts

src/lerobot/datasets/aggregate.py: Functions for aggregating multiple datasets, with metadata validation and episode merging capabilities
src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py: Conversion script to migrate datasets from v2.1 to v3.0 format
examples/port_datasets/*: scripts to port droid and AgiBot datasets with slurm and datatrove support

Conversion from v2.1

A conversion script is provided in lerobot/datasets/v30/convert_dataset_v21_to_v30.py, usage:

python lerobot/datasets/v30/convert_dataset_v21_to_v30.py --repo-id=your/dataset

What Gets Converted

Data consolidation: Multiple episode files merged into optimally-sized chunks
Metadata restructuring: JSON Lines converted to structured parquet format
Video reorganization: Per-episode videos consolidated into efficient chunks
Statistics aggregation: Enhanced per-episode and global statistics
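The data and video consolidation steps can be pictured as packing consecutive episodes into files under a size cap. The sketch below is a simple greedy packer written for illustration; it is not the logic of the conversion script, and the function name, the size values, and the cap are hypothetical.

```python
def pack_episodes(episode_sizes_mb: list, max_file_size_mb: float) -> list:
    """Group consecutive episodes into files without exceeding the size cap.

    Returns a list of files, each a list of episode indices. An episode
    larger than the cap still gets a file of its own.
    """
    files = []
    current, current_size = [], 0.0
    for episode_index, size in enumerate(episode_sizes_mb):
        # Start a new file when adding this episode would exceed the cap.
        if current and current_size + size > max_file_size_mb:
            files.append(current)
            current, current_size = [], 0.0
        current.append(episode_index)
        current_size += size
    if current:
        files.append(current)
    return files


layout = pack_episodes([40, 40, 40, 100, 10], max_file_size_mb=100)
# layout == [[0, 1], [2], [3], [4]]
```

Keeping episodes in order means each file covers a contiguous range of episodes, which makes the per-episode metadata lookup straightforward.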

Benchmark

Dataset V3.0 benchmarks are available in this file. The benchmark compared v2.1 against six v3.0 variants with different maximum video file sizes (10, 50, 100, 250, 500, and 1000 MB) to evaluate how file size affects performance across the following metrics:

Download Time (s) - Time to download the dataset
Metadata Initialization Time (s) - Time to initialize dataset metadata
Access Rate (samples/sec) - Number of samples that can be accessed per second
Memory Usage (MB) - RAM consumption during operation (unreliable metric in this study)
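As an illustration of the access-rate metric, the sketch below times sequential indexing over any object that supports `len()` and integer indexing. It is not the benchmark code used for this PR; the function name and the list used as a stand-in dataset are assumptions.

```python
import time


def access_rate(dataset, num_samples: int) -> float:
    """Return samples accessed per second over the first num_samples items."""
    n = min(num_samples, len(dataset))
    start = time.perf_counter()
    for i in range(n):
        _ = dataset[i]  # the access being timed
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return n / elapsed if elapsed > 0 else float("inf")


# A plain list stands in for a dataset here.
rate = access_rate(list(range(1000)), num_samples=500)
```

A real measurement would likely also randomize the access order and repeat the run, since sequential reads can benefit from caching.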

For some file sizes, Dataset V3.0 is on par with or outperforms Dataset V2.1. Dataset V3.0 therefore allows us to support larger datasets without sacrificing performance.

📋 TODOs

Convert all v2.1 datasets on lerobot
Cherry-pick commits for aggregate.py Fix Aggregation, Add Tests #1264
Add testing on the episode level to ensure metadata consistency during data collection
Fix replay script to filter through episode indices since data is chunked
Validate resume recording still works
Check that the Dataset V3.0 visualizer works
Add benchmarks to the PR
Port Droid in v3 format
Port AgiBot in v3 format

Next steps

Merge LeRobotDatasetStreaming to main #1165


CarolinePascal

Very good job folks 🎉 Let's merge this !

jackvial · 2025-09-14T17:30:08Z

@michel-aractingi these improvements look great!

Small thing I noticed when running python lerobot/datasets/v30/convert_dataset_v21_to_v30.py
The default dataset README includes the dataset info with the codebase_version. For example, I converted this dataset and it looks correct, but the README still showed the old version: https://huggingface.co/datasets/jackvial/screwdriver_attach_panel_ls_080125_14_e8.

This was a bit confusing because at first it looked like the dataset had not been converted. I understand it's a bit tricky since trying to pattern match and update the version here might accidentally overwrite changes someone has made to the readme.

Maybe going forward this info could be rendered dynamically with a dataset component or just removed from the default dataset readme to make conversion simpler.


zhangpeng66 · 2025-12-12T05:58:52Z

How can I modify the task name of a LeRobotDataset?

owenonline · 2026-01-15T03:01:51Z

Is there a supported way to convert back from lerobot v3 to v2.1? The GR00T repository does not work with the new schema, so a lot of newer datasets that use v3 are very hard to use for fine tuning.


Simon Alibert and others added 30 commits February 10, 2025 16:39

- Bump CODEBASE_VERSION (38c1457)
- Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 (57c9c21)
- Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 (d67ca34)
- Add frame level task (#693) (9d6886d)
  Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
- Validate features during add_frame + Add 2D-to-5D + Add string (#720) (7c2bbee)
- Per-episode stats (#521) (8426c64)
  Co-authored-by: Remi Cadene <re.cadene@gmail.com> Co-authored-by: Remi <remi.cadene@huggingface.co>
- Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 (aed3eb4)
- Merge remote-tracking branch 'origin/main' into user/aliberts/2025_02_10_dataset_v2.1 (624eaf1)
- support openx/rlds to lerobot (02bc4e0)
- Remove local_files_only and use codebase_version instead of branches (#734) (fbf2f22)
- Merge remote-tracking branch 'tavish9_lerobot_openx/main' into user/rcadene/2025_02_19_port_openx (76436ca)
- Use HF_HOME env variable (#753) (2487228)
- Add tag (6fe42a7)
- Remove dataset consolidate (#752) (969ef74)
- Improve doc (392a8c3)
- Fix batch convert (64ed525)
- Merge remote-tracking branch 'origin/user/aliberts/2025_02_10_dataset_v2.1' into user/rcadene/2025_02_19_port_openx (b520941)
- WIP (71d1f5e)
- fix No such file or directory error (5fbbaa1)
- rm brake (93c80b2)
- workers (52fb414)
- before new launch from scratch (15e7a9d)
- new dir (eda0b99)
- optimize shard (689c5ef)
- let's go (39ad2d1)
- aggregate works (ff0029f)
- Add auto_downsample_height_width (e2e6f6e)
- Aggregate works (c36d225)
- Add upload_large_folder (3daab2a)
- WIP UploadDataset (3666ac9)

CarolinePascal self-requested a review September 12, 2025 14:56

CarolinePascal previously approved these changes Sep 12, 2025

fix(v3.0 message): updating v3.0 backward compatibility message.

ad39bbc

CarolinePascal dismissed their stale review via ad39bbc September 12, 2025 15:25

CarolinePascal self-requested a review September 12, 2025 15:25

CarolinePascal approved these changes Sep 12, 2025

imstevenpmwork assigned michel-aractingi Sep 12, 2025

imstevenpmwork added the documentation, enhancement, and dataset labels Sep 12, 2025

AdilZouitine approved these changes Sep 15, 2025

michel-aractingi merged commit f55c6e8 into main Sep 15, 2025
17 checks passed

michel-aractingi deleted the user/michel-aractingi/2025_06_30_dataset_v3 branch September 15, 2025 07:53

michel-aractingi mentioned this pull request Sep 15, 2025

Update dataset card by default #1936

Merged

michel-aractingi mentioned this pull request Sep 20, 2025

bump datasets to 4.0.0 #1990

Merged

This was referenced Oct 7, 2025

LeRobotDataset v3 #969

Closed

Add inference to visualize_dataset_html.py #353

Closed

tc-huang mentioned this pull request Oct 11, 2025

Data diffusion and data format conversion #2171

Open

This was referenced Oct 17, 2025

wait_image_writer #1846

Closed

[0.3.4] Severe GPU underutilization vs 0.3.3 (2x slower on same job) #2282

Closed

oxkitsune mentioned this pull request Nov 20, 2025

Support LeRobotDataset v3.0 rerun-io/rerun#11931

Closed
