Dataset v3 by michel-aractingi · Pull Request #1412 · huggingface/lerobot · GitHub

Conversation

@michel-aractingi
Collaborator
@michel-aractingi michel-aractingi commented Jun 30, 2025

LeRobotDataset v3.0

LeRobotDataset v3.0 is an upgrade to the dataset infrastructure that significantly improves performance and scalability.

Key idea: Move from an episode-based file layout (one file per episode) to files that each contain multiple episodes.

Key Improvements

  • Faster loading: Reduced dataset initialization time
  • Better performance: More efficient memory usage and data access
  • Scalable format: The new format is designed to support large-scale datasets like Droid

File Organization

  • Updated file structure: Transitioned from episode-based to file-based organization
    • <=v2.1:
      • Files: data/chunk-000/episode_000000.parquet
      • Videos: videos/chunk-000/image_key/episode_000000.mp4
    • v3.0:
      • Files: data/chunk-000/file-000.parquet
      • Videos: videos/image_key/chunk-000/file-000.mp4

Metadata evolution

  • Unified metadata structure: All episode metadata now stored in structured parquet files
    • Before: JSON Lines format (episodes.jsonl, tasks.jsonl, episodes_stats.jsonl)
    • After: parquet format (meta/episodes/chunk-000/file-000.parquet)
  • Per-episode statistics: Enhanced statistics tracking at the episode level
  • Simplified episode access:
    • Before: dataset.episode_data_index["from"][0].item()
    • After: dataset.meta.episodes["dataset_from_index"][0]
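
As a minimal sketch of what the new access pattern allows, the snippet below uses a plain dictionary of lists as a stand-in for dataset.meta.episodes. The dataset_from_index column comes from the example above; the dataset_to_index column (assumed here to be an exclusive end index), the sample values, and both helper names are assumptions made for illustration, not the library's API.

```python
from bisect import bisect_right

# Stand-in for dataset.meta.episodes: one entry per episode.
# "dataset_to_index" is an assumed column holding the exclusive end frame.
episodes = {
    "dataset_from_index": [0, 120, 300],
    "dataset_to_index": [120, 300, 450],
}


def episode_frame_range(episodes, episode_index):
    """Return the global frame indices covered by one episode."""
    start = episodes["dataset_from_index"][episode_index]
    stop = episodes["dataset_to_index"][episode_index]
    return range(start, stop)


def episode_of_frame(episodes, frame_index):
    """Return the index of the episode that contains a global frame index."""
    starts = episodes["dataset_from_index"]
    # Last episode whose start is <= frame_index (starts are sorted).
    ep = bisect_right(starts, frame_index) - 1
    if ep < 0 or frame_index >= episodes["dataset_to_index"][ep]:
        raise IndexError(f"frame {frame_index} is outside the dataset")
    return ep


print(len(episode_frame_range(episodes, 1)))  # 180
print(episode_of_frame(episodes, 299))  # 1
```

Because several episodes now share one file, this kind of lookup is what lets a caller filter the rows of a single episode out of a chunked file.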

v2.1

dataset/
├── meta/
│   ├── episodes.jsonl
│   ├── tasks.jsonl
│   ├── episodes_stats.jsonl
│   └── info.json
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── episode_000001.parquet
└── videos/
    └── chunk-000/
        └── camera_key/
            ├── episode_000000.mp4
            └── episode_000001.mp4

v3.0

dataset/
├── meta/
│   ├── episodes/
│   │   └── chunk-000/
│   │       └── file-000.parquet
│   ├── tasks.parquet
│   ├── stats.json
│   └── info.json
├── data/
│   └── chunk-000/
│       └── file-000.parquet
└── videos/
    └── camera_key/
        └── chunk-000/
            └── file-000.mp4
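
To make the difference between the two trees concrete, here is a small sketch that builds paths for both layouts. The template strings are transcribed from the trees above; the constant names, function names, and parameters are hypothetical and are not imported from lerobot.

```python
# Path templates transcribed from the two directory trees above.
# They are illustrative only and are not taken from the lerobot codebase.
V21_DATA = "data/chunk-{chunk:03d}/episode_{episode:06d}.parquet"
V21_VIDEO = "videos/chunk-{chunk:03d}/{camera_key}/episode_{episode:06d}.mp4"
V30_DATA = "data/chunk-{chunk:03d}/file-{file:03d}.parquet"
V30_VIDEO = "videos/{camera_key}/chunk-{chunk:03d}/file-{file:03d}.mp4"


def v21_paths(episode_index, camera_key, chunk_index=0):
    """v2.1: one data file and one video file per episode."""
    return (
        V21_DATA.format(chunk=chunk_index, episode=episode_index),
        V21_VIDEO.format(
            chunk=chunk_index, camera_key=camera_key, episode=episode_index
        ),
    )


def v30_paths(file_index, camera_key, chunk_index=0):
    """v3.0: files are numbered independently of episodes."""
    return (
        V30_DATA.format(chunk=chunk_index, file=file_index),
        V30_VIDEO.format(chunk=chunk_index, camera_key=camera_key, file=file_index),
    )


# Three episodes need three data files in v2.1 ...
v21_data_files = {v21_paths(ep, "camera_key")[0] for ep in range(3)}
# ... but can share a single data file in v3.0.
v30_data_files = {v30_paths(0, "camera_key")[0] for _ in range(3)}
print(len(v21_data_files), len(v30_data_files))  # 3 1
```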

New scripts

  • src/lerobot/datasets/aggregate.py: Functions for aggregating multiple datasets, with metadata validation and episode merging capabilities
  • src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py: Conversion script to migrate datasets from v2.1 to v3.0 format
  • examples/port_datasets/*: Scripts to port the Droid and AgiBot datasets, with SLURM and datatrove support

Conversion from v2.1

A conversion script is provided in src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py. Usage:

python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py --repo-id=your/dataset

What Gets Converted

  1. Data consolidation: Multiple episode files merged into optimally-sized chunks
  2. Metadata restructuring: JSON Lines converted to structured parquet format
  3. Video reorganization: Per-episode videos consolidated into efficient chunks
  4. Statistics aggregation: Enhanced per-episode and global statistics
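
The data consolidation step can be pictured as packing consecutive episodes into files under a size cap. The sketch below is one simple greedy way to do that; it is an assumption for illustration, and the actual conversion script may use a different strategy. The function name, the per-episode sizes, and the 100 MB cap are all hypothetical.

```python
def pack_episodes(episode_sizes_mb, max_file_mb):
    """Greedily assign consecutive episodes to files under a size cap.

    Returns a list of files, each a list of episode indices. An episode
    larger than the cap ends up alone in its own file.
    """
    files = []
    current, current_size = [], 0.0
    for episode_index, size in enumerate(episode_sizes_mb):
        # Start a new file when adding this episode would exceed the cap.
        if current and current_size + size > max_file_mb:
            files.append(current)
            current, current_size = [], 0.0
        current.append(episode_index)
        current_size += size
    if current:
        files.append(current)
    return files


# Hypothetical per-episode sizes in MB, packed with a 100 MB cap.
sizes = [40, 30, 50, 20, 120, 10]
print(pack_episodes(sizes, 100))  # [[0, 1], [2, 3], [4], [5]]
```

Keeping episodes in their original order means each file covers a contiguous range of episode indices, so per-episode metadata only needs to record which file an episode lives in and where it starts and ends.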

Benchmark

Dataset V3.0 benchmarks are available in this file. The benchmark compared v2.1 against six v3.0 variants with different maximum video file sizes (10, 50, 100, 250, 500, and 1000 MB) to evaluate how file size affects performance across the following metrics:

  1. Download Time (s) - Time to download the dataset
  2. Metadata Initialization Time (s) - Time to initialize dataset metadata
  3. Access Rate (samples/sec) - Number of samples that can be accessed per second
  4. Memory Usage (MB) - RAM consumption during operation (unreliable metric in this study)

For some file sizes, Dataset V3.0 performs on par with or better than Dataset V2.1. Dataset V3.0 therefore lets us support larger datasets without sacrificing performance.
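
For readers who want to reproduce a metric like the access rate on their own data, a rough sketch follows. It is not the methodology used in the benchmark file; the function name is hypothetical, and a plain list stands in for a dataset object that supports integer indexing.

```python
import time


def measure_access_rate(dataset, indices):
    """Return samples read per second when accessing dataset[i] for each index."""
    start = time.perf_counter()
    for i in indices:
        _ = dataset[i]  # The sample is discarded; only the access is timed.
    elapsed = time.perf_counter() - start
    return len(indices) / elapsed if elapsed > 0 else float("inf")


# A plain list stands in for a real dataset here.
toy_dataset = list(range(10_000))
rate = measure_access_rate(toy_dataset, range(0, 10_000, 7))
print(f"{rate:.0f} samples/sec")
```

With a real dataset, running the same measurement against each file-size variant gives numbers that can be compared directly.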

📋 TODOs

  • Convert all v2.1 datasets on lerobot
  • Cherry-pick commits for aggregate.py (Fix Aggregation, Add Tests #1264)
  • Add testing on the episode level to ensure metadata consistency during data collection
  • Fix replay script to filter through episode indices since data is chunked
  • Validate resume recording still works
  • Verify that the Dataset V3.0 visualizer works
  • Add benchmarks to the PR
  • Port Droid in v3 format
  • Port AgiBot in v3 format

Next steps

Merge LeRobotDatasetStreaming to main #1165

Simon Alibert and others added 30 commits February 10, 2025 16:39
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Remi <remi.cadene@huggingface.co>
…_v2.1' into user/rcadene/2025_02_19_port_openx
@CarolinePascal CarolinePascal self-requested a review September 12, 2025 14:56
CarolinePascal previously approved these changes Sep 12, 2025
Collaborator
@CarolinePascal CarolinePascal left a comment

Very good job folks 🎉 Let's merge this !

@imstevenpmwork imstevenpmwork added the documentation, enhancement, and dataset labels Sep 12, 2025
@jackvial
Contributor

@michel-aractingi these improvements look great!

A small thing I noticed when running python lerobot/datasets/v30/convert_dataset_v21_to_v30.py:
The default dataset README includes the dataset info with the codebase_version. For example, I converted this dataset and the conversion looks correct, but the README still showed the old version: https://huggingface.co/datasets/jackvial/screwdriver_attach_panel_ls_080125_14_e8.

This was a bit confusing because at first it looked like the dataset had not been converted. I understand it's a bit tricky since trying to pattern match and update the version here might accidentally overwrite changes someone has made to the readme.

Maybe going forward this info could be rendered dynamically with a dataset component or just removed from the default dataset readme to make conversion simpler.

[Two screenshots attached]

@michel-aractingi michel-aractingi merged commit f55c6e8 into main Sep 15, 2025
17 checks passed
@michel-aractingi michel-aractingi deleted the user/michel-aractingi/2025_06_30_dataset_v3 branch September 15, 2025 07:53
shawnpatel added a commit to almond-bot/lerobot that referenced this pull request Sep 15, 2025
* Dataset v3 (huggingface#1412)

Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Tavish <tavish9.chen@gmail.com>
Co-authored-by: fracapuano <francesco.capuano@huggingface.co>
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>

* Add Streaming Dataset (huggingface#1613)

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>

* Update dataset card by default (huggingface#1936)

* remove condition on model card update

* use names from var

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Tavish <tavish9.chen@gmail.com>
Co-authored-by: fracapuano <francesco.capuano@huggingface.co>
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>
Co-authored-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
WangYixuan12 pushed a commit to WangYixuan12/lerobot that referenced this pull request Sep 17, 2025
manmohan659 pushed a commit to manmohan659/lerobot that referenced this pull request Sep 27, 2025
brysonjones pushed a commit to brysonjones/lerobot that referenced this pull request Nov 12, 2025
nepyope pushed a commit that referenced this pull request Nov 21, 2025
@zhangpeng66

How can I modify the task name of a LeRobotDataset?

@owenonline

Is there a supported way to convert back from lerobot v3 to v2.1? The GR00T repository does not work with the new schema, so a lot of newer datasets that use v3 are very hard to use for fine tuning.

sandhya-cb pushed a commit to sandhya-cb/lerobot-clutterbot that referenced this pull request Jan 28, 2026

Labels

  • dataset: Issues regarding data inputs, processing, or datasets
  • documentation: Improvements or fixes to the project’s docs
  • enhancement: Suggestions for new features or improvements
