Dataset v3 #1412
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com> Co-authored-by: Remi <remi.cadene@huggingface.co>
Very good job folks 🎉 Let's merge this !
@michel-aractingi these improvements look great! Small thing I noticed when running This was a bit confusing because at first it looked like the dataset had not been converted. I understand it's a bit tricky since trying to pattern match and update the version here might accidentally overwrite changes someone has made to the readme. Maybe going forward this info could be rendered dynamically with a dataset component or just removed from the default dataset readme to make conversion simpler.
* Dataset v3 (huggingface#1412)
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Tavish <tavish9.chen@gmail.com>
Co-authored-by: fracapuano <francesco.capuano@huggingface.co>
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>
* Add Streaming Dataset (huggingface#1613)
Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
* Update dataset card by default (huggingface#1936)
* remove condition on model card update
* use names from var
---------
Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Tavish <tavish9.chen@gmail.com>
Co-authored-by: fracapuano <francesco.capuano@huggingface.co>
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>
Co-authored-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
How can I modify the task name of a LeRobotDataset?
Is there a supported way to convert back from lerobot v3 to v2.1? The GR00T repository does not work with the new schema, so a lot of newer datasets that use v3 are very hard to use for fine tuning.


LeRobotDataset v3.0
LeRobotDataset v3 is an upgrade to the dataset infrastructure that significantly improves performance and scalability.
Key idea: Move from episodic file system to files containing chunked episodes.
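The key idea can be sketched in plain Python: several episodes are concatenated into one shared table, and a per-episode row range is recorded so that a single episode can still be recovered. This is only an illustration under stated assumptions; `EpisodeSpan`, `pack_episodes`, and `read_episode` are hypothetical names and not part of the LeRobot API.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSpan:
    """Row range of one episode inside a shared file (hypothetical helper)."""
    episode_index: int
    from_index: int  # first row of the episode (inclusive)
    to_index: int    # one past the last row of the episode (exclusive)


def pack_episodes(episodes):
    """Concatenate per-episode frame lists into one table and record row spans."""
    table, spans = [], []
    for episode_index, frames in enumerate(episodes):
        start = len(table)
        table.extend(frames)
        spans.append(EpisodeSpan(episode_index, start, len(table)))
    return table, spans


def read_episode(table, spans, episode_index):
    """Recover a single episode by slicing the shared table."""
    span = spans[episode_index]
    return table[span.from_index:span.to_index]


# Three toy episodes stored in a single table instead of three files.
episodes = [["a0", "a1", "a2"], ["b0", "b1"], ["c0"]]
table, spans = pack_episodes(episodes)
second = read_episode(table, spans, 1)
```

With this layout the number of files no longer grows with the number of episodes; the cost is that every read has to go through the recorded row ranges.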
Key Improvements
File Organization
v2.1:
  data/chunk-000/episode_000000.parquet
  videos/image_key/chunk-000/episode-000.mp4
v3.0:
  data/chunk-000/file-000.parquet
  videos/chunk-000/image_key/file-000.mp4

Metadata evolution
v2.1: JSONL files (episodes.jsonl, tasks.jsonl, episodes_stats.jsonl)
v3.0: Parquet files (meta/episodes/chunk-000/file-000.parquet)

v2.1: dataset.episode_data_index["from"][0].item()
v3.0: dataset.meta.episodes["dataset_from_index"][0]
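The v2.1 and v3.0 file layouts listed above can be written as path templates. The following is a minimal sketch that only reproduces the example paths from this description; the zero-padding widths are read off those examples, and the function names and the `chunk_size` default of 1000 are assumptions made for illustration, not values taken from the LeRobot code.

```python
def v21_data_path(episode_index: int, chunk_size: int = 1000) -> str:
    """v2.1 style: one parquet file per episode (chunk_size is an assumed value)."""
    chunk_index = episode_index // chunk_size
    return f"data/chunk-{chunk_index:03d}/episode_{episode_index:06d}.parquet"


def v30_data_path(chunk_index: int, file_index: int) -> str:
    """v3.0 style: one parquet file holds many episodes, so no episode index appears."""
    return f"data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"


def v30_video_path(image_key: str, chunk_index: int, file_index: int) -> str:
    """v3.0 style video path, with the chunk directory above the camera key."""
    return f"videos/chunk-{chunk_index:03d}/{image_key}/file-{file_index:03d}.mp4"


old_path = v21_data_path(0)
new_path = v30_data_path(0, 0)
new_video = v30_video_path("image_key", 0, 0)
```

Because a v3.0 path no longer encodes the episode index, finding an episode relies on the episode metadata rather than on the file name.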
New scripts
src/lerobot/datasets/aggregate.py: Functions for aggregating multiple datasets, with metadata validation and episode merging capabilities
src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py: Conversion script to migrate datasets from v2.1 to v3.0 format
examples/port_datasets/*: Scripts to port DROID and AgiBot datasets, with Slurm and datatrove support
Conversion from v2.1
A conversion script is provided in lerobot/datasets/v30/convert_dataset_v21_to_v30.py.
What Gets Converted
Benchmark
Dataset V3.0 benchmarks are available in this file. The benchmark compared v2.1 against six v3.0 variants with different maximum video file sizes (10, 50, 100, 250, 500, and 1000 MB) to evaluate how file size affects performance across these metrics.
For some file sizes, Dataset V3.0 performs on par with or better than Dataset V2.1. Dataset V3.0 therefore lets us support larger datasets without sacrificing performance.
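To make the role of the maximum file size concrete, here is a minimal sketch of a size-capped rollover. It assumes a greedy policy in which a new file is started whenever adding the next episode would exceed the cap; the policy actually used by the conversion script may differ, and `assign_files` and the episode sizes are hypothetical.

```python
def assign_files(episode_sizes_mb, max_file_size_mb):
    """Greedily assign episodes to files under a size cap (assumed policy).

    Returns one file index per episode. An episode larger than the cap
    still gets a file of its own rather than being split.
    """
    assignments = []
    file_index, current_mb = 0, 0.0
    for size_mb in episode_sizes_mb:
        # Roll over only if the current file already holds something.
        if current_mb > 0 and current_mb + size_mb > max_file_size_mb:
            file_index += 1
            current_mb = 0.0
        assignments.append(file_index)
        current_mb += size_mb
    return assignments


sizes = [4.0, 5.0, 3.0, 12.0, 2.0]   # made-up episode video sizes in MB
small = assign_files(sizes, 10)      # a small cap yields many files
large = assign_files(sizes, 1000)    # a large cap packs everything into one file
```

A smaller cap produces more, smaller files, while a larger cap produces fewer, larger ones; the benchmark variants explore this trade-off.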
📋 TODOs
aggregate.py: Fix Aggregation, Add Tests #1264
Next steps
Merge LeRobotDatasetStreaming to main #1165