LeRobotDataset v3 #969
Commits:
- Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
- Co-authored-by: Remi Cadene <re.cadene@gmail.com>
- Co-authored-by: Remi <remi.cadene@huggingface.co>
- …cadene/2025_02_19_port_openx
- …_v2.1' into user/rcadene/2025_02_19_port_openx
- switch back to set_transform instead of set_format
- Add video_files_size_in_mb
- pre-commit run --all-files
- …v3' into user/rcadene/2025_04_11_dataset_v3
- …hen fix test_datasets and test_aggregate (all passing)
(shallow) review done, overall LGTM! Everything is very cool 🥳 I have just left a few minor comments here and there, and a question re: why zip(..., strict=False).
Apart from that one, I'm happy to own the mods implementing the comments I have left. @Cadene, thank you so much 💪
lerobot/common/datasets/aggregate.py (outdated)

```python
if roots is None:
    all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
```
@Cadene strict=False allows zipping iterables of different lengths, silently truncating the longer one(s) to the length of the shortest.
Thus, when providing roots, `for repo_id, root in zip(repo_ids, roots, strict=False)` silently cuts either repo_ids or roots short. When is this the intended behavior?
For completeness, I see root is used to point LeRobotDataset to load the dataset from root. Why would you not aggregate all the datasets you have, passing (1) the repo_ids and (2) the roots as inputs? Pretty sure I am missing something, but I would rather throw an error and fail loudly here if a supplied list of repo_ids does not match the associated paths 🤔
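For reference, a minimal standalone snippet (not from this PR) showing the difference between the two modes:

```python
repo_ids = ["user/ds_a", "user/ds_b", "user/ds_c"]
roots = ["/data/ds_a", "/data/ds_b"]  # one root accidentally missing

# strict=False silently drops "user/ds_c" from the iteration:
print(list(zip(repo_ids, roots, strict=False)))
# [('user/ds_a', '/data/ds_a'), ('user/ds_b', '/data/ds_b')]

# strict=True (Python >= 3.10) fails loudly instead:
list(zip(repo_ids, roots, strict=True))
# ValueError: zip() argument 2 is shorter than argument 1
```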
```python
for c, f in zip(
    meta.episodes["meta/episodes/chunk_index"],
    meta.episodes["meta/episodes/file_index"],
    strict=False,
):
```
lerobot/common/datasets/aggregate.py (outdated)

```python
    aggr_videos_chunk_idx[key],
    aggr_videos_file_idx[key],
)
# copy_command = f"cp {video_path} {aggr_video_path} &"
```
Nit: remove the commented-out code?
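If the copy is still needed, a portable in-Python replacement for the shelled-out cp could look like this (a sketch; the paths are hypothetical stand-ins for the names in the commented-out line):

```python
import shutil
from pathlib import Path

video_path = Path("videos/chunk-000/file-000.mp4")             # hypothetical source
aggr_video_path = Path("aggr/videos/chunk-000/file-000.mp4")   # hypothetical target

# Ensure the destination directory exists, then copy the file;
# copy2 also preserves metadata (timestamps, permissions).
aggr_video_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(video_path, aggr_video_path)
```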
lerobot/common/datasets/aggregate.py (outdated)

```python
    chunk_index=aggr_videos_chunk_idx[key],
    file_index=aggr_videos_file_idx[key],
)
if not aggr_path.exists():
```
Not sure I see when aggr_path.parent needs to be created 🤔 At line 193:

```python
aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(...)  # aggr_root is aggr_path.parent
```
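One case where the parent would still need creating: if DEFAULT_VIDEO_PATH contains subdirectories (e.g. per-chunk folders), then aggr_path.parent is a nested directory under aggr_root, not aggr_root itself. A minimal illustration with a hypothetical path template:

```python
from pathlib import Path

# Hypothetical template with a chunk subdirectory, similar in spirit
# to a chunked DEFAULT_VIDEO_PATH (not the library's actual constant):
DEFAULT_VIDEO_PATH = "videos/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

aggr_root = Path("/data/aggregated")
aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(chunk_index=0, file_index=7)

print(aggr_path.parent)               # /data/aggregated/videos/chunk-000
print(aggr_path.parent == aggr_root)  # False: nested dirs must be created
aggr_path.parent.mkdir(parents=True, exist_ok=True)
```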
lerobot/common/datasets/aggregate.py (outdated)

```python
    aggr_dataset[i]
    pass

aggr_dataset.push_to_hub(tags=["openx"])
```
Nit: we'd remove this, right?
| "stats": stats_factory(features), | ||
| } | ||
| return episodes_stats | ||
| # @pytest.fixture(scope="session") |
Remove commented-out code?
```python
# for ep_idx in range(total_episodes):
#     flat_ep_stats = flatten_dict(stats_factory(features))
#     flat_ep_stats["episode_index"] = ep_idx
#     yield flat_ep_stats
```
Remove commented-out code
```diff
 raise ValueError("total_length must be greater than or equal to num_episodes.")

-if not tasks:
+if tasks is None:
```
❤️
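For context, a minimal illustration of why `is None` is the safer check here: `not tasks` is also true for an explicitly empty container, not only for a missing value.

```python
tasks = []  # an explicitly empty task list

print(not tasks)      # True: empty containers are falsy, same as None
print(tasks is None)  # False: only a genuinely missing value matches
```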
(very very tiny nit): I understand our v3 is mainly oriented towards being able to use DROID and Alpha, but perhaps I would present the examples in a separate PR.
👋🏽 The next release of L2D is expected to be 7 TB in size (currently 700 GB). Happy to test this and add it as an example.
Hey @sandhawalia, want to reach out and discuss this further? francesco.capuano@huggingface.co if you want to send an invite; anything between 10am and 9pm CEST works.
Sure thing, invite is in your inbox.
@Cadene very cool! This entire SLURM series is very powerful; I think it can have quite an impact. Do you think there is any way we could add it to the library, to provide the community with a tested way to download/upload things via SLURM?
We also need a scalable way of building the next release of L2D and any pointers / examples would help.
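As a rough sketch of the pattern being discussed (the repo id and chunk layout are hypothetical; only snapshot_download and the SLURM_ARRAY_TASK_ID env var are standard pieces), a SLURM job array could shard a Hub dataset download like this:

```python
import os
from huggingface_hub import snapshot_download

# Each SLURM array task downloads one slice of the dataset.
# Launched with e.g.: sbatch --array=0-99 download.slurm
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

snapshot_download(
    repo_id="some-user/some-dataset",  # hypothetical repo id
    repo_type="dataset",
    # Hypothetical per-chunk layout: each task pulls only its own chunk.
    allow_patterns=[f"data/chunk-{task_id:03d}/*"],
    local_dir="/scratch/dataset",
)
```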
Commits:
- …v3' into user/rcadene/2025_04_11_dataset_v3
- add: tests forcing new file creation
- fix: tests depending on various sizes, and duration is updated
I'm closing this PR as it has been superseded by #1412
What this does
Explain what this PR does. Feel free to tag your PR with the appropriate label(s).
Examples:
How it was tested
Explain/show how you tested your changes.
Examples:
- Added `test_something` in `tests/test_stuff.py`.
- Added `new_feature` and checked that training converges with policy X on dataset/environment Y.
- Optimized `some_function`, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)
Provide a simple way for the reviewer to try out your changes.
Examples:
SECTION TO REMOVE BEFORE SUBMITTING YOUR PR
Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.
Note: Before submitting this PR, please read the contributor guideline.