LeRobotDataset v3 by Cadene · Pull Request #969 · huggingface/lerobot · GitHub

Conversation

@Cadene (Contributor) commented Apr 11, 2025

What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title | Label
Fixes #[issue] | 🐛 Bug
Adds new dataset | 🗃️ Dataset
Optimizes something | ⚡️ Performance

How it was tested

Explain/show how you tested your changes.

Examples:

  • Added test_something in tests/test_stuff.py.
  • Added new_feature and checked that training converges with policy X on dataset/environment Y.
  • Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

pytest -sx tests/test_stuff.py::test_something
python lerobot/scripts/train.py --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

Simon Alibert and others added 30 commits February 10, 2025 16:39
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Remi <remi.cadene@huggingface.co>
…_v2.1' into user/rcadene/2025_02_19_port_openx
@fracapuano (Contributor) left a comment:

(Shallow) review done; overall LGTM! Everything is very cool 🥳 I've left a few minor comments here and there, and a question re: why zip(..., strict=False).

Apart from that one, I'm happy to own the changes implementing the comments I've left, @Cadene. Thank you so much 💪


if roots is None:
all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
Contributor:

@Cadene strict=False allows zipping iterables of different lengths, truncating the longer one(s) to the shortest.
Thus, when roots is provided, for repo_id, root in zip(repo_ids, roots, strict=False) silently cuts whichever of repo_ids or roots is longer. When is that the intended behavior?
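For reference, a minimal standalone illustration of that truncation behavior (Python ≥ 3.10; the repo ids and paths below are made up):

```python
repo_ids = ["lerobot/aloha_static", "lerobot/droid", "lerobot/pusht"]  # hypothetical ids
roots = ["/data/aloha", "/data/droid"]  # accidentally one entry short

# strict=False (the default) silently drops the unmatched repo_id:
print(list(zip(repo_ids, roots, strict=False)))
# [('lerobot/aloha_static', '/data/aloha'), ('lerobot/droid', '/data/droid')]

# strict=True surfaces the mismatch immediately:
try:
    list(zip(repo_ids, roots, strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1
```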

Contributor:

For completeness, I see root is used to point LeRobotDataset at the directory to load the dataset from. Why not aggregate all the datasets you have, with (1) their ids in repo_ids and (2) their paths in roots? I'm pretty sure I'm missing something, but I would rather throw an error and fail loudly here if the list of repo_ids does not match the associated paths 🤔 (see the sketch below).
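A sketch of that loud-failure variant, assuming (as the quoted hunk suggests) that LeRobotDatasetMetadata accepts an optional root and that roots, when given, must pair one-to-one with repo_ids:

```python
if roots is None:
    all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
    if len(roots) != len(repo_ids):
        raise ValueError(
            f"Expected one root per repo_id, got {len(repo_ids)} repo_ids "
            f"and {len(roots)} roots."
        )
    all_metadata = [
        # strict=True is redundant with the check above, but documents the invariant.
        LeRobotDatasetMetadata(repo_id, root=root)
        for repo_id, root in zip(repo_ids, roots, strict=True)
    ]
```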

for c, f in zip(
meta.episodes["meta/episodes/chunk_index"],
meta.episodes["meta/episodes/file_index"],
strict=False,
Contributor:

Again, it's not clear to me why you would not want to fail when zipping chunk and file indices of different sizes. What am I missing?

I understand you're splitting the content of each file into multiple chunks (right?), but strict=False means you are effectively dropping the slack entries. See the screenshot and the sketch below 👇
[Screenshot 2025-05-15 at 15:59:11]
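If every chunk index is expected to pair with a file index, the loud-failure fix would be a one-word change; a sketch of the same loop:

```python
for c, f in zip(
    meta.episodes["meta/episodes/chunk_index"],
    meta.episodes["meta/episodes/file_index"],
    strict=True,  # raise ValueError instead of silently dropping trailing entries
):
    ...
```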

aggr_videos_chunk_idx[key],
aggr_videos_file_idx[key],
)
# copy_command = f"cp {video_path} {aggr_video_path} &"
Contributor:

Nit: remove the commented-out code?

chunk_index=aggr_videos_chunk_idx[key],
file_index=aggr_videos_file_idx[key],
)
if not aggr_path.exists():
Contributor:

I'm not sure I see when aggr_path.parent needs to be created 🤔 At line 193:

aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(...)  # aggr_root is aggr_path.parent
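One case where the parent would need creating, sketched with a hypothetical path template (the real DEFAULT_VIDEO_PATH may differ): if the template nests files in subdirectories, aggr_path.parent sits several levels below aggr_root and does not exist on first write.

```python
from pathlib import Path

# Hypothetical template for illustration only; the real DEFAULT_VIDEO_PATH may differ.
DEFAULT_VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

aggr_root = Path("/data/aggregated")
aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(
    video_key="observation.images.top", chunk_index=0, file_index=0
)
# aggr_path.parent is /data/aggregated/videos/observation.images.top/chunk-000,
# several levels below aggr_root, so it must be created before the copy:
aggr_path.parent.mkdir(parents=True, exist_ok=True)
```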

aggr_dataset[i]
pass

aggr_dataset.push_to_hub(tags=["openx"])
Contributor:

Nit: we'd remove this, right?

"stats": stats_factory(features),
}
return episodes_stats
# @pytest.fixture(scope="session")
Contributor:

Remove the commented-out code?

# for ep_idx in range(total_episodes):
# flat_ep_stats = flatten_dict(stats_factory(features))
# flat_ep_stats["episode_index"] = ep_idx
# yield flat_ep_stats
Contributor:

Remove commented-out code

raise ValueError("total_length must be greater than or equal to num_episodes.")

if not tasks:
if tasks is None:
Contributor:

❤️
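The distinction matters because not tasks is also true for an empty list, while tasks is None only matches the missing-argument sentinel; a minimal illustration:

```python
tasks = []  # caller explicitly passed zero tasks

if not tasks:
    print("runs: [] and None are both falsy")

if tasks is None:
    print("does not run: only the None sentinel matches")
```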

Contributor:

(Very, very tiny nit): I understand our v3 is mainly oriented towards being able to use DROID and Alpha, but I would perhaps open a separate PR presenting the examples.

Contributor:

👋🏽 The next release of L2D is expected to be 7 TB in size (the current one is 700 GB). Happy to test this and add it as an example.

Contributor:

Hey @sandhawalia, want to reach out and discuss this further? francesco.capuano@huggingface.co if you want to send an invite; anything between 10am and 9pm CEST works.

Contributor:

Sure thing, invite is in your inbox.

Contributor:

@Cadene very cool! This entire SLURM series is very powerful; I think it can have quite an impact. Do you think there is any way we could add it to the library, to give the community a tested way to download/upload things via SLURM? (A rough sketch is below.)
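A rough sketch of what such a helper might look like, sharding a dataset download across a SLURM job array via the standard SLURM_ARRAY_TASK_ID / SLURM_ARRAY_TASK_COUNT variables; the repo id, local path, and sharding scheme are assumptions, not code from this PR:

```python
import os

from huggingface_hub import HfApi, hf_hub_download

# Hypothetical sketch, launched with e.g. `sbatch --array=0-31 download.sbatch`.
repo_id = "lerobot/droid"  # placeholder repo id
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
num_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))

# Round-robin shard of the repo's files for this array task.
files = HfApi().list_repo_files(repo_id, repo_type="dataset")
for filename in files[task_id::num_tasks]:
    hf_hub_download(repo_id, filename, repo_type="dataset", local_dir="/data/droid")
```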

Contributor:

We also need a scalable way of building the next release of L2D, and any pointers/examples would help.

* add: tests forcing new file creation

* fix: tests depending on various sizes, and duration is updated
@imstevenpmwork (Collaborator) commented:

I'm closing this PR as it has been superseded by #1412.

@imstevenpmwork deleted the user/rcadene/2025_04_11_dataset_v3 branch on January 6, 2026 at 22:45.

Labels

dataset: Issues regarding data inputs, processing, or datasets
enhancement: Suggestions for new features or improvements
performance: Issues aimed at improving speed or resource usage

7 participants