LeRobotDataset v3 by Cadene · Pull Request #969 · huggingface/lerobot · GitHub

Conversation

@Cadene (Contributor) commented Apr 11, 2025

What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title | Label
Fixes #[issue] | 🐛 Bug
Adds new dataset | 🗃️ Dataset
Optimizes something | ⚡️ Performance

How it was tested

Explain/show how you tested your changes.

Examples:

  • Added test_something in tests/test_stuff.py.
  • Added new_feature and checked that training converges with policy X on dataset/environment Y.
  • Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

pytest -sx tests/test_stuff.py::test_something
python lerobot/scripts/train.py --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

Simon Alibert and others added 30 commits February 10, 2025 16:39
Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>
Co-authored-by: Remi Cadene <re.cadene@gmail.com>
Co-authored-by: Remi <remi.cadene@huggingface.co>
…_v2.1' into user/rcadene/2025_02_19_port_openx
@fracapuano (Contributor) left a comment:

(Shallow) review done; overall LGTM! Everything is very cool 🥳 I've left a few minor comments here and there, and a question re: why zip(..., strict=False).

Apart from that one, I'm happy to own the changes implementing the comments I've left, @Cadene. Thank you so much 💪


if roots is None:
all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
Contributor:

@Cadene strict=False allows zipping iterables of different lengths, truncating the longer one(s) to the shortest.
Thus, when roots is provided, for repo_id, root in zip(repo_ids, roots, strict=False) silently cuts whichever of repo_ids or roots is longer. When is that the intended behavior?
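For reference, a minimal standalone illustration of that truncation behavior (Python ≥ 3.10; the repo ids and paths below are made up):

```python
repo_ids = ["lerobot/aloha_static", "lerobot/droid", "lerobot/pusht"]  # hypothetical ids
roots = ["/data/aloha", "/data/droid"]  # accidentally one entry short

# strict=False (the default) silently drops the unmatched repo_id:
print(list(zip(repo_ids, roots, strict=False)))
# [('lerobot/aloha_static', '/data/aloha'), ('lerobot/droid', '/data/droid')]

# strict=True surfaces the mismatch immediately:
try:
    list(zip(repo_ids, roots, strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1
```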

Contributor:

For completeness, I see root is used to point LeRobotDataset at the directory to load the dataset from. Why not aggregate all the datasets you have, with (1) their ids in repo_ids and (2) their paths in roots? I'm pretty sure I'm missing something, but I would rather throw an error and fail loudly here if the list of repo_ids does not match the associated paths 🤔 (see the sketch below).
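A sketch of that loud-failure variant, assuming (as the quoted hunk suggests) that LeRobotDatasetMetadata accepts an optional root and that roots, when given, must pair one-to-one with repo_ids:

```python
if roots is None:
    all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
    if len(roots) != len(repo_ids):
        raise ValueError(
            f"Expected one root per repo_id, got {len(repo_ids)} repo_ids "
            f"and {len(roots)} roots."
        )
    all_metadata = [
        # strict=True is redundant with the check above, but documents the invariant.
        LeRobotDatasetMetadata(repo_id, root=root)
        for repo_id, root in zip(repo_ids, roots, strict=True)
    ]
```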

for c, f in zip(
meta.episodes["meta/episodes/chunk_index"],
meta.episodes["meta/episodes/file_index"],
strict=False,
Contributor:

Again, it's not clear to me why you would not want to fail when zipping chunk and file indices of different sizes. What am I missing?

I understand you're splitting the content of each file into multiple chunks (right?), but strict=False means you are effectively dropping the slack entries. See the screenshot and the sketch below 👇
[Screenshot 2025-05-15 at 15:59:11]
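If every chunk index is expected to pair with a file index, the loud-failure fix would be a one-word change; a sketch of the same loop:

```python
for c, f in zip(
    meta.episodes["meta/episodes/chunk_index"],
    meta.episodes["meta/episodes/file_index"],
    strict=True,  # raise ValueError instead of silently dropping trailing entries
):
    ...
```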

aggr_videos_chunk_idx[key],
aggr_videos_file_idx[key],
)
# copy_command = f"cp {video_path} {aggr_video_path} &"
Contributor:

Nit: remove the commented-out code?

chunk_index=aggr_videos_chunk_idx[key],
file_index=aggr_videos_file_idx[key],
)
if not aggr_path.exists():
Contributor:

I'm not sure I see when aggr_path.parent needs to be created 🤔 At line 193:

aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(...)  # aggr_root is aggr_path.parent
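One case where the parent would need creating, sketched with a hypothetical path template (the real DEFAULT_VIDEO_PATH may differ): if the template nests files in subdirectories, aggr_path.parent sits several levels below aggr_root and does not exist on first write.

```python
from pathlib import Path

# Hypothetical template for illustration only; the real DEFAULT_VIDEO_PATH may differ.
DEFAULT_VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

aggr_root = Path("/data/aggregated")
aggr_path = aggr_root / DEFAULT_VIDEO_PATH.format(
    video_key="observation.images.top", chunk_index=0, file_index=0
)
# aggr_path.parent is /data/aggregated/videos/observation.images.top/chunk-000,
# several levels below aggr_root, so it must be created before the copy:
aggr_path.parent.mkdir(parents=True, exist_ok=True)
```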

aggr_dataset[i]
pass

aggr_dataset.push_to_hub(tags=["openx"])
Contributor:

Nit: we'd remove this, right?

"stats": stats_factory(features),
}
return episodes_stats
# @pytest.fixture(scope="session")
Contributor:

Remove the commented-out code?

# for ep_idx in range(total_episodes):
# flat_ep_stats = flatten_dict(stats_factory(features))
# flat_ep_stats["episode_index"] = ep_idx
# yield flat_ep_stats
Contributor:

Remove commented-out code

raise ValueError("total_length must be greater than or equal to num_episodes.")

if not tasks:
if tasks is None:
Contributor:

❤️
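The distinction matters because not tasks is also true for an empty list, while tasks is None only matches the missing-argument sentinel; a minimal illustration:

```python
tasks = []  # caller explicitly passed zero tasks

if not tasks:
    print("runs: [] and None are both falsy")

if tasks is None:
    print("does not run: only the None sentinel matches")
```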

Contributor:

(Very, very tiny nit): I understand our v3 is mainly oriented towards being able to use DROID and Alpha, but I would perhaps open a separate PR presenting the examples.

Contributor:

👋🏽 The next release of L2D is expected to be 7 TB in size (the current one is 700 GB). Happy to test this and add it as an example.

Contributor:

Hey @sandhawalia, want to reach out and discuss this further? francesco.capuano@huggingface.co if you want to send an invite; anything between 10am and 9pm CEST works.

Contributor:

Sure thing, invite is in your inbox.

Contributor:

@Cadene very cool! This entire SLURM series is very powerful; I think it can have quite an impact. Do you think there is any way we could add it to the library, to give the community a tested way to download/upload things via SLURM? (A rough sketch is below.)
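A rough sketch of what such a helper might look like, sharding a dataset download across a SLURM job array via the standard SLURM_ARRAY_TASK_ID / SLURM_ARRAY_TASK_COUNT variables; the repo id, local path, and sharding scheme are assumptions, not code from this PR:

```python
import os

from huggingface_hub import HfApi, hf_hub_download

# Hypothetical sketch, launched with e.g. `sbatch --array=0-31 download.sbatch`.
repo_id = "lerobot/droid"  # placeholder repo id
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
num_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))

# Round-robin shard of the repo's files for this array task.
files = HfApi().list_repo_files(repo_id, repo_type="dataset")
for filename in files[task_id::num_tasks]:
    hf_hub_download(repo_id, filename, repo_type="dataset", local_dir="/data/droid")
```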

Contributor:

We also need a scalable way of building the next release of L2D, and any pointers/examples would help.

* add: tests forcing new file creation

* fix: tests depending on various sizes, and duration is updated
@imstevenpmwork (Collaborator) commented:

I'm closing this PR as it has been superseded by #1412.

@imstevenpmwork deleted the user/rcadene/2025_04_11_dataset_v3 branch on January 6, 2026 at 22:45.

Labels

dataset: Issues regarding data inputs, processing, or datasets
enhancement: Suggestions for new features or improvements
performance: Issues aimed at improving speed or resource usage

7 participants