feat: data converter manager #73
base: main
Conversation
feat: v2.1 to v2.0
[DEV] update to filter new 3.0 fields
Code Review
This pull request introduces a flexible LeRobotDatasetVersionConverter and a set of modular pipelines for converting datasets between different versions. The design is solid, using a graph-based approach to find conversion paths and function handlers to encapsulate conversion logic. This greatly improves the maintainability and extensibility of dataset conversions.
I've identified a few critical issues that need to be addressed:
- A path traversal vulnerability in the v3.0 to v2.1 conversion script that could allow writing files outside the target directory.
- A bug in the `clone_tree` utility function that will cause conversions to fail.
- A `TypeError` in the v1.6 to v2.0 conversion pipeline due to an incorrect function call.
Additionally, I've pointed out a high-severity issue with hardcoded user-specific paths in the example script, which should be replaced with placeholders or configured via arguments to make it portable.
Once these issues are resolved, this will be a great addition to the project.
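The graph-based path finding described above can be sketched roughly as follows. This is an illustrative standalone sketch, not the PR's actual API: the `EDGES` table and `find_path` name are assumptions, with edges mirroring the pipelines the PR registers (v1.6→v2.0, v2.0↔v2.1, v2.1↔v3.0).

```python
from collections import deque

# Hypothetical edge table: each key is a dataset version, each value the
# versions a registered pipeline can convert it to directly.
EDGES = {
    "v1.6": ["v2.0"],
    "v2.0": ["v2.1"],
    "v2.1": ["v2.0", "v3.0"],
    "v3.0": ["v2.1"],
}

def find_path(start: str, goal: str) -> list[str]:
    """Breadth-first search for the shortest chain of conversions."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"No conversion path from {start} to {goal}")

print(find_path("v1.6", "v3.0"))  # -> ['v1.6', 'v2.0', 'v2.1', 'v3.0']
```

Because BFS explores shorter paths first, the converter would always apply the minimum number of intermediate pipelines.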
```python
ensure_destination(dest_root, overwrite=overwrite)
shutil.copytree(source_root, dest_root)
```
The `clone_tree` function will fail with a `FileExistsError`. The `ensure_destination` function creates the destination directory, but `shutil.copytree` requires the destination directory to not exist. Assuming you are using Python 3.8+, you can fix this by passing `dirs_exist_ok=True` to `shutil.copytree`.
Suggested change:

```diff
 ensure_destination(dest_root, overwrite=overwrite)
-shutil.copytree(source_root, dest_root)
+shutil.copytree(source_root, dest_root, dirs_exist_ok=True)
```
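A minimal, self-contained reproduction of why `dirs_exist_ok=True` is needed (the pre-created `dst` directory stands in for what `ensure_destination` does in this PR):

```python
import shutil
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
src = tmp / "src"
dst = tmp / "dst"
src.mkdir()
(src / "info.json").write_text("{}")
dst.mkdir()  # simulates ensure_destination() pre-creating the target

# Without dirs_exist_ok, copytree refuses to write into an existing directory.
try:
    shutil.copytree(src, dst)
    failed = False
except FileExistsError:
    failed = True
print("raised FileExistsError:", failed)  # -> raised FileExistsError: True

# With dirs_exist_ok=True (Python 3.8+) the copy succeeds.
shutil.copytree(src, dst, dirs_exist_ok=True)
copied = (dst / "info.json").exists()
print(copied)  # -> True
shutil.rmtree(tmp)
```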
```python
metadata_v1 = load_json(source_root / V1_INFO_PATH)
dataset = datasets.load_dataset("parquet", data_dir=source_root / "data", split="train")
features = _build_features(dataset, robot_config)
```
The call to `_build_features` passes two arguments, but the function is defined to accept only one (`dataset`). This will cause a `TypeError` when the code is executed. The `robot_config` parameter appears to be unused in `_build_features`.
Suggested change:

```diff
-features = _build_features(dataset, robot_config)
+features = _build_features(dataset)
```
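For reference, the failure mode is easy to reproduce in isolation. The `_build_features` below is a stand-in with the one-argument signature described above, not the PR's real helper:

```python
def _build_features(dataset):  # stand-in with the one-argument signature
    return {"length": len(dataset)}

dataset, robot_config = [1, 2, 3], {"type": "demo"}

# The buggy two-argument call raises immediately at runtime.
try:
    _build_features(dataset, robot_config)
except TypeError as exc:
    error = str(exc)
print(error)  # e.g. "_build_features() takes 1 positional argument but 2 were given"

# The corrected call works.
features = _build_features(dataset)
print(features)  # -> {'length': 3}
```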
```python
def _validate_video_paths(src: Path, dst: Path) -> None:
    """Validate source and destination paths to prevent security issues."""
```
This function is missing a critical security check to prevent path traversal attacks. The `video_key` used to construct destination paths is derived from `info.json` and can be controlled by a malicious actor. A `video_key` like `../../foo` could cause `ffmpeg` to write files outside the intended destination directory.
You should validate that the destination path is within a safe base directory. This requires passing the `new_root` directory down to `_validate_video_paths` and adding a check.
Here are the required changes:

1. In `convert_videos`: pass `new_root` to `_extract_video_segment`.

   ```python
   # around line 382
   _extract_video_segment(src_path, dest_path, start=start, end=end, safe_base_dir=new_root)
   ```

2. In `_extract_video_segment`: update the signature to accept `safe_base_dir` and pass it on.

   ```python
   # around line 279
   def _extract_video_segment(
       src: Path,
       dst: Path,
       start: float,
       end: float,
       *,
       safe_base_dir: Path,
   ) -> None:
       # Validate paths to prevent security issues
       _validate_video_paths(src, dst, safe_base_dir=safe_base_dir)
       # ... rest of the function
   ```

3. In `_validate_video_paths`: update the signature and add the check.

   ```python
   # around line 221
   def _validate_video_paths(src: Path, dst: Path, *, safe_base_dir: Path) -> None:
       """Validate source and destination paths to prevent security issues."""
       # ...
       try:
           src_resolved = src.resolve(strict=True)
           dst_resolved = dst.resolve()
           safe_base_dir_resolved = safe_base_dir.resolve(strict=True)
       except (OSError, FileNotFoundError) as exc:
           raise ValueError(f"Invalid path provided: {exc}") from exc

       # Security: ensure the destination path is within the safe base directory
       if not str(dst_resolved).startswith(str(safe_base_dir_resolved)):
           raise ValueError(
               f"Path traversal attempt detected. Destination '{dst_resolved}' "
               f"is outside of safe base directory '{safe_base_dir_resolved}'."
           )
       # ... rest of the function
   ```
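One caveat on a plain `str.startswith` containment check: it has a known edge case, since `/data/out-evil` starts with `/data/out`. A small standalone sketch of a stricter check (the `is_within` name and example paths are illustrative, not from the PR) using `Path.relative_to` instead:

```python
from pathlib import Path

def is_within(candidate: Path, base: Path) -> bool:
    """True if candidate resolves to a location inside base.

    Uses Path.relative_to rather than str.startswith: startswith would
    wrongly accept a sibling directory such as /data/out-evil when the
    safe base is /data/out.
    """
    try:
        candidate.resolve().relative_to(base.resolve())
        return True
    except ValueError:
        return False

base = Path("/data/out")
print(is_within(base / "videos" / "cam0.mp4", base))  # -> True
print(is_within(base / ".." / "evil.mp4", base))      # -> False (".." escapes base)
print(is_within(Path("/data/out-evil/x.mp4"), base))  # -> False (sibling directory)
```

`resolve()` normalizes `..` components before the comparison, so a crafted `video_key` cannot escape the base directory by path tricks alone.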
```python
repo_name="yihao-brain-bot/test_data_converter",
source=Path("/home/yihao/.cache/huggingface/lerobot/yihao-brain-bot/test_data_converter"),
```
The demonstration script contains hardcoded, user-specific paths. This makes the script non-portable and leaks information about the developer's environment. It's better to use command-line arguments or relative paths with placeholders so the example is runnable for other users.
Suggested change:

```diff
-repo_name="yihao-brain-bot/test_data_converter",
-source=Path("/home/yihao/.cache/huggingface/lerobot/yihao-brain-bot/test_data_converter"),
+repo_name="<your-repo-id>",
+source=Path("/path/to/your/source_dataset"),
```
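One portable alternative is to read the repo id and source path from the command line. This is a sketch only; the flag names and help strings below are illustrative, not part of the PR:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Parse the repo id and source path instead of hardcoding them."""
    parser = argparse.ArgumentParser(
        description="Convert a LeRobot dataset between versions."
    )
    parser.add_argument("--repo-name", required=True,
                        help="dataset repo id, e.g. <user>/<dataset>")
    parser.add_argument("--source", type=Path, required=True,
                        help="path to the source dataset root")
    return parser.parse_args(argv)

# Passing an explicit argv keeps the example runnable anywhere.
args = parse_args(["--repo-name", "user/demo-dataset", "--source", "/tmp/demo"])
print(args.repo_name, args.source)  # -> user/demo-dataset /tmp/demo
```

With `argv=None`, `parse_args` falls back to `sys.argv`, so the same function works unchanged when the script is invoked from a shell.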
Force-pushed d971f94 to 4857706
Force-pushed 45ad9ff to 4e7c436
What does this PR do?
- Adds `LeRobotDatasetVersionConverter` plus modular local pipelines so any dataset can be converted purely on disk. Each pipeline accepts `source_root`/`dest_root`, never mutates the source unless explicitly allowed, and no longer depends on legacy `lerobot.common` modules.
- Adds a `local_pipelines` package (`v16→v20`, `v20↔v21`, `v21→v20` filtered, `v21→v30`, `v30→v21`) using current `lerobot.datasets` APIs, legacy JSONL writers, and reusable helpers.
- Adds a demo script (`ds_version_convert/convert.py`) that demonstrates calling the converter with a `DatasetContext`.

Outdated/removed pieces in upstream `lerobot` we had to replace:

- `lerobot.common.datasets.*` and `lerobot.common.robot_devices.*` packages (replaced with `lerobot.datasets.*` and `lerobot.robots.*`).
- `lerobot.datasets.v21.convert_dataset_v20_to_v21` (no longer exists; version constants now defined locally).
- `write_episode_stats`, `write_jsonlines`, `get_video_size_in_mb`, `EPISODES_STATS_PATH`, `DEFAULT_PARQUET_PATH`, `TASKS_PATH`, `EPISODES_PATH` (replaced with local implementations or `LEGACY_*` constants).
- `get_image_pixel_channels` helper from `lerobot.datasets.video_utils` (removed; we now read shapes directly from metadata).