feat: data converter manager by bingogome · Pull Request #73 · Tavish9/any4lerobot · GitHub

Conversation

@bingogome
Contributor

What does this PR do?

  • Introduces LeRobotDatasetVersionConverter plus modular local pipelines so any dataset can be converted purely on disk. Each pipeline accepts source_root/dest_root, never mutates the source unless explicitly allowed, and no longer depends on legacy lerobot.common modules.
  • The converter searches for a route from the dataset's current version to the target version; a route may take multiple steps (see the route-search sketch after this list). The original local dataset is never modified.
  • Refactors all existing conversion flows into the local_pipelines package (v16→v20, v20↔v21, v21→v20 filtered, v21→v30, v30→v21) using current lerobot.datasets APIs, legacy JSONL writers, and reusable helpers.
  • Adds a simple entry script (ds_version_convert/convert.py) that demonstrates calling the converter with a DatasetContext.
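
As a mental model, the route search amounts to a shortest-path lookup over the registered pipelines. Here is a minimal breadth-first-search sketch of the idea (the PIPELINES registry and find_route function are illustrative, not the PR's actual API):

    from collections import deque

    # Hypothetical registry mapping (source_version, target_version) edges to
    # pipeline names; the PR's real pipelines live in the local_pipelines package.
    PIPELINES = {
        ("v1.6", "v2.0"): "v16_to_v20",
        ("v2.0", "v2.1"): "v20_to_v21",
        ("v2.1", "v2.0"): "v21_to_v20_filtered",
        ("v2.1", "v3.0"): "v21_to_v30",
        ("v3.0", "v2.1"): "v30_to_v21",
    }

    def find_route(src: str, dst: str) -> list:
        """Breadth-first search over the conversion graph; returns a shortest edge list."""
        queue = deque([(src, [])])
        seen = {src}
        while queue:
            version, path = queue.popleft()
            if version == dst:
                return path
            for a, b in PIPELINES:
                if a == version and b not in seen:
                    seen.add(b)
                    queue.append((b, path + [(a, b)]))
        raise ValueError(f"No conversion route from {src} to {dst}")

    # e.g. find_route("v3.0", "v2.0") returns [("v3.0", "v2.1"), ("v2.1", "v2.0")],
    # matching the multi-step v3.0 -> v2.0 route validated in this PR.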

Testing note: I’ve only been able to validate the v3.0 → v2.0 route locally; the other paths still need smoke testing with real datasets. Because the lerobot code base evolves quickly, please always keep backups of your datasets. Some of the conversions can be VERY brittle, as they tend to depend on deprecated code (some of which I've listed below).

Outdated/removed pieces in upstream lerobot we had to replace

  • lerobot.common.datasets.* and lerobot.common.robot_devices.* packages (replaced with lerobot.datasets.* and lerobot.robots.*).
  • lerobot.datasets.v21.convert_dataset_v20_to_v21 (no longer exists; version constants now defined locally).
  • write_episode_stats, write_jsonlines, get_video_size_in_mb, EPISODES_STATS_PATH, DEFAULT_PARQUET_PATH, TASKS_PATH, EPISODES_PATH (replaced with local implementations or LEGACY_* constants; a minimal example shim appears after this list).
  • get_image_pixel_channels helper from lerobot.datasets.video_utils (removed; we now read shapes directly from metadata).
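
As an illustration, a local stand-in for the removed write_jsonlines helper can be as small as the following sketch (hypothetical; the PR's actual shims may differ):

    import json
    from collections.abc import Iterable
    from pathlib import Path

    def write_jsonlines(rows: Iterable[dict], path: Path) -> None:
        """Minimal replacement for the removed helper: one JSON object per line."""
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")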

@bingogome changed the title from "Feat data converter" to "feat: data converter manager" on Nov 13, 2025
Contributor
@gemini-code-assist bot left a comment


Code Review

This pull request introduces a flexible LeRobotDatasetVersionConverter and a set of modular pipelines for converting datasets between different versions. The design is solid, using a graph-based approach to find conversion paths and function handlers to encapsulate conversion logic. This greatly improves the maintainability and extensibility of dataset conversions.

I've identified a few critical issues that need to be addressed:

  • A path traversal vulnerability in the v3.0 to v2.1 conversion script that could allow writing files outside the target directory.
  • A bug in the clone_tree utility function that will cause conversions to fail.
  • A TypeError in the v1.6 to v2.0 conversion pipeline due to an incorrect function call.

Additionally, I've pointed out a high-severity issue with hardcoded user-specific paths in the example script, which should be replaced with placeholders or configured via arguments to make it portable.

Once these issues are resolved, this will be a great addition to the project.

Comment on lines +18 to +19
ensure_destination(dest_root, overwrite=overwrite)
shutil.copytree(source_root, dest_root)

critical

The clone_tree function will fail with a FileExistsError. The ensure_destination function creates the destination directory, but shutil.copytree requires the destination directory to not exist. Assuming you are using Python 3.8+, you can fix this by using the dirs_exist_ok=True argument in shutil.copytree.

Suggested change
  ensure_destination(dest_root, overwrite=overwrite)
- shutil.copytree(source_root, dest_root)
+ shutil.copytree(source_root, dest_root, dirs_exist_ok=True)
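
Worth noting: dirs_exist_ok was added to shutil.copytree in Python 3.8 and simply merges into an existing directory. The alternative would be for ensure_destination to clear the target without recreating it, letting copytree create dest_root itself.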


metadata_v1 = load_json(source_root / V1_INFO_PATH)
dataset = datasets.load_dataset("parquet", data_dir=source_root / "data", split="train")
features = _build_features(dataset, robot_config)

critical

The call to _build_features passes two arguments, but the function is defined to only accept one (dataset). This will cause a TypeError when the code is executed. The robot_config parameter seems to be unused in _build_features.

Suggested change
- features = _build_features(dataset, robot_config)
+ features = _build_features(dataset)
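
For context, a single-argument _build_features consistent with the description above might look like this (a hypothetical sketch; the PR's actual helper may derive richer feature specs):

    from datasets import Dataset

    def _build_features(dataset: Dataset) -> dict:
        # Hypothetical one-argument shape: derive a per-column feature spec
        # from the Hugging Face dataset schema alone, with no robot_config.
        return {
            name: {"dtype": getattr(feature, "dtype", type(feature).__name__)}
            for name, feature in dataset.features.items()
        }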

Comment on lines +221 to +222
def _validate_video_paths(src: Path, dst: Path) -> None:
"""Validate source and destination paths to prevent security issues."""

critical

This function is missing a critical security check to prevent path traversal attacks. The video_key used to construct destination paths is derived from info.json and can be controlled by a malicious actor. A video_key like ../../foo could cause ffmpeg to write files outside of the intended destination directory.

You should validate that the destination path is within a safe base directory. This requires passing the new_root directory down to _validate_video_paths and adding a check.

Here are the required changes:

  1. In convert_videos: pass new_root to _extract_video_segment.

    # around line 382
    _extract_video_segment(src_path, dest_path, start=start, end=end, safe_base_dir=new_root)
  2. In _extract_video_segment: update signature to accept safe_base_dir and pass it on.

    # around line 279
    def _extract_video_segment(
        src: Path,
        dst: Path,
        start: float,
        end: float,
        *,
        safe_base_dir: Path,
    ) -> None:
        # Validate paths to prevent security issues
        _validate_video_paths(src, dst, safe_base_dir=safe_base_dir)
        # ... rest of the function
  3. In _validate_video_paths: update signature and add the check.

    # around line 221
    def _validate_video_paths(src: Path, dst: Path, *, safe_base_dir: Path) -> None:
        """Validate source and destination paths to prevent security issues."""

        # ...
        try:
            src_resolved = src.resolve(strict=True)
            dst_resolved = dst.resolve()
            safe_base_dir_resolved = safe_base_dir.resolve(strict=True)
        except OSError as exc:  # FileNotFoundError is a subclass of OSError
            raise ValueError(f"Invalid path provided: {exc}") from exc

        # Security: ensure the destination resolves inside the safe base directory.
        # (A plain string-prefix check can be fooled by sibling directories such
        # as "/data/safe2" vs "/data/safe", so compare whole path components.)
        try:
            dst_resolved.relative_to(safe_base_dir_resolved)
        except ValueError as exc:
            raise ValueError(
                f"Path traversal attempt detected. Destination '{dst_resolved}' "
                f"is outside of safe base directory '{safe_base_dir_resolved}'."
            ) from exc

        # ... rest of the function

Comment on lines +27 to +28
repo_name="yihao-brain-bot/test_data_converter",
source=Path("/home/yihao/.cache/huggingface/lerobot/yihao-brain-bot/test_data_converter"),

high

The demonstration script contains hardcoded, user-specific paths. This makes the script non-portable and leaks information about the developer's environment. It's better to use command-line arguments (see the sketch below) or placeholder paths so the example is runnable for other users.

Suggested change
- repo_name="yihao-brain-bot/test_data_converter",
- source=Path("/home/yihao/.cache/huggingface/lerobot/yihao-brain-bot/test_data_converter"),
+ repo_name="<your-repo-id>",
+ source=Path("/path/to/your/source_dataset"),

@bingogome force-pushed the feat-data-converter branch 2 times, most recently from 45ad9ff to 4e7c436 on December 1, 2025 at 17:58