8000 [DCP] Always flatten mapping even if no tensors present by fegin · Pull Request #125335 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[DCP] Always flatten mapping even if no tensors present #125335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

fegin
Copy link
Contributor
@fegin fegin commented May 1, 2024

Stack from ghstack (oldest at bottom):

Summary:
Right now DCP only flatten a mapping (e.g., dict) if that mapping has tensor objects. This behavior is odd as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is dataloader state_dict.

We may also want to do so for a list/tuple. But this will cause extra pickles. So we don't do this for now.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 8000 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC

[ghstack-poisoned]
Copy link
pytorch-bot bot commented May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125335

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit ffcee65 with merge base 746da87 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@fegin fegin changed the title [DCP] Always unflatten containers even if no tensors present [DCP] Always unflatten mapping even if no tensors present May 1, 2024
@fegin fegin changed the title [DCP] Always unflatten mapping even if no tensors present [DCP] Always flatten mapping even if no tensors present May 2, 2024
[ghstack-poisoned]
Copy link
Contributor
@LucasLLC LucasLLC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor
@wz337 wz337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fegin
Copy link
Contributor Author
fegin commented May 7, 2024

@pytorchbot merge -f "The failing tests are not related"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
…125337)

Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
mvpatel2000 pushed a commit to mvpatel2000/pytorch that referenced this pull request May 17, 2024
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
atalman added a commit that referenced this pull request May 27, 2024
* [DSD] Correctly handle _extra_state (#125336)

Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335

* lint

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
antoinebrl pushed a commit to antoinebrl/pytorch that referenced this pull request May 27, 2024
…ytorch#125337)

Summary:
Fixes pytorch#122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
huydhn pushed a commit that referenced this pull request May 27, 2024
…125337) (#127219)

* [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

* lintrunner

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
@github-actions github-actions bot deleted the gh/fegin/232/head branch June 7, 2024 01:55
fegin added a commit that referenced this pull request Aug 21, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: f8ff2a6
Pull Request resolved: #134158
fegin added a commit that referenced this pull request Aug 21, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: 1adbc53
Pull Request resolved: #134158
fegin added a commit that referenced this pull request Aug 26, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: f207aed
Pull Request resolved: #134158
pytorchmergebot pushed a commit that referenced this pull request Aug 28, 2024
… before 2.4 (#134158)

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes #133923

Pull Request resolved: #134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
… before 2.4 (pytorch#134158)

The original DCP doesn't flattening all the containers, which can cause issues, pytorch#125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes pytorch#133923

Pull Request resolved: pytorch#134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0