[DSD] Fix to remove non_persistent buffer in distributed state dict by fegin · Pull Request #125337 · pytorch/pytorch


Closed
fegin wants to merge 3 commits from gh/fegin/234/head

Conversation

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels May 1, 2024
pytorch-bot bot commented May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125337

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit bab5c42 with merge base 746da87:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fegin fegin requested a review from wz337 May 1, 2024 21:31
@fegin fegin added the ciflow/trunk and ciflow/periodic labels May 1, 2024
@fegin fegin requested review from awgu and LucasLLC May 1, 2024 21:32
[ghstack-poisoned]
"dont_save_me", torch.rand(100, device="cuda"), persistent=False
)
ddp_model = DDP(copy.deepcopy(model))
set_model_state_dict(ddp_model, get_model_state_dict(ddp_model))
Collaborator
IIUC, set_model_state_dict(module, get_model_state_dict(module)) should be a no-op. Is this just testing that set_model_state_dict() does not error?

Contributor Author

Yes, this just ensures that set_model_state_dict() does not error when there is a non-persistent buffer. The actual value comparison against the single-rank model is done below.
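
For context, the behavior under test is standard nn.Module semantics: a buffer registered with persistent=False is tracked by the module and returned by named_buffers(), but excluded from state_dict(). A minimal single-process sketch (plain CPU module, no DDP; the Model class here is illustrative only):

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # Tracked by the module, but never written to state_dict().
        self.register_buffer("dont_save_me", torch.rand(4), persistent=False)

m = Model()
print("dont_save_me" in m.state_dict())                 # False
print("dont_save_me" in dict(m.named_buffers()))        # True
print("dont_save_me" in m._non_persistent_buffers_set)  # True

The PR's test then checks that the distributed state dict APIs tolerate this asymmetry instead of erroring on the missing key.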

@@ -215,6 +215,8 @@ def recurse(module: nn.Module, curr_fqn: str) -> Generator:
     for name, obj in chain(
         module.named_buffers(recurse=False), module.named_parameters(recurse=False)
     ):
+        if name in module._non_persistent_buffers_set:
+            continue
Collaborator

I might have missed some discussion. Could you remind me why we use named_buffers() rather than some logic that relies only on the keys in the state dict itself?

Contributor Author

Relying on the state dict itself would trigger an all-gather for FSDP. Since many users still use FSDP rather than FSDP2, we have to ensure there is no performance penalty for this API.

Collaborator

I see. The issue is that both full and sharded state dicts trigger an all-gather?

Contributor Author

yes
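
To summarize the thread: the traversal works from named_buffers()/named_parameters() plus the module's _non_persistent_buffers_set rather than from the state dict's own keys, because materializing the state dict under FSDP (full or sharded) triggers an all-gather. A simplified sketch of the traversal shape being patched (iter_fqns is a hypothetical stand-in for the internal recursion, not the actual DSD code):

from itertools import chain
from typing import Generator

import torch.nn as nn

def iter_fqns(module: nn.Module, curr_fqn: str = "") -> Generator:
    # Walk the module tree and yield (FQN, tensor) pairs without ever
    # calling state_dict(), which would all-gather under FSDP.
    prefix = f"{curr_fqn}." if curr_fqn else ""
    for child_name, child in module.named_children():
        yield from iter_fqns(child, f"{prefix}{child_name}")
    for name, obj in chain(
        module.named_buffers(recurse=False), module.named_parameters(recurse=False)
    ):
        if name in module._non_persistent_buffers_set:
            continue  # the fix: these names never appear in state_dict()
        yield f"{prefix}{name}", obj

With the skip in place, the FQNs produced here line up with the keys state_dict() actually returns, so set_model_state_dict() no longer looks up a key that does not exist.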

[ghstack-poisoned]
fegin commented May 7, 2024

@pytorchbot merge -f "The failing tests are not related."

pytorchmergebot commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

antoinebrl pushed a commit to antoinebrl/pytorch that referenced this pull request May 27, 2024
[DSD] Fix to remove non_persistent buffer in distributed state dict (pytorch#125337)

Summary:
Fixes pytorch#122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
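
For reference, a hedged end-to-end sketch of the scenario the fix unblocks, adapted from the PR's unit test but shrunk to a single-rank CPU process group (gloo backend, fixed port; the real test uses multiple ranks and CUDA):

import copy
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict,
    set_model_state_dict,
)
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-rank process group so DDP and the DSD APIs can run locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(4, 4)
model.register_buffer("dont_save_me", torch.rand(4), persistent=False)
ddp_model = DDP(copy.deepcopy(model))

# Before this fix, the round trip raised a missing-key error because the
# FQN traversal yielded "dont_save_me", which has no state_dict entry.
set_model_state_dict(ddp_model, get_model_state_dict(ddp_model))

dist.destroy_process_group()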
huydhn pushed a commit that referenced this pull request May 27, 2024
[DSD] Fix to remove non_persistent buffer in distributed state dict (#125337) (#127219)

* [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

* lintrunner

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
@github-actions github-actions bot deleted the gh/fegin/234/head branch June 7, 2024 01:55