[DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used #148825

fegin · 2025-03-08T19:38:27Z

Stack from ghstack (oldest at bottom):

-> [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used #148825

Summary:
As title.

cc @H-Huang @awgu @kwen2501 @wanchaol @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2025-03-08T19:38:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148825


✅ No Failures

As of commit 4537063 with merge base 915eb01:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 · 2025-03-10T15:49:46Z

torch/distributed/checkpoint/state_dict.py

+                if fqn in info.shared_params_mapping:
+                    in_params = False
+                    for k in param_group.keys():
+                        if k == _PARAMS:
+                            continue
+                        flatten_key = f"{_PG}.{fqn}.{k}"
+                        if flatten_key in state_dict:
+                            in_params = True
+                        break
+                else:
+                    in_params = True
+
+                if not in_params:
+                    continue
+


nit: shall we add some comments to the code?
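
For readers following the thread, the check in the diff above can be sketched in isolation as commented Python. This is a minimal illustration, not the code in `torch/distributed/checkpoint/state_dict.py`: the helper name `fqn_is_used`, the string values of `_PG` and `_PARAMS`, the set `shared_fqns` (standing in for `info.shared_params_mapping`), and the example FQNs are all assumptions made for the example.

```python
# Assumed values; the diff only shows the names _PG and _PARAMS.
_PG = "param_groups"
_PARAMS = "params"


def fqn_is_used(fqn, shared_fqns, param_group, state_dict):
    """Return True if `fqn` should be kept when rebuilding a param group.

    Hypothetical helper mirroring the `in_params` logic in the diff.
    """
    if fqn not in shared_fqns:
        # A parameter that is not shared has a single FQN, so it is always kept.
        return True
    for k in param_group:
        if k == _PARAMS:
            continue
        # A shared parameter has several FQNs, but only one of them appears in
        # the flattened state_dict. Like the diff, probe only the first
        # non-"params" key and stop.
        return f"{_PG}.{fqn}.{k}" in state_dict
    # The group has no key other than "params", so there is nothing to probe.
    return False


# Two FQNs alias one tied weight; only "tok_embeddings.weight" was saved.
param_group = {"params": [0, 1], "lr": 0.1, "weight_decay": 0.0}
shared_fqns = {"tok_embeddings.weight", "output.weight"}
state_dict = {
    "param_groups.tok_embeddings.weight.lr": 0.1,
    "param_groups.tok_embeddings.weight.weight_decay": 0.0,
    "param_groups.layers.0.weight.lr": 0.1,
    "param_groups.layers.0.weight.weight_decay": 0.0,
}
candidates = ["tok_embeddings.weight", "output.weight", "layers.0.weight"]
kept = [
    f for f in candidates
    if fqn_is_used(f, shared_fqns, param_group, state_dict)
]
print(kept)
```

In this sketch the alias `output.weight` is skipped because none of its flattened keys exist in `state_dict`, which is the situation the `continue` in the diff appears intended to handle.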

fduwjj

unblock

fegin · 2025-03-10T17:23:16Z

@pytorchbot merge

pytorchmergebot · 2025-03-10T17:25:00Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

…state_dict (#148918) Summary: Fixes #140898 Pull Request resolved: #148918 Approved by: https://github.com/fduwjj, https://github.com/mori360 ghstack dependencies: #148825

Update

b96ab04

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and release notes: distributed (checkpoint) labels Mar 8, 2025

fegin added a commit that referenced this pull request Mar 8, 2025

[DSD] Fix the shared parameter mismatch for optimizer state_dict when…

b7f469c

… flattening FQNs are used Summary: As title. ghstack-source-id: 3c05da1 Pull Request resolved: #148825

Update

70a96e3

[ghstack-poisoned]

fegin added a commit that referenced this pull request Mar 8, 2025

[DSD] Fix the shared parameter mismatch for optimizer state_dict when…

3e7665d

… flattening FQNs are used Summary: As title. ghstack-source-id: 73f670d Pull Request resolved: #148825

fegin mentioned this pull request Mar 8, 2025

Support Gemma2 in torchtitan pytorch/torchtitan#594

Closed

fegin requested a review from mori360 March 8, 2025 22:39

fegin added the ciflow/trunk label Mar 10, 2025

kwen2501 reviewed Mar 10, 2025

fduwjj approved these changes Mar 10, 2025

Update

4537063

[ghstack-poisoned]

fegin added a commit that referenced this pull request Mar 10, 2025

[DSD] Fix the shared parameter mismatch for optimizer state_dict when…

a01b9c7

… flattening FQNs are used Summary: As title. ghstack-source-id: 7f12689 Pull Request resolved: #148825

pytorchmergebot added the merging label Mar 10, 2025

mori360 approved these changes Mar 10, 2025

pytorchmergebot added the Merged label Mar 10, 2025

pytorchmergebot closed this in ed969d1 Mar 10, 2025

pytorchmergebot removed the merging label Mar 10, 2025

yzhangcs mentioned this pull request Mar 14, 2025

Cannot resumed training from compiled model: Missing key in checkpoint state_dict: model.layers.0.attention.wq.weight. pytorch/torchtitan#961

Closed

github-actions bot deleted the gh/fegin/298/head branch April 12, 2025 02:11

flxst mentioned this pull request Jul 8, 2025

Migration to latest versions of torch & flash-attn to solve warmstart/fsdp2/weight tying problem Modalities/modalities#384

Draft

6 tasks
