8000 [DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 by fegin · Pull Request #134158 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 #134158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

fegin
Copy link
Contributor
@fegin fegin commented Aug 21, 2024

Stack from ghstack (oldest at bottom):

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

  1. DCP should record version in the metadata.
  2. DCP should have a nice way to load old state_dict.
  3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes #133923

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

[ghstack-poisoned]
Copy link
pytorch-bot bot commented Aug 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134158

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 0da4cf7 with merge base 2588b5e (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 21, 2024
fegin added a commit that referenced this pull request Aug 21, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: f8ff2a6
Pull Request resolved: #134158
@fegin fegin requested review from wz337, LucasLLC and pradeepfn August 21, 2024 21:58
@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request ciflow/perio 10000 dic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Aug 21, 2024
Copy link
Contributor
@wz337 wz337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

[ghstack-poisoned]
fegin added a commit that referenced this pull request Aug 21, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: 1adbc53
Pull Request resolved: #134158
old_keys = set(old_state_dict.keys())
if old_keys & missing_keys:
self.state_dict, self.mappings = old_state_dict, old_mappings
_version._act_like_version = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why we set the

_version._act_like_version = 2.3 and then set the same to None at the end.
Isn't that a noop ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the comment.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Aug 26, 2024
… before 2.4

The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

ghstack-source-id: f207aed
Pull Request resolved: #134158
@fegin fegin requested a review from pradeepfn August 27, 2024 06:59
Copy link
Contributor
@pradeepfn pradeepfn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we prevent BC breakages in future. Do you have insights on how to have a test suite that save with old code/load with new code for DCP code changes.

@fegin
Copy link
Contributor Author
fegin commented Aug 28, 2024

@pytorchbot merge -f "The linter error is not from this PR."

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@fegin
Copy link
Contributor Author
fegin commented Aug 28, 2024

How can we prevent BC breakages in future. Do you have insights on how to have a test suite that save with old code/load with new code for DCP code changes.

I think we should add some tests that generate checkpoints for 2.3. And this checkpoints shouldn't be changed (maybe use checksum to verify) regardless how the code base changes. We can then use this golden sample to verify if we can always load the golden sample back.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
… before 2.4 (pytorch#134158)

The original DCP doesn't flattening all the containers, which can cause issues, pytorch#125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes pytorch#133923

Pull Request resolved: pytorch#134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
@github-actions github-actions bot deleted the gh/fegin/287/head branch October 2, 2024 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0