-
Notifications
You must be signed in to change notification settings - Fork 24.2k
[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 #134158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134158
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 Unrelated Failures)As of commit 0da4cf7 with merge base 2588b5e ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
… before 2.4 The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries. Unfortunately, it breaks the checkpoints that are saved before 2.4. This also shows some issues of the DCP: 1. DCP should record version in the metadata. 2. DCP should have a nice way to load old state_dict. 3. DCP should unflatten all containers (map, list) not just map. This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future. ghstack-source-id: f8ff2a6 Pull Request resolved: #134158
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… before 2.4 The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries. Unfortunately, it breaks the checkpoints that are saved before 2.4. This also shows some issues of the DCP: 1. DCP should record version in the metadata. 2. DCP should have a nice way to load old state_dict. 3. DCP should unflatten all containers (map, list) not just map. This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future. ghstack-source-id: 1adbc53 Pull Request resolved: #134158
old_keys = set(old_state_dict.keys()) | ||
if old_keys & missing_keys: | ||
self.state_dict, self.mappings = old_state_dict, old_mappings | ||
_version._act_like_version = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why we set the
_version._act_like_version = 2.3 and then set the same to None at the end.
Isn't that a noop ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the comment.
… before 2.4 The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries. Unfortunately, it breaks the checkpoints that are saved before 2.4. This also shows some issues of the DCP: 1. DCP should record version in the metadata. 2. DCP should have a nice way to load old state_dict. 3. DCP should unflatten all containers (map, list) not just map. This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future. ghstack-source-id: f207aed Pull Request resolved: #134158
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we prevent BC breakages in future. Do you have insights on how to have a test suite that save with old code/load with new code for DCP code changes.
@pytorchbot merge -f "The linter error is not from this PR." |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
I think we should add some tests that generate checkpoints for 2.3. And this checkpoints shouldn't be changed (maybe use checksum to verify) regardless how the code base changes. We can then use this golden sample to verify if we can always load the golden sample back. |
… before 2.4 (pytorch#134158) The original DCP doesn't flattening all the containers, which can cause issues, pytorch#125335 intends to solve the issue by flattening all the dictionaries. Unfortunately, it breaks the checkpoints that are saved before 2.4. This also shows some issues of the DCP: 1. DCP should record version in the metadata. 2. DCP should have a nice way to load old state_dict. 3. DCP should unflatten all containers (map, list) not just map. This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future. @pradeepfn Please let me know if this summary matches our discussion. Fixes pytorch#133923 Pull Request resolved: pytorch#134158 Approved by: https://github.com/wz337, https://github.com/pradeepfn
Stack from ghstack (oldest at bottom):
The original DCP doesn't flattening all the containers, which can cause issues, #125335 intends to solve the issue by flattening all the dictionaries.
Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:
This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.
@pradeepfn Please let me know if this summary matches our discussion.
Fixes #133923
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn