[ca][dynamo] always run eager checkpoint region's recomputation in eager #153300
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153300
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit a2844d7 with merge base 20ba8fe:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Starting merge as part of PR stack under #152735
Starting merge as part of PR stack under #152689
Pull Request resolved: #152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
@pytorchbot revert -m "Looks like it breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/fa8543454ab5d3deda1d15c1f8d24e9ebe14f340/1?per_page=50&name_filter=slow%20%2F%20linux-jammy-rocm&mergeEphemeralLF=true" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit b297e01. Reverted #152689 on behalf of https://github.com/malfet due to: Looks like it breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/fa8543454ab5d3deda1d15c1f8d24e9ebe14f340/1?per_page=50&name_filter=slow%20%2F%20linux-jammy-rocm&mergeEphemeralLF=true
Revert "[ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)". This reverts commit 4863e5c. Reverted #153300 on behalf of https://github.com/malfet due to: Looks like it breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/fa8543454ab5d3deda1d15c1f8d24e9ebe14f340/1?per_page=50&name_filter=slow%20%2F%20linux-jammy-rocm&mergeEphemeralLF=true
@xmfan your PR has been successfully reverted.
The 4 checkpointing
Starting merge as part of PR stack under #152735
Pull Request resolved: #152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
The C++ Reducer is silently incorrect under CA: its implementation no-ops the collective. I'm guessing it was no-op'd because, in DDP + python reducer, the C++ reducer is still being initialized.
Pull Request resolved: #152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
Stack from ghstack (oldest at bottom):
I slap disable on the recomputation hook; otherwise the partitioner may save fewer or more activations than the eager count that checkpoint expects, causing a mismatch. See the code comment
Note: [compiled autograd and checkpoint unpack hook].
This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.
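For illustration, here is a minimal, self-contained sketch of the idea (not PyTorch's internal checkpoint implementation): the unpack hook of a `saved_tensors_hooks` pair is wrapped in `torch._dynamo.disable`, so whatever it does at backward time runs in eager even if the backward is being traced. The `pack`/`unpack` helpers and the `storage` dict below are made up for the example.

```python
import torch
import torch._dynamo

# Hypothetical stand-in for the checkpoint frame's saved-tensor storage.
storage = {}

def pack(t):
    key = len(storage)
    storage[key] = t.detach().clone()
    return key

# Disabling the unpack hook keeps its body out of any dynamo / compiled
# autograd trace; in real checkpointing this is where recomputation runs,
# so the number of recomputed activations matches what eager expects.
@torch._dynamo.disable
def unpack(key):
    return storage[key]

x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()
print(x.grad)  # 2 * x
```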
This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for
TestAutograd.test_checkpointing_without_reentrant_custom_function_works
. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some hooks raising _StopRecomputationError and others raising the same count-mismatch error as CA.
FIXES #127115
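As a rough repro sketch of the scenario this PR targets (assuming the usual compiled autograd setup via `torch._dynamo.compiled_autograd.enable`; the backend choice and the toy function `gn` are arbitrary, and this is not the PR's actual test): non-reentrant checkpointing inside a compiled function, with the backward run under CA so the unpack hooks trigger recomputation.

```python
import torch
from torch._dynamo import compiled_autograd
from torch.utils.checkpoint import checkpoint

def gn(x):
    return torch.sin(torch.cos(x))

@torch.compile(backend="aot_eager")
def fn(x):
    # Non-reentrant checkpointing inside a compiled region; the unpack
    # hooks fire during backward when the region is recomputed.
    return checkpoint(gn, x, use_reentrant=False)

def compiler_fn(gm):
    return torch.compile(gm, backend="aot_eager")

x = torch.randn(8, requires_grad=True)

# Run the backward under compiled autograd (CA). With this PR, the
# recomputation triggered by the unpack hooks runs in eager.
with compiled_autograd.enable(compiler_fn):
    fn(x).sum().backward()

print(x.grad.shape)
```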
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov