[DCP] Introduce process based async checkpointing #147039
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147039
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 886b352 with merge base d260d4f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D69272583
Looks good! Please fix the Lint errors before landing.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
A bit late to the party but I have a couple of quick comments:
Thanks @teja-rao. Re 1: We didn't want to enable it by default. We kept the thread-based implementation as the default and made the process-based one injectable (DI), so folks don't suddenly see process-based async checkpointing. Re 2: Ack on this, thanks for the feedback. Will follow up with the changes.
Summary:
Context
Background checkpoint upload thread interfering with trainer thread:
In the async save API, the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU utilization to suffer significantly for the E2E checkpoint duration.
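For context, here is a minimal sketch of how the existing thread-based async save is typically driven from a training loop. The loop itself is illustrative; only `torch.distributed.checkpoint.async_save` is from the library, and it assumes `torch.distributed` has already been initialized.

```python
import torch.distributed.checkpoint as dcp

def train_loop(model, optimizer, num_steps, save_every, ckpt_dir):
    # Illustrative trainer loop; checkpoint paths and cadence are assumptions.
    ckpt_future = None
    for step in range(num_steps):
        # ... forward / backward / optimizer.step() on the trainer thread ...
        if step % save_every == 0:
            if ckpt_future is not None:
                # Wait for the previous async save before starting a new one.
                ckpt_future.result()
            state_dict = {
                "model": model.state_dict(),
                "optim": optimizer.state_dict(),
            }
            # async_save stages the state dict and then finishes the save from a
            # background thread; on rank0 that thread also pickles/unpickles
            # SavePlans during the planning collective, contending with the
            # trainer thread for the GIL.
            ckpt_future = dcp.async_save(
                state_dict, checkpoint_id=f"{ckpt_dir}/step_{step}"
            )
```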
Solution:
Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and can serve async checkpoint requests for the remainder of the training lifetime.
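To make the idea concrete, below is an illustrative-only sketch of a process-based async saver; the class and function names are assumptions and this is not the code added by this PR. A real implementation also has to stage tensors to CPU/shared memory and coordinate the planning collective across ranks, which the sketch omits.

```python
import multiprocessing as mp
import torch.distributed.checkpoint as dcp

def _checkpoint_daemon(request_queue, response_queue):
    # Runs inside the daemon process, with its own interpreter and GIL,
    # so SavePlan pickling no longer competes with the trainer thread.
    while True:
        request = request_queue.get()
        if request is None:  # shutdown sentinel
            break
        state_dict, checkpoint_id = request
        # Simplified: no cross-rank coordination or shared-memory staging here.
        dcp.save(state_dict, checkpoint_id=checkpoint_id)
        response_queue.put(checkpoint_id)

class ProcessBasedAsyncSaver:
    """Hypothetical helper: spawns the daemon once, then reuses it."""

    def __init__(self):
        self._ctx = mp.get_context("spawn")
        self._requests = self._ctx.Queue()
        self._responses = self._ctx.Queue()
        self._daemon = None

    def async_save(self, state_dict, checkpoint_id):
        if self._daemon is None:
            # Created once, on the first save attempt, and reused for the
            # remainder of training.
            self._daemon = self._ctx.Process(
                target=_checkpoint_daemon,
                args=(self._requests, self._responses),
                daemon=True,
            )
            self._daemon.start()
        self._requests.put((state_dict, checkpoint_id))

    def shutdown(self):
        if self._daemon is not None:
            self._requests.put(None)
            self._daemon.join()
```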
Test Plan: Added E2E UTs for process-based async save.
Differential Revision: D69272583
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @mhorowitz @pradeepfn @ekr0