[DCP] Introduce process based async checkpointing by meetv18 · Pull Request #147039 · pytorch/pytorch

[DCP] Introduce process based async checkpointing #147039


Closed · wants to merge 1 commit from the export-D69272583 branch

Conversation

@meetv18 (Contributor) commented Feb 12, 2025

Summary:

Context

Background checkpoint upload thread interfering with the trainer thread:

In the async save API, the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation. This asymmetric computation contends heavily with the trainer thread for the GIL, causing GPU utilization to drop significantly for the E2E checkpoint duration.
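For reference, a minimal sketch of how the thread-based async save is typically invoked (assuming a distributed process group is already initialized, e.g. via torchrun; `app_state` and the checkpoint path are placeholders):

```python
import torch
import torch.distributed.checkpoint as dcp

# Placeholder state; in practice this holds model/optimizer state dicts.
app_state = {"step": 100, "weights": torch.zeros(4)}

future = dcp.async_save(app_state, checkpoint_id="/tmp/ckpt/step_100")
# Training continues here while a background thread stages and uploads the
# checkpoint; on rank0 that thread also pickles/unpickles SavePlans, which is
# where the GIL contention described above comes from.
future.result()  # block only when the previous async checkpoint must complete
```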

Solution:

Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and serves async checkpoint requests for the remainder of the training lifetime.
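The daemon-process pattern described above can be sketched roughly as follows (an illustrative sketch with hypothetical names, not the code added by this PR): a single worker process is spawned lazily on the first save and reused for every later request, keeping the CPU-bound planning/pickling work off the trainer process entirely.

```python
import multiprocessing as mp

def _checkpoint_worker(request_queue, response_queue):
    # Runs inside the long-lived daemon process for the lifetime of training.
    while True:
        request = request_queue.get()
        if request is None:                # shutdown sentinel
            break
        state_dict, checkpoint_id = request
        # ... plan, pickle/exchange SavePlans, and write the checkpoint here ...
        response_queue.put(checkpoint_id)

class ProcessBasedAsyncCheckpointer:       # hypothetical name
    def __init__(self):
        self._proc = None

    def _ensure_started(self):
        if self._proc is not None:
            return
        ctx = mp.get_context("spawn")      # spawn avoids forking CUDA state
        self._requests = ctx.Queue()
        self._responses = ctx.Queue()
        self._proc = ctx.Process(
            target=_checkpoint_worker,
            args=(self._requests, self._responses),
            daemon=True,
        )
        self._proc.start()                 # created once, on the first save

    def async_save(self, state_dict, checkpoint_id):
        self._ensure_started()
        # Hand the request to the daemon; the trainer process returns immediately.
        self._requests.put((state_dict, checkpoint_id))
```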

Test Plan: Added E2E UTs for process-based async save.

Differential Revision: D69272583

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @mhorowitz @pradeepfn @ekr0

pytorch-bot commented Feb 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147039

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 886b352 with merge base d260d4f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the module: distributed_checkpoint and oncall: distributed labels Feb 12, 2025
@facebook-github-bot commented:

This pull request was exported from Phabricator. Differential Revision: D69272583

@saumishr (Contributor) left a comment:
Looks good! Please fix the Lint errors before landing.


meetv18 added a commit to meetv18/pytorch that referenced this pull request Feb 25, 2025

meetv18 added the oncall: distributed checkpointing label Feb 25, 2025
meetv18 force-pushed the export-D69272583 branch from 9d2b9d2 to c6fe649 on March 3, 2025 at 19:13

meetv18 added a commit to meetv18/pytorch that referenced this pull request Mar 3, 2025
meetv18 force-pushed the export-D69272583 branch from c6fe649 to 85936c1 on March 3, 2025 at 19:17

meetv18 force-pushed the export-D69272583 branch from 85936c1 to 886b352 on March 3, 2025 at 20:15

@facebook-github-bot commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@teja-rao (Contributor) commented Mar 6, 2025

A bit late to the party, but I have a couple of quick comments:

  1. Why do we need the thread-based checkpointer, given that the process-based one outperforms the thread-based implementation?

  2. Switch from a multiprocessing queue to multiprocessing.Pipe, so that the recv on the pipe does not block if the sub-process dies abruptly. A subprocess can die abruptly without running its finally blocks on SIGTERM/SIGINT/SIGKILL, etc. (see the sketch below).
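
A minimal sketch of that suggestion, assuming a hypothetical checkpoint worker (this is not code from the PR): poll the parent end of the Pipe with a timeout and check that the child is still alive, so an abruptly killed checkpoint process cannot hang the trainer on a blocking receive.

```python
import multiprocessing as mp

def _worker(conn):
    # Echo back each request; None acts as a shutdown sentinel.
    for request in iter(conn.recv, None):
        conn.send(("done", request))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_conn,), daemon=True)
    proc.start()

    parent_conn.send("ckpt-step-100")
    # Instead of a blocking recv(), poll with a timeout and bail out if the
    # subprocess died without running its finally blocks.
    while not parent_conn.poll(timeout=1.0):
        if not proc.is_alive():
            raise RuntimeError(f"checkpoint process died with exit code {proc.exitcode}")
    print(parent_conn.recv())

    parent_conn.send(None)                 # ask the worker to exit cleanly
    proc.join()
```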

@meetv18 (Contributor, Author) commented Mar 6, 2025

Thanks @teja-rao

Re 1:

We didn't want to enable it by default. We kept the thread-based implementation as the default and exposed the process-based checkpointer via dependency injection (DI), so folks don't suddenly see process-based async checkpointing.
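
Roughly, the opt-in shape looks like the sketch below (class and parameter names are illustrative, not the actual API): the thread-based checkpointer stays the default, and callers who want the process-based behavior inject it explicitly.

```python
# Illustrative sketch of the "kept it as a DI" shape; names are hypothetical.
class ThreadBasedCheckpointer:
    """Existing default: stages and persists on a background thread."""
    def save(self, state_dict, checkpoint_id):
        ...

class ProcessBasedCheckpointer:
    """Opt-in: forwards the save request to a long-lived daemon process."""
    def save(self, state_dict, checkpoint_id):
        ...

def async_checkpoint(state_dict, checkpoint_id, checkpointer=None):
    # Callers only get process-based behavior if they explicitly inject it,
    # so existing users keep seeing the thread-based default.
    checkpointer = checkpointer or ThreadBasedCheckpointer()
    return checkpointer.save(state_dict, checkpoint_id)

# Opting in:
# async_checkpoint(app_state, "/tmp/ckpt/step_100", checkpointer=ProcessBasedCheckpointer())
```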

Re 2:

Ack on this, thanks for the feedback. Will follow up with the changes.

meetv18 removed the topic: not user facing label Mar 26, 2025
Labels: ciflow/trunk, fb-exported, Merged, oncall: distributed checkpointing, oncall: distributed, release notes: distributed (checkpoint), topic: new features