[DCP] Introduce process based async checkpointing by meetv18 · Pull Request #147039 · pytorch/pytorch

[DCP] Introduce process based async checkpointing #147039


Closed · wants to merge 1 commit from the export-D69272583 branch

Conversation

@meetv18 (Contributor) commented Feb 12, 2025

Summary:

Context

Background checkpoint upload thread interfering with the trainer thread:

In the async save API, the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation. This asymmetric computation contends heavily with the trainer thread for the GIL, causing GPU utilization to drop significantly for the E2E checkpoint duration.
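For reference, a minimal sketch of how the thread-based async save is typically invoked (assuming a distributed process group is already initialized, e.g. via torchrun; `app_state` and the checkpoint path are placeholders):

```python
import torch
import torch.distributed.checkpoint as dcp

# Placeholder state; in practice this holds model/optimizer state dicts.
app_state = {"step": 100, "weights": torch.zeros(4)}

future = dcp.async_save(app_state, checkpoint_id="/tmp/ckpt/step_100")
# Training continues here while a background thread stages and uploads the
# checkpoint; on rank0 that thread also pickles/unpickles SavePlans, which is
# where the GIL contention described above comes from.
future.result()  # block only when the previous async checkpoint must complete
```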

Solution:

Introduce async save via a checkpoint daemon process. This daemon process is created once (during the first save attempt) and serves async checkpoint requests for the remainder of the training lifetime.
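The daemon-process pattern described above can be sketched roughly as follows (an illustrative sketch with hypothetical names, not the code added by this PR): a single worker process is spawned lazily on the first save and reused for every later request, keeping the CPU-bound planning/pickling work off the trainer process entirely.

```python
import multiprocessing as mp

def _checkpoint_worker(request_queue, response_queue):
    # Runs inside the long-lived daemon process for the lifetime of training.
    while True:
        request = request_queue.get()
        if request is None:                # shutdown sentinel
            break
        state_dict, checkpoint_id = request
        # ... plan, pickle/exchange SavePlans, and write the checkpoint here ...
        response_queue.put(checkpoint_id)

class ProcessBasedAsyncCheckpointer:       # hypothetical name
    def __init__(self):
        self._proc = None

    def _ensure_started(self):
        if self._proc is not None:
            return
        ctx = mp.get_context("spawn")      # spawn avoids forking CUDA state
        self._requests = ctx.Queue()
        self._responses = ctx.Queue()
        self._proc = ctx.Process(
            target=_checkpoint_worker,
            args=(self._requests, self._responses),
            daemon=True,
        )
        self._proc.start()                 # created once, on the first save

    def async_save(self, state_dict, checkpoint_id):
        self._ensure_started()
        # Hand the request to the daemon; the trainer process returns immediately.
        self._requests.put((state_dict, checkpoint_id))
```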

Test Plan: Added E2E UTs for process-based async save.

Differential Revision: D69272583

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @mhorowitz @pradeepfn @ekr0

pytorch-bot commented Feb 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147039

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 886b352 with merge base d260d4f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the module: distributed_checkpoint and oncall: distributed labels Feb 12, 2025
@facebook-github-bot commented:

This pull request was exported from Phabricator. Differential Revision: D69272583

@saumishr (Contributor) left a comment:
Looks good! Please fix the Lint errors before landing.


meetv18 added a commit to meetv18/pytorch that referenced this pull request Feb 25, 2025

meetv18 added the oncall: distributed checkpointing label Feb 25, 2025
meetv18 force-pushed the export-D69272583 branch from 9d2b9d2 to c6fe649 on March 3, 2025 at 19:13

meetv18 added a commit to meetv18/pytorch that referenced this pull request Mar 3, 2025
meetv18 force-pushed the export-D69272583 branch from c6fe649 to 85936c1 on March 3, 2025 at 19:17

meetv18 force-pushed the export-D69272583 branch from 85936c1 to 886b352 on March 3, 2025 at 20:15

@facebook-github-bot commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@teja-rao (Contributor) commented Mar 6, 2025

A bit late to the party, but I have a couple of quick comments:

  1. Why do we need the thread-based checkpointer, given that the process-based one outperforms the thread-based implementation?

  2. Switch from a multiprocessing queue to multiprocessing.Pipe, so that the recv on the pipe does not block if the sub-process dies abruptly. A subprocess can die abruptly without running its finally blocks on SIGTERM/SIGINT/SIGKILL, etc. (see the sketch below).
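
A minimal sketch of that suggestion, assuming a hypothetical checkpoint worker (this is not code from the PR): poll the parent end of the Pipe with a timeout and check that the child is still alive, so an abruptly killed checkpoint process cannot hang the trainer on a blocking receive.

```python
import multiprocessing as mp

def _worker(conn):
    # Echo back each request; None acts as a shutdown sentinel.
    for request in iter(conn.recv, None):
        conn.send(("done", request))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_conn,), daemon=True)
    proc.start()

    parent_conn.send("ckpt-step-100")
    # Instead of a blocking recv(), poll with a timeout and bail out if the
    # subprocess died without running its finally blocks.
    while not parent_conn.poll(timeout=1.0):
        if not proc.is_alive():
            raise RuntimeError(f"checkpoint process died with exit code {proc.exitcode}")
    print(parent_conn.recv())

    parent_conn.send(None)                 # ask the worker to exit cleanly
    proc.join()
```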

@meetv18 (Contributor, Author) commented Mar 6, 2025

Thanks @teja-rao

Re 1:

We didn't want to enable it by default. We kept the thread-based implementation as the default and exposed the process-based checkpointer via dependency injection (DI), so folks don't suddenly see process-based async checkpointing.
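
Roughly, the opt-in shape looks like the sketch below (class and parameter names are illustrative, not the actual API): the thread-based checkpointer stays the default, and callers who want the process-based behavior inject it explicitly.

```python
# Illustrative sketch of the "kept it as a DI" shape; names are hypothetical.
class ThreadBasedCheckpointer:
    """Existing default: stages and persists on a background thread."""
    def save(self, state_dict, checkpoint_id):
        ...

class ProcessBasedCheckpointer:
    """Opt-in: forwards the save request to a long-lived daemon process."""
    def save(self, state_dict, checkpoint_id):
        ...

def async_checkpoint(state_dict, checkpoint_id, checkpointer=None):
    # Callers only get process-based behavior if they explicitly inject it,
    # so existing users keep seeing the thread-based default.
    checkpointer = checkpointer or ThreadBasedCheckpointer()
    return checkpointer.save(state_dict, checkpoint_id)

# Opting in:
# async_checkpoint(app_state, "/tmp/ckpt/step_100", checkpointer=ProcessBasedCheckpointer())
```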

Re 2:

Ack on this, thanks for the feedback. Will follow up with the changes.

meetv18 removed the topic: not user facing label Mar 26, 2025
Labels: ciflow/trunk, fb-exported, Merged, oncall: distributed checkpointing, oncall: distributed, release notes: distributed (checkpoint), topic: new features