8000 [inductor] parallel compile: Create new pipes for subproc communication by masnesral · Pull Request #131194 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[inductor] parallel compile: Create new pipes for subproc communication #131194

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

masnesral
Copy link
Contributor
@masnesral masnesral commented Jul 19, 2024

Stack from ghstack (oldest at bottom):

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes something in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

Differential Revision: D59968362

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

[ghstack-poisoned]
Copy link
pytorch-bot bot commented Jul 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131194

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7984abb with merge base 3c622fb (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…communication"

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Jul 19, 2024
Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

ghstack-source-id: 0929961
Pull Request resolved: #131194
@@ -21,20 +21,6 @@
log = logging.getLogger(__name__)


class Pipe(typing.Protocol):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this was really needed. typechecking gives me no errors.

@masnesral masnesral linked an issue Jul 19, 2024 that may be closed by this pull request
@masnesral masnesral requested review from zdevito and eellison July 19, 2024 15:40
@masnesral
Copy link
Contributor Author

I followed @zdevito 's simple_worker_pool approach, forwarded by @eellison

@masnesral masnesral added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 19, 2024
@masnesral
Copy link
Contributor Author

@masnesral has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@masnesral
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

DiweiSun pushed a commit to DiweiSun/pytorch that referenced this pull request Jul 22, 2024
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
@atalman
Copy link
Contributor
atalman commented Aug 15, 2024

@pytorchbot cherry-pick --onto release/2.4 -c critical --fixes #131070

pytorchbot pushed a commit that referenced this pull request Aug 15, 2024
…on (#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: #131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman

(cherry picked from commit 3c43fe0)
@pytorchbot
Copy link
Collaborator

Cherry picking #131194

The cherry pick PR is at #133590 and it is linked with issue #131070. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

atalman pushed a commit to atalman/pytorch that referenced this pull request Aug 20, 2024
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
atalman pushed a commit to atalman/pytorch that referenced this pull request Aug 20, 2024
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman

(cherry picked from commit 3c43fe0)
pytorchmergebot pushed a commit that referenced this pull request Aug 20, 2024
During cherry-picking we want to use default setting and fail if there is merge conflict
Here an example of invalid conflict resolution:
#131194
and cherry-pick
#133590

Pull Request resolved: #134047
Approved by: https://github.com/kit1980
@atalman
Copy link
Contributor
atalman commented Aug 20, 2024

@pytorchbot cherry-pick --onto release/2.4 -c critical

@pytorchbot
Copy link
Collaborator

Cherry picking #131194

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 3c43fe068f4c9d25d110106769bccab94da5f352 returned non-zero exit code 1

Auto-merging torch/_inductor/compile_worker/__main__.py
CONFLICT (content): Merge conflict in torch/_inductor/compile_worker/__main__.py
Auto-merging torch/_inductor/compile_worker/subproc_pool.py
error: could not apply 3c43fe068f... [inductor] parallel compile: Create new pipes for subproc communication (#131194)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

atalman added a commit that referenced this pull request Aug 21, 2024
#134042)

* [inductor] parallel compile: Create new pipes for subproc communication (#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: #131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman

* add_catch_statement

* log_fix

---------

Co-authored-by: Sam Larsen <slarsen@meta.com>
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
During cherry-picking we want to use default setting and fail if there is merge conflict
Here an example of invalid conflict resolution:
#131194
and cherry-pick
#133590

Pull Request resolved: #134047
Approved by: https://github.com/kit1980
@github-actions github-actions bot deleted the gh/masnesral/99/head branch September 23, 2024 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subproc exception with torch complie for Torch 2.4.0 and Nightly
6 participants
0