8000 [inductor] parallel compile: Create new pipes for subproc communicati… by atalman · Pull Request #134042 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[inductor] parallel compile: Create new pipes for subproc communicati… #134042

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 21, 2024

Conversation

atalman
Copy link
Contributor
@atalman atalman commented Aug 20, 2024

…on (#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes something in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: D59968362
Pull Request resolved: #131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalmanFixes #ISSUE_NUMBER

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

masnesral and others added 2 commits August 20, 2024 13:07
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
Copy link
pytorch-bot bot commented Aug 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134042

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 70 Unrelated Failures

As of commit 712d8c6 with merge base b66e3f0 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor
@masnesral masnesral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 20, 2024
@atalman atalman merged commit 18736d2 into pytorch:release/2.4 Aug 21, 2024
132 of 204 checks passed
@atalman atalman deleted the cherry_pick_inductor_parallel branch August 21, 2024 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request module: inductor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0