[inductor] parallel compile: Create new pipes for subproc communication #131194

masnesral · 2024-07-19T15:29:29Z

Stack from ghstack (oldest at bottom):

-> [inductor] parallel compile: Create new pipes for subproc communication #131194

Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the command line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes something in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.
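For readers unfamiliar with the pattern, here is a minimal sketch of the general idea, not the code in this PR: the parent creates one pipe per direction, passes the child's descriptor numbers as command-line arguments, and both sides exchange messages over those descriptors instead of stdin/stdout. The names `CHILD_SOURCE` and `round_trip` and the 8-byte length-prefix framing are assumptions made for illustration. The sketch is POSIX-only because it relies on `pass_fds`.

```python
import os
import struct
import subprocess
import sys

# Hypothetical child program (illustration only): it reads the two descriptor
# numbers from argv, prints noise to stdout, then answers one message.
CHILD_SOURCE = r"""
import os, struct, sys

read_fd, write_fd = int(sys.argv[1]), int(sys.argv[2])
print("stray output from some imported library")  # harmless: stdout is not the IPC channel

with os.fdopen(read_fd, "rb") as rx, os.fdopen(write_fd, "wb") as tx:
    (length,) = struct.unpack("<Q", rx.read(8))  # assumed framing: 8-byte length prefix
    payload = rx.read(length)
    reply = payload.upper()
    tx.write(struct.pack("<Q", len(reply)) + reply)
    tx.flush()
"""


def round_trip(message: bytes) -> bytes:
    """Send one message to a fresh child over dedicated pipes and return its reply."""
    # One pipe per direction, so neither side touches stdin/stdout.
    child_read, parent_write = os.pipe()
    parent_read, child_write = os.pipe()

    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(child_read), str(child_write)],
        pass_fds=(child_read, child_write),  # keep these descriptors open in the child
        stdout=subprocess.DEVNULL,  # discarded only to keep the demo quiet
    )
    # The parent no longer needs the child's ends of the pipes.
    os.close(child_read)
    os.close(child_write)

    with os.fdopen(parent_write, "wb") as tx, os.fdopen(parent_read, "rb") as rx:
        tx.write(struct.pack("<Q", len(message)) + message)
        tx.flush()
        (length,) = struct.unpack("<Q", rx.read(8))
        reply = rx.read(length)

    proc.wait()
    return reply
```

In this sketch the child's `print` call cannot disturb the exchange, because the framed messages never travel over stdout. With stdout as the channel, the same stray text would be read as part of a message.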

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

Differential Revision: D59968362

pytorch-bot · 2024-07-19T15:29:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131194

✅ No Failures

As of commit 7984abb with merge base 3c622fb ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

masnesral · 2024-07-19T15:38:16Z

torch/_inductor/compile_worker/subproc_pool.py

@@ -21,20 +21,6 @@
 log = logging.getLogger(__name__)


-class Pipe(typing.Protocol):


I don't think this was really needed; type checking gives me no errors without it.

masnesral · 2024-07-19T15:41:31Z

I followed @zdevito's simple_worker_pool approach, forwarded by @eellison.

masnesral · 2024-07-19T16:37:37Z

@masnesral has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

masnesral · 2024-07-20T02:15:49Z

@pytorchbot merge

pytorchmergebot · 2024-07-20T02:17:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

…on (pytorch#131194) Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the command line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it. Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362) Pull Request resolved: pytorch#131194 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman

atalman · 2024-08-15T17:43:52Z

@pytorchbot cherry-pick --onto release/2.4 -c critical --fixes #131070

…on (#131194) Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the command line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it. Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362) Pull Request resolved: #131194 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman (cherry picked from commit 3c43fe0)

pytorchbot · 2024-08-15T17:48:26Z

Cherry picking #131194

The cherry pick PR is at #133590 and it is linked with issue #131070. The following tracker issues are updated:

[v2.4.1] Release Tracker #132400 (comment)

During cherry-picking we want to use the default settings and fail if there is a merge conflict. Here is an example of an invalid conflict resolution: #131194 and cherry-pick #133590. Pull Request resolved: #134047 Approved by: https://github.com/kit1980

atalman · 2024-08-20T22:24:30Z

@pytorchbot cherry-pick --onto release/2.4 -c critical

pytorchbot · 2024-08-20T22:29:05Z

Cherry picking #131194

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 3c43fe068f4c9d25d110106769bccab94da5f352 returned non-zero exit code 1

Auto-merging torch/_inductor/compile_worker/__main__.py
CONFLICT (content): Merge conflict in torch/_inductor/compile_worker/__main__.py
Auto-merging torch/_inductor/compile_worker/subproc_pool.py
error: could not apply 3c43fe068f... [inductor] parallel compile: Create new pipes for subproc communication (#131194)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"

#134042) * [inductor] parallel compile: Create new pipes for subproc communication (#131194) Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the command line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it. Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362) Pull Request resolved: #131194 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman * add_catch_statement * log_fix --------- Co-authored-by: Sam Larsen <slarsen@meta.com>

masnesral mentioned this pull request Jul 19, 2024

Use inductor TestCase for test_replicate_with_compiler.py #131193

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Jul 19, 2024

masnesral added the topic: not user facing topic category label Jul 19, 2024

masnesral commented Jul 19, 2024

masnesral linked an issue Jul 19, 2024 that may be closed by this pull request

Subproc exception with torch complie for Torch 2.4.0 and Nightly #131070

Closed

masnesral requested review from zdevito and eellison July 19, 2024 15:40

malfet approved these changes Jul 19, 2024

eellison approved these changes Jul 19, 2024

atalman approved these changes Jul 19, 2024

masnesral added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 19, 2024

pytorchmergebot added the merging label Jul 20, 2024

pytorchmergebot added the Merged label Jul 20, 2024

pytorchmergebot closed this in 3c43fe0 Jul 20, 2024

pytorchmergebot removed the merging label Jul 20, 2024

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

pytorchbot mentioned this pull request Aug 15, 2024

[inductor] parallel compile: Create new pipes for subproc communication #133590

Closed

pytorchbot mentioned this pull request Aug 15, 2024

[v2.4.1] Release Tracker #132400

Closed

atalman mentioned this pull request Aug 20, 2024

[inductor] parallel compile: Create new pipes for subproc communicati… #134042

Merged

atalman mentioned this pull request Aug 20, 2024

Cherry-Picking don't resolve conflicts #134047

Closed

github-actions bot deleted the gh/masnesral/99/head branch September 23, 2024 02:07
