[training] Adding NUMA support for pytorch by efiks · Pull Request #150597 · pytorch/pytorch · GitHub

[training] Adding NUMA support for pytorch #150597

Closed

efiks wants to merge 1 commit into pytorch:main from efiks:export-D72321369

+17 −1

Contributor

efiks commented

Test Plan:
build and run tests for modified libraries locally

buck2 build arvr/mode/platform010/opt //xplat/caffe2:pytorch_ovrsource
buck run arvr/mode/win/debug-md -c python.package_style=inplace //xplat/caffe2:pytorch_test_ovrsource
buck test arvr/mode/linux/opt -c python.package_style=inplace //xplat/caffe2:pytorch_test_ovrsource
buck test mode/opt //caffe2/fb/test:_utils_internal_test

Differential Revision: D72321369

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @efiks, please do step 2 of the internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

pytorch-bot bot added oncall: distributed release notes: distributed (c10d) labels

pytorch-bot bot commented

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150597

❌ 1 New Failure

As of commit 1348578 with merge base 52d172e:

NEW FAILURE - The following job has failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for torch/distributed/distributed_c10d.py:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

facebook-github-bot commented

This pull request was exported from Phabricator. Differential Revision: D72321369

facebook-github-bot added the fb-exported label

efiks force-pushed the export-D72321369 branch from 05aaf9b to c95f1d8 Compare

April 3, 2025 04:03

efiks force-pushed the export-D72321369 branch from c95f1d8 to 7936f0d Compare

April 3, 2025 04:23

efiks force-pushed the export-D72321369 branch from 7936f0d to bdc08d5 Compare

April 3, 2025 05:22

efiks force-pushed the export-D72321369 branch from bdc08d5 to 96ed012 Compare

April 3, 2025 05:32

efiks force-pushed the export-D72321369 branch from 96ed012 to f08f5eb Compare

April 3, 2025 13:57

efiks force-pushed the export-D72321369 branch from f08f5eb to 6de381c Compare

April 3, 2025 18:35

efiks added a commit to efiks/pytorch that referenced this pull request


          [training] Adding NUMA support for pytorch (pytorch#150597)

6de381c

Summary:

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Test Plan: build and run tests for modified libraries locally

Differential Revision: D72321369
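The summary above describes an opt-in, environment-variable-controlled NUMA binding, but the page does not show how it is implemented. The following is only a minimal sketch of the general idea on Linux, not the PR's code: the variable name `EXAMPLE_ENABLE_NUMA_BINDING` and the helpers `parse_cpu_list`, `cpus_for_numa_node`, and `maybe_bind_to_numa_node` are hypothetical names chosen for illustration. It restricts CPU affinity only and does not set a memory placement policy.

```python
import os
from typing import Optional, Set

# Hypothetical variable name; the PR page does not show the real one.
NUMA_ENV_VAR = "EXAMPLE_ENABLE_NUMA_BINDING"


def parse_cpu_list(cpulist: str) -> Set[int]:
    """Parse a Linux cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_for_numa_node(node: int) -> Set[int]:
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpu_list(f.read())


def maybe_bind_to_numa_node(node: Optional[int]) -> bool:
    """Restrict the calling thread to the CPUs of `node` if the user opted in.

    Threads created afterwards inherit the mask. Returns True only if the
    affinity was actually changed.
    """
    if os.environ.get(NUMA_ENV_VAR) != "1" or node is None:
        return False
    if not hasattr(os, "sched_setaffinity"):  # e.g. macOS or Windows
        return False
    try:
        cpus = cpus_for_numa_node(node)
        if not cpus:
            return False
        os.sched_setaffinity(0, cpus)
    except OSError:
        # Missing sysfs entry, or CPUs not permitted for this process.
        return False
    return True
```

With the variable unset the helper is a no-op, which keeps the default behavior unchanged for users who do not opt in.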

efiks mentioned this pull request

Adding NUMA support for pytorch pytorch/FBGEMM#3926

Open

efiks added a commit to efiks/FBGEMM that referenced this pull request


          Adding NUMA support for pytorch

849dce4

Summary:
X-link: facebookresearch/FBGEMM#1014

X-link: pytorch/pytorch#150597

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Differential Revision: D72321369

efiks added a commit to efiks/FBGEMM that referenced this pull request


          Adding NUMA support for pytorch (pytorch#3926)

6b3bdd1

Summary:

X-link: facebookresearch/FBGEMM#1014

X-link: pytorch/pytorch#150597

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Differential Revision: D72321369

efiks force-pushed the export-D72321369 branch from 6de381c to c43f56e Compare

April 5, 2025 14:59


          [training] Adding NUMA support for pytorch (pytorch#150597)

Summary:
X-link: pytorch/FBGEMM#3926

X-link: facebookresearch/FBGEMM#1014


Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Test Plan: build and run tests for modified libraries locally

Differential Revision: D72321369

efiks force-pushed the export-D72321369 branch from c43f56e to 1348578 Compare

April 8, 2025 15:54

kwen2501 reviewed

torch/distributed/distributed_c10d.py

Comment on lines +1675 to +1682

    
    # If device_id is provided, try to bind the current process to the
    # NUMA node attached to the device.
    if device_id is not None:
        maybe_enable_numa_binding(device_type=device_id.type, device_id=device_id.index)
    else:
        maybe_enable_numa_binding()

Contributor

kwen2501

Just curious - would it be a bit late to do binding here?
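One possible reading of this question: by the time the process group is initialized, the worker may already have allocated memory and started threads, and on Linux changing the affinity of the calling thread does not move those. A launcher-side alternative is to set the affinity before the worker starts, since a child process inherits its parent's CPU mask. The sketch below illustrates that inheritance only; `run_pinned` is a hypothetical helper, not something from this PR or from PyTorch, and it is Linux-only.

```python
import os
import subprocess
import sys
from typing import Iterable, List


def run_pinned(cmd: List[str], cpus: Iterable[int]) -> subprocess.CompletedProcess:
    """Run `cmd` so that it starts with its CPU affinity already restricted.

    The parent narrows its own mask, launches the child (which inherits the
    mask), and then restores the original mask. Linux only.
    """
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, set(cpus))
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True)
    finally:
        # Restore the launcher's own affinity regardless of the outcome.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    one_cpu = {min(os.sched_getaffinity(0))}
    child = "import os; print(sorted(os.sched_getaffinity(0)))"
    result = run_pinned([sys.executable, "-c", child], one_cpu)
    print("child affinity:", result.stdout.strip())
```

Binding at launch means the worker's first allocations already happen on the intended CPUs, at the cost of the launcher needing to know the device-to-node mapping up front.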

Contributor

kwen2501 commented

Relates to RFC #148689, alternative to PR #149334

Contributor

github-actions bot commented

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label

github-actions bot closed this

Labels

fb-exported oncall: distributed release notes: distributed (c10d) Stale
