[training] Adding NUMA support for pytorch by efiks · Pull Request #150597 · pytorch/pytorch · GitHub

@efiks
Contributor
@efiks efiks commented Apr 3, 2025

Test Plan:
build and run tests for modified libraries locally

buck2 build arvr/mode/platform010/opt //xplat/caffe2:pytorch_ovrsource
buck run arvr/mode/win/debug-md -c python.package_style=inplace //xplat/caffe2:pytorch_test_ovrsource
buck test arvr/mode/linux/opt -c python.package_style=inplace //xplat/caffe2:pytorch_test_ovrsource
buck test mode/opt //caffe2/fb/test:_utils_internal_test

Differential Revision: D72321369

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot
pytorch-bot bot commented Apr 3, 2025

This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @efiks, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" and "release notes: distributed (c10d)" labels Apr 3, 2025
@pytorch-bot
pytorch-bot bot commented Apr 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150597

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1348578 with merge base 52d172e (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72321369

@efiks efiks force-pushed the export-D72321369 branch from c95f1d8 to 7936f0d Compare April 3, 2025 04:23
@efiks efiks force-pushed the export-D72321369 branch from 7936f0d to bdc08d5 Compare April 3, 2025 05:22
@efiks efiks force-pushed the export-D72321369 branch from bdc08d5 to 96ed012 Compare April 3, 2025 05:32
@efiks efiks force-pushed the export-D72321369 branch from 96ed012 to f08f5eb Compare April 3, 2025 13:57
@efiks efiks force-pushed the export-D72321369 branch from f08f5eb to 6de381c Compare April 3, 2025 18:35
efiks added a commit to efiks/pytorch that referenced this pull request Apr 3, 2025
Summary:

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Test Plan: build and run tests for modified libraries locally

Differential Revision: D72321369
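
The summary above mentions an environment variable that controls NUMA binding, but the conversation never names it. As a rough sketch only, an opt-in check of that kind might look like the following. `EXAMPLE_ENABLE_NUMA_BINDING` and `numa_binding_requested` are hypothetical names invented for this illustration, not identifiers taken from the PR.

```python
import os

# Hypothetical variable name, chosen for illustration only; the PR
# conversation does not show the name it actually uses.
NUMA_ENV_VAR = "EXAMPLE_ENABLE_NUMA_BINDING"

_TRUTHY = {"1", "true", "yes", "on"}


def numa_binding_requested(environ=None):
    """Return True when the (hypothetical) opt-in variable holds a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(NUMA_ENV_VAR, "0").strip().lower() in _TRUTHY
```

Passing a mapping instead of reading `os.environ` directly keeps the check easy to test; binding stays off unless the variable is set explicitly.
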
efiks added a commit to efiks/FBGEMM that referenced this pull request Apr 3, 2025
Summary:
X-link: facebookresearch/FBGEMM#1014

X-link: pytorch/pytorch#150597

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Differential Revision: D72321369
efiks added a commit to efiks/FBGEMM that referenced this pull request Apr 5, 2025
Summary:

X-link: facebookresearch/FBGEMM#1014

X-link: pytorch/pytorch#150597

Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Differential Revision: D72321369
@efiks efiks force-pushed the export-D72321369 branch from 6de381c to c43f56e Compare April 5, 2025 14:59
Summary:
X-link: pytorch/FBGEMM#3926

X-link: facebookresearch/FBGEMM#1014


Add an entry point and an environment variable to control NUMA binding / assignment for distributed PyTorch runs (training).

Test Plan: build and run tests for modified libraries locally

Differential Revision: D72321369
@efiks efiks force-pushed the export-D72321369 branch from c43f56e to 1348578 Compare April 8, 2025 15:54
Comment on lines +1675 to +1682

# If device_id is provided, try to bind the current process to the
# NUMA node attached to the device
if device_id is not None:
    maybe_enable_numa_binding(device_type=device_id.type, device_id=device_id.index)
else:
    maybe_enable_numa_binding()
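
The body of `maybe_enable_numa_binding` is not shown in this conversation. As a minimal sketch of what device-local binding can involve on Linux, and not the PR's implementation, the helpers below read a PCI device's NUMA node and that node's CPU list from sysfs, then pin the calling thread with `os.sched_setaffinity`. The names `parse_cpulist`, `device_numa_node`, and `bind_to_numa_node` are hypothetical; mapping a torch device index to a PCI address is assumed to happen elsewhere. This sets CPU affinity only and does not change the memory allocation policy.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def device_numa_node(pci_address):
    """Return the NUMA node sysfs reports for a PCI device, or None if unknown."""
    path = Path("/sys/bus/pci/devices") / pci_address / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when the device has no NUMA node assigned.
    return node if node >= 0 else None


def bind_to_numa_node(node):
    """Pin the calling thread to the CPUs of `node`; return True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # affinity calls are not available on this platform
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    try:
        cpus = parse_cpulist(path.read_text())
    except (OSError, ValueError):
        return False
    # Keep only CPUs the caller is already allowed to run on (e.g. under cgroups).
    target = cpus & os.sched_getaffinity(0)
    if not target:
        return False
    os.sched_setaffinity(0, target)
    return True
```

On Linux, passing `0` to `os.sched_setaffinity` changes the mask of the calling thread; threads created afterwards inherit it, while threads that already exist keep their previous masks.
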

Just curious - would it be a bit late to do binding here?

@kwen2501
Contributor

Relates to RFC #148689, alternative to PR #149334

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 28, 2025
@github-actions github-actions bot closed this Jul 28, 2025
Labels

fb-exported, oncall: distributed, release notes: distributed (c10d), Stale

3 participants