[c10d] Fix extra CUDA context created by barrier by kwen2501 · Pull Request #149144 · pytorch/pytorch · GitHub

[c10d] Fix extra CUDA context created by barrier #149144

Closed
kwen2501 wants to merge 3 commits

Conversation

@kwen2501 (Contributor) commented Mar 13, 2025

Stack from ghstack (oldest at bottom):

Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses the `device_id` given by the
user when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
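
A minimal usage sketch of what this fix targets (not part of the PR diff; the launcher environment and `LOCAL_RANK` handling are assumptions): binding the process group to a device via `device_id` lets c10d create its dummy dispatch tensor on the right GPU instead of spawning an extra context on cuda:0.

```python
import os
import torch
import torch.distributed as dist

# Assumes a torchrun-style launcher that sets RANK/WORLD_SIZE/LOCAL_RANK/
# MASTER_ADDR/MASTER_PORT in the environment.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),  # device index c10d will use for dispatch
)
dist.barrier()  # dispatches on cuda:<local_rank>; no extra context on cuda:0
dist.destroy_process_group()
```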

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Mar 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149144

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 731b4cd with merge base e9e1aac:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the oncall: distributed (distributed oncall triage queue) and release notes: distributed (c10d) labels Mar 13, 2025
kwen2501 added a commit that referenced this pull request Mar 13, 2025
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

ghstack-source-id: 258020d
Pull Request resolved: #149144
kwen2501 requested review from H-Huang, wconstab and fduwjj on March 13, 2025 18:39
kwen2501 added the topic: bug fixes label Mar 13, 2025
@XilunWu (Contributor) left a comment:

LGTM, thx for the fix!

elif group.bound_device_id is not None:
# Use device id from `init_process_group(device_id=...)`
opts.device = group.bound_device_id
elif device.type == "cpu" or get_backend(group) == Backend.GLOO:
Member:

Is there a way to avoid depending on specific backend names/types? This makes it hard to add new ones that are compatible with core PT -- I've been trying to clean these up for torchft

@kwen2501 (Contributor, Author):

Yeah, I hope there is a way. The specific code is for a case where the user is on a GPU machine but only wants to use the CPU to do some stuff...
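
For illustration, a minimal sketch of that CPU-only scenario (the backend choice and the sanity check are assumptions, not code from this PR): with a Gloo group, the barrier resolves to the CPU device and should not initialize CUDA as a side effect.

```python
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
dist.init_process_group(backend="gloo")  # CPU-only collectives on a GPU machine
dist.barrier()                           # resolves to torch.device("cpu")
# Sanity check (assumes nothing else in this process has touched CUDA).
assert not torch.cuda.is_initialized()
dist.destroy_process_group()
```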

@@ -4654,30 +4654,43 @@ def barrier(
group (ProcessGroup, optional): The process group to work on. If None,
the default process group will be used.
async_op (bool, optional): Whether this op should be an async op
device_ids ([int], optional): List of device/GPU ids.
device_ids ([int], optional): List of device/GPU ids. Only one id is expected.
Member:

Can we change this to

Only the first ID is used.

@kwen2501 (Contributor, Author):

I do mean only one is expected, because now we are expecting one device per thread. Some of the API signatures came from the old days.
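
A short sketch of the expected call pattern (illustrative; the `LOCAL_RANK` handling is an assumption about the launcher): exactly one ID, typically the local rank of the calling process.

```python
import os
import torch.distributed as dist

# Assumes the default process group has already been initialized.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
dist.barrier(device_ids=[local_rank])  # a one-element list; one device per process
```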

# Use device id from `init_process_group(device_id=...)`
opts.device = group.bound_device_id
elif device.type == "cpu" or get_backend(group) == Backend.GLOO:
opts.device = torch.device("cpu")
Member:

Will Gloo fail if it's not a CPU device?

Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 13, 2025
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

ghstack-source-id: 428f13a
Pull Request resolved: #149144
@kwen2501 (Contributor, Author):

@pytorchbot merge

pytorch-bot added the ciflow/trunk label (trigger trunk jobs on your pull request) Mar 17, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test

Details for Dev Infra team: raised by workflow job.

@kwen2501 (Contributor, Author):

The failure seems to be an issue with the CI instance and is unrelated.
@pytorchbot merge -i

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged while ignoring the following 1 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 070865253dc02320a0c6bf7395e44556a6f4998c returned non-zero exit code 1

Auto-merging test/distributed/test_c10d_nccl.py
Auto-merging torch/distributed/distributed_c10d.py
CONFLICT (content): Merge conflict in torch/distributed/distributed_c10d.py
error: could not apply 070865253dc... [c10d] Fix extra CUDA context created by barrier
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job.

@cyyever (Collaborator) commented Mar 29, 2025:

@pytorchbot merge -f "Unrelated failures"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 070865253dc02320a0c6bf7395e44556a6f4998c returned non-zero exit code 1

Auto-merging test/distributed/test_c10d_nccl.py
Auto-merging torch/distributed/distributed_c10d.py
CONFLICT (content): Merge conflict in torch/distributed/distributed_c10d.py
error: could not apply 070865253dc... [c10d] Fix extra CUDA context created by barrier
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job.

Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request May 2, 2025
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

ghstack-source-id: 96c32b9
Pull Request resolved: #149144
@kwen2501 (Contributor, Author) commented May 3, 2025:

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kwen2501 added a commit that referenced this pull request May 5, 2025
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

ghstack-source-id: 96c32b9
Pull Request resolved: #149144
@huydhn (Contributor) commented May 5, 2025:

@pytorchbot revert -m 'Internal failure looks legit' -c ghfirst

E           Traceback of where the remote function was issued on controller (most recent call last):
E             <not related to a specific invocation>
E           Traceback of where the remote function failed on worker (most recent call last):
E             File "<unknown>", line None, in remote function failed: Traceback (most recent call last):
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/monarch/worker/_testing_function.py", line 101, in barrier
E               dist.barrier(group=group, async_op=False, device_ids=device_ids)
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/torch/distributed/c10d_logger.py", line 81, in wrapper
E               return func(*args, **kwargs)
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/torch/distributed/distributed_c10d.py", line 4770, in barrier
E               work = group.barrier(opts=opts)
E           <class 'RuntimeError'>: CUDA error: invalid device ordinal
E           CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E           Device-side assertion tracking was not enabled by user.
E           RuntimeError: remote function failed: Traceback (most recent call last):
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/monarch/worker/_testing_function.py", line 101, in barrier
E               dist.barrier(group=group, async_op=False, device_ids=device_ids)
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/torch/distributed/c10d_logger.py", line 81, in wrapper
E               return func(*args, **kwargs)
E             File "/re_cwd/buck-out/v2/gen/fbcode/d6a07959eaa8ed59/monarch/python/tests/__test_remote_functions__/test_remote_functions#link-tree/torch/distributed/distributed_c10d.py", line 4770, in barrier
E               work = group.barrier(opts=opts)
E           <class 'RuntimeError'>: CUDA error: invalid device ordinal
E           CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E           Device-side assertion tracking was not enabled by user.
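
For reference, a hypothetical repro sketch of this failure mode (not the internal test itself): the error can surface when `device_ids` names an ordinal the calling process cannot see, e.g. a global ID used under a restrictive CUDA_VISIBLE_DEVICES.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch; rank/world size and rendezvous come from the environment.
dist.init_process_group(backend="nccl")
bad_id = torch.cuda.device_count()  # one past the last visible device ordinal
try:
    dist.barrier(device_ids=[bad_id])  # dummy-tensor dispatch targets an invalid device
except RuntimeError as err:
    print(f"barrier failed: {err}")    # e.g. "CUDA error: invalid device ordinal"
dist.destroy_process_group()
```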

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 5, 2025
This reverts commit 457fa82.

Reverted #149144 on behalf of https://github.com/huydhn due to: Internal failure looks legit (comment).
@pytorchmergebot (Collaborator):

@kwen2501 your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td (do not run TD on this PR) labels May 5, 2025
@kwen2501 (Contributor, Author) commented May 6, 2025:

@pytorchbot merge -f "Internal test was wrong; OSS version of barrier tests passed"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

github-merge-queue bot pushed a commit to intel/torch-xpu-ops that referenced this pull request May 9, 2025
Refer to pytorch/pytorch#149144. Currently, `dist.barrier` accepts
`device_ids` as a parameter that doesn't have to be a list. When
`device_ids` is not provided or another value is passed, `barrier` will
use the device associated with the process group at initialization to
perform the synchronization.
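
As a backend-agnostic illustration (the accelerator query and the backend-name mapping below are assumptions, not taken from this commit), the same pattern can bind the group to whatever accelerator is present and let `barrier` fall back to that bound device when `device_ids` is omitted:

```python
import torch
import torch.distributed as dist

# torch.accelerator.current_accelerator() is the public wrapper around the
# torch._C._get_accelerator() call mentioned in the PR description.
acc = torch.accelerator.current_accelerator()  # e.g. cuda / xpu, or None
if acc is not None:
    device = torch.device(acc.type, 0)
    backend = {"cuda": "nccl", "xpu": "xccl"}.get(acc.type, "gloo")  # illustrative mapping
    dist.init_process_group(backend=backend, device_id=device)
else:
    dist.init_process_group(backend="gloo")
dist.barrier()  # no device_ids: uses the device bound at init, or the CPU
dist.destroy_process_group()
```
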
atalman pushed a commit that referenced this pull request May 27, 2025
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

ghstack-source-id: 96c32b9
Pull Request resolved: #149144
Labels: ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: distributed (c10d) (release notes category), Reverted, topic: bug fixes (topic category)
7 participants