
[CUDA] test_c10d_nccl test_extra_cuda_context failure due to _helper_test_extra_cuda_context_by_memory #153122


Open
nWEIdia opened this issue May 8, 2025 · 2 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: tests (Issues related to tests, not the torch.testing module)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

nWEIdia (Collaborator) commented May 8, 2025

While trying to replace the cuda 11.8 distributed jobs with cuda 12.6 (PR), test_extra_cuda_context failed, and I had to raise the 1.5x memory heuristic to 1.7x to temporarily work around the failure.

Once this is properly fixed, the threshold should be rolled back from 1.7 to 1.5.
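For context, a minimal sketch of the idea behind the memory-based helper, assuming torch.cuda.mem_get_info as the measurement primitive (the exact bookkeeping in _helper_test_extra_cuda_context_by_memory may differ):

```python
# Illustrative sketch, not the exact test code. Assumes
# dist.init_process_group has already been called and each rank owns one GPU.
import torch
import torch.distributed as dist

def check_no_extra_context_by_memory(rank: int, multiplier: float = 1.7) -> None:
    device = torch.device(f"cuda:{rank}")
    x = torch.empty((1,), device=device)
    if rank == 0:
        # Baseline snapshot; includes rank 0's own context.
        free, total = torch.cuda.mem_get_info(device)
        used_before = float(total - free)
    dist.all_reduce(x)  # the collective that could leak contexts onto device 0
    torch.cuda.synchronize(device)
    if rank == 0:
        free, total = torch.cuda.mem_get_info(device)
        used_after = float(total - free)
        # An extra context on device 0 would roughly double the usage, so
        # anything above `multiplier` x the baseline is treated as a failure.
        assert used_after < used_before * multiplier, (
            f"used {used_after:.0f} B vs baseline {used_before:.0f} B: "
            "extra CUDA context suspected"
        )
```

One plausible reason this heuristic is fragile: the threshold compares whole-device memory, so anything else that changes allocation on device 0 across CUDA or NCCL versions can shift the ratio.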

Previously failed job:
https://github.com/pytorch/pytorch/actions/runs/14656019861/job/41132964287

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @msaroufim @eqy @jerryzh168 @mruberry @ZainRizvi @tinglvv @atalman @malfet

eqy (Collaborator) commented May 8, 2025

Adding a note for myself: we might want to take a closer look at this one, since the CUDA 11 -> CUDA 12 transition did have a lot of issues around extra CUDA context creation.

jbschlosser added the oncall: distributed, module: cuda, and module: tests labels May 8, 2025
pytorchmergebot pushed a commit that referenced this issue May 20, 2025
… to cu126-sm75 (#151594)

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed; some if not all of them will take a while to root-cause and fix, so they are temporarily skipped after filing the issues below:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skips - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
pytorchmergebot pushed a commit that referenced this issue May 22, 2025
…12.6 (#151594)

This PR moves the distributed CUDA CI jobs from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed; some if not all of them will take a while to root-cause and fix, so they are temporarily skipped after filing the issues below:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skips - what is the Sandcastle OS?)
#153122 CUDA context related
#153517 NCCL regression; a future NCCL release may fix it
#154073 skip test_symmetric_memory on CUDA 12.6 until it is fixed

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
nWEIdia (Collaborator, Author) commented May 23, 2025

From the discussion here: #154174 (comment)
It looks like we are going to remove this _by_memory test in favor of the NVML-based check.

I will likely close this issue soon without further investigation.
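For reference, a minimal sketch of what an NVML-based check could look like (the pynvml calls are real APIs; the scaffolding around them is an assumption, not the actual helper):

```python
# Each process holding a CUDA context on a device shows up in NVML's
# compute-process list, so counting entries detects extra contexts
# directly instead of via a memory heuristic.
import pynvml

def count_cuda_contexts(device_index: int) -> int:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# In the test, each rank should hold exactly one context on its own device,
# e.g.: assert count_cuda_contexts(rank) == 1
```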

cc @ngimel @kwen2501 @atalman
