[PTD BE DAY]Burn Down Distributed Disabled Tests!! · Issue #132845 · pytorch/pytorch · GitHub

Open
31 of 59 tasks
wz337 opened this issue Aug 7, 2024 · 2 comments
Assignees
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@wz337
Contributor
wz337 commented Aug 7, 2024

Hey, folks! We have 59 flaky distributed tests that are currently disabled. I have grouped them into a few categories below. Some were disabled as recently as today, and the oldest dates back to Jan 11, 2019. Let's see if we can burn them down or deprecate the tests that are no longer useful.

If you would like to take an issue, please check the box on this page and assign the issue to yourself. Thanks!

C10D

NCCL

GLOO

Tcpstore

MultiProcessing

MultiThreadedTestCase

RPC

DeviceMesh and DTensor

DeviceMesh

DTensor

For some of the op tests below, @awgu did a bit of digging, and there might be some issues in our hashing/caching. More context in #132114.

DDP, FSDP, PiPPy

FSDP1

FSDP2

DDP

pipeline

Other

DCP

Functional Collectives

ShardedTensor

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wconstab @d4l3k @c-p-i-o

@wz337 wz337 changed the title [PTD]Burn Down Distributed Disabled Tests!! [PTD BE DAY]Burn Down Distributed Disabled Tests!! Aug 7, 2024
@wz337 wz337 self-assigned this Aug 7, 2024
@mikaylagawarecki mikaylagawarecki added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 7, 2024
@awgu
Collaborator
awgu commented Aug 7, 2024

I think since #123726 is only disabled on rocm, it might be hard for us to debug. There seem to be some distributed (mainly FSDP) tests disabled on rocm, possibly due to stream overlapping issues. Maybe we can just leave that one be.

@awgu awgu added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 7, 2024
@wz337
Contributor Author
wz337 commented Aug 12, 2024

> I think since #123726 is only disabled on rocm, it might be hard for us to debug. There seem to be some distributed (mainly FSDP) tests disabled on rocm, possibly due to stream overlapping issues. Maybe we can just leave that one be.

Oh, is it OK for me to add a skipIfRocm so we can close it? #132975
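For anyone unfamiliar with the idea, here is a rough, stdlib-only sketch of what a skip-on-ROCm decorator does. It is not PyTorch's actual helper: `skip_if_rocm`, `running_on_rocm`, and `ExampleFSDPTest` are hypothetical names, and detecting ROCm through the `PYTORCH_TEST_WITH_ROCM` environment variable is an assumption made for illustration.

```python
import functools
import os
import unittest


def running_on_rocm():
    # Assumption: ROCm CI jobs export PYTORCH_TEST_WITH_ROCM=1.
    return os.getenv("PYTORCH_TEST_WITH_ROCM", "0") == "1"


def skip_if_rocm(reason="test is disabled on ROCm"):
    """Hypothetical stand-in for a skip-on-ROCm test decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Checked at call time, so the test is reported as skipped
            # (not failed) only when the run targets ROCm.
            if running_on_rocm():
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleFSDPTest(unittest.TestCase):  # hypothetical test class
    @skip_if_rocm("only disabled on ROCm, see #123726")
    def test_overlap(self):
        self.assertEqual(1 + 1, 2)
```

With something like this in place, the test keeps running on every other platform, and the ROCm-only disable issue can be closed while the skip reason still points back to it.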

3 participants