Flight Recorder dump_entries() segfaults when used with coalesced operations · Issue #119758 · pytorch/pytorch · GitHub

Flight Recorder dump_entries() segfaults when used with coalesced operations #119758


Open
wconstab opened this issue Feb 13, 2024 · 14 comments
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@wconstab
Contributor
wconstab commented Feb 13, 2024

First observed when using batched_isend_irecv with send/recv P2POps in PipelineParallel on the whc/pp branch (torchtrain). Isolated a minimal repro (#119757).

I suspect this issue applies to all collectives/point-to-point ops when the following conditions are true:

  1. We record the work into the flight recorder, which includes refs to CUDA events
    (always true, more or less)

  2. We don't call workEnqueue because coalescing is active, or something similar
    (I don't fully understand this logic)

  if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
    workEnqueue(work);
  } else {
    at::cuda::CUDAGraph::dec_pending_event_queries();
  }

My hypothesis is that we are losing the work object (it's getting destructed) in the path where we don't put it in workMetaList. I would be surprised by this, because I assume something has to keep it alive for coalescing.
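
A minimal, standalone sketch of the lifetime hazard hypothesized above (the types and names here are simplified stand-ins, not the actual ProcessGroupNCCL or flight recorder code): if the recorder keeps non-owning references to a work's events and the work is never enqueued into workMetaList, the events die with the work and a later dump reads dangling pointers.

  #include <iostream>
  #include <memory>
  #include <vector>

  // Stand-ins for the real types; names are illustrative only.
  struct Event { int id = 0; };  // wraps a CUDA event in the real code

  struct Work {
    std::shared_ptr<Event> start_event = std::make_shared<Event>();
    std::shared_ptr<Event> end_event = std::make_shared<Event>();
  };

  // The recorder entry "steals" raw pointers without taking ownership.
  struct Entry {
    Event* start;
    Event* end;
  };

  int main() {
    std::vector<Entry> recorder;
    {
      Work w;                                   // created in pointToPoint
      recorder.push_back({w.start_event.get(),  // recorded into the flight recorder
                          w.end_event.get()});
      // Coalescing is active, so the work is never put into workMetaList...
    }                                           // ...and it is destructed here.
    // A later dump_entries() dereferences dangling pointers: UB / crash.
    std::cout << recorder[0].start->id << "\n";
  }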

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @shuqiangzhang

@jbschlosser added the oncall: distributed label Feb 13, 2024
@kwen2501
Contributor

IIRC, the work object would be eventually enqueued in endCoalescing, as one whole work.

@wconstab
Contributor Author
wconstab commented Feb 13, 2024

endCoalescing actually creates its own work object and enqueues that. I'm not sure what happens to the original work created inside the pointToPoint call?

I added log messages inside pointToPoint() and endCoalescing() on the two if/else branches where it either enqueues the work or decrements the pending CUDA event queries.

This is an excerpt of my log from the unit test. (It continues longer with the same pattern; I never see a work getting enqueued.)

[rank0]:[W ProcessGroupNCCL.cpp:2674] pointToPoint created a work                                                                  
[rank0]:[W ProcessGroupNCCL.cpp:2800] pointToPoint skipped enqueueing a work                                                       
[rank0]:[W ProcessGroupNCCL.cpp:2387] endCoalescing created a new work                                                             
[rank0]:[W ProcessGroupNCCL.cpp:2423] endCoalescing didn't enqueue its work    

@shuqiangzhang
Contributor
shuqiangzhang commented Feb 13, 2024

Check the comment in endCoalescing:

// TODO: it seems we never enqueue work for single send/recv or batch P2P,

It only enqueues if (coalescing_state_ & CoalColl) is true.

@wconstab
Contributor Author

It only enqueues if (coalescing_state_ & CoalColl) is true.

I think that's only bug 1 of 2.

Bug 2 is that we create two separate works. For the first work, we log it to the flight recorder but then let it expire. The flight recorder steals references to the events and expects the work to stay alive. That's bad.

@kwen2501
Contributor

I'm not sure what happens to the original work created inside the pointToPoint call?

We do not create a work object for each send/recv op inside a batch_isend_irecv and enqueue them; the whole batch is treated as one work.

@wconstab
Contributor Author

According to my logs, for my case we do:

[rank0]:[W ProcessGroupNCCL.cpp:2674] pointToPoint created a work                                                                  
[rank0]:[W ProcessGroupNCCL.cpp:2800] pointToPoint skipped enqueueing a work                                                       
[rank0]:[W ProcessGroupNCCL.cpp:2387] endCoalescing created a new work                                                             
[rank0]:[W ProcessGroupNCCL.cpp:2423] endCoalescing didn't enqueue its work

@kwen2501
Contributor
kwen2501 commented Feb 13, 2024

I think this line can/should be removed from endCoalescing:

if ((coalescing_state_ & CoalColl) &&

It may have been wrongly copy-pasted from somewhere else.

@wconstab
Contributor Author

Could you state the design intent for how batch_isend_irecv is supposed to work? (Should it even enter the pointToPoint function? If so, what should/shouldn't happen at that time, and then what should happen in endCoalescing?)

@kwen2501
Contributor
kwen2501 commented Feb 13, 2024

Should it even enter the pointToPoint function?

I am not a big fan of this helper, to be honest. In #119421, I propose that both the collective() helper and this one should be gone, and that we stop using the lambda style.

But let's say we enter this helper; what happens?

For a batch_isend_irecv call, this helper makes a call to ncclSend or ncclRecv for the respective ops inside the batch. But ncclSend or ncclRecv would not actually enqueue the CUDA kernel, so no CUDA event recording is valid here. That's why we don't create a Work in the pointToPoint helper.

So when can we do the recording? After ncclGroupEnd is called.
That's why the recording should be done in endCoalescing.
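
To make the sequencing concrete, here is an illustrative sketch (not the actual ProcessGroupNCCL code; the helper name is made up, and it assumes an already initialized communicator, stream, buffers, and event) of why per-op event recording inside the group captures nothing:

  #include <cuda_runtime.h>
  #include <nccl.h>

  // Between ncclGroupStart() and ncclGroupEnd(), ncclSend/ncclRecv only queue
  // the ops; the fused kernel is launched on `stream` when the group ends, so
  // a CUDA event recorded per op inside the group would observe nothing useful.
  void batchedSendRecv(float* send_buf, float* recv_buf, size_t count,
                       int send_peer, int recv_peer, ncclComm_t comm,
                       cudaStream_t stream, cudaEvent_t end_event) {
    ncclGroupStart();
    ncclSend(send_buf, count, ncclFloat, send_peer, comm, stream);  // queued only
    ncclRecv(recv_buf, count, ncclFloat, recv_peer, comm, stream);  // queued only
    ncclGroupEnd();                      // the batched work is enqueued here
    cudaEventRecord(end_event, stream);  // meaningful only after the group ends
  }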

@wconstab
Contributor Author

We currently call initWork from inside pointToPoint, and initWork always records to the flight recorder.

It's a convenient invariant that we record and enqueue every work we create. (If we have to have exceptions to this rule, then we have to be more careful not to miss any works in the recording.)

Do we actually need to create the work object in pointToPoint, or can we kill that path off? (Maybe we do have to create it when we are not coalescing, but don't want to create it when we are? In that case it can get complicated.)

@kwen2501
Contributor
kwen2501 commented Feb 14, 2024

Agree. Either would work:

  • Skip initWork in pointToPoint, or
  • Move flight recorder's recording to workEnqueue.

Solution 1 may require slightly more work than solution 2.
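
A minimal sketch of what option 2 might look like, using simplified stand-in types (not the actual ProcessGroupNCCL or flight recorder implementation):

  #include <deque>
  #include <memory>
  #include <vector>

  // Simplified stand-ins; names are illustrative, not the real PyTorch types.
  struct Event {};  // wraps a CUDA event in the real code

  struct Work {
    std::shared_ptr<Event> start_event = std::make_shared<Event>();
    std::shared_ptr<Event> end_event = std::make_shared<Event>();
  };

  struct FlightRecorder {
    // Non-owning references, mirroring the hazard described in this issue.
    struct Entry { const Event* start; const Event* end; };
    std::vector<Entry> entries;
    void record(const Work& w) {
      entries.push_back({w.start_event.get(), w.end_event.get()});
    }
  };

  struct ProcessGroup {
    FlightRecorder recorder;
    std::deque<std::shared_ptr<Work>> workMetaList;

    // Option 2: record at enqueue time, so every recorded entry corresponds to
    // a work that is kept alive in workMetaList.
    void workEnqueue(std::shared_ptr<Work> w) {
      recorder.record(*w);
      workMetaList.push_back(std::move(w));
    }
  };

Because recording happens only at enqueue time, every recorded entry has an owner in workMetaList while the recorder may still need its events; this matches the direction of the commit referenced below (record every work we enqueue, rather than every work we create).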

@kwen2501
Contributor

Actually, what would work best depends on what the flight recorder wants to record: does it want to record individual send/recv sizes, or would it suffice to record a "lump sum"? From a debugging perspective, individual send/recv sizes may be more informative. But then you would need to decouple the flight recorder from Work.

@wconstab
Contributor Author

I think for the flight recorder it is best to record something close to what executes as one 'kernel'. For individual send/recv sizes, we can add that metadata into the description of that particular kernel. (We could further abuse profilingTitle, or come up with a better way.)

But then I think we have to decouple the flight recorder from initWork. However, if 'the right thing' is to not create a work inside pointToPoint, then we don't have to decouple the flight recorder from initWork, and we can just do the harder but more correct thing of not creating a pointless work there.

@shuqiangzhang
Contributor
shuqiangzhang commented Feb 14, 2024

IMHO, it is better to always stay consistent with the one-to-one mapping of (seq_ <-> flight recorder entry <-> enqueued work).

wconstab added a commit that referenced this issue Feb 23, 2024
…ecv dump_entries combo"

RE #119758

Fixes the above issue by:

// Record every work that we enqueue, rather than every work we create.
// - we do not currently enqueue every created work, see coalescing in pointToPoint
// - but it is UNSAFE to steal start/end event refs from works that may go out of scope,
//   and enqueueing in workMetaList is the mechanism by which we ensure they stay in scope
//   long enough for flight recorder to finish using them
