Flight Recorder dump_entries() segfaults when used with coalesced operations · Issue #119758 · pytorch/pytorch · GitHub

Flight Recorder dump_entries() segfaults when used with coalesced operations #119758


Open
wconstab opened this issue Feb 13, 2024 · 14 comments
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@wconstab
Contributor
wconstab commented Feb 13, 2024

First observed when using batched_isend_irecv with send/recv P2POps in PipelineParallel on the whc/pp branch (torchtrain). Isolated a minimal repro (#119757).

I suspect this issue applies to all collectives/point-to-point ops when the following conditions are true:

  1. We record the work into the flight recorder, which includes refs to CUDA events
    (always true, more or less)

  2. We don't call workEnqueue because coalescing is active, or something similar
    (I don't fully understand this logic)

  if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
    workEnqueue(work);
  } else {
    at::cuda::CUDAGraph::dec_pending_event_queries();
  }

My hypothesis is that we are losing the work object (it's getting destructed) in the path where we don't put it in workMetaList. I would be surprised by this, because I assume something has to keep it alive for coalescing.
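
A minimal, standalone sketch of the lifetime hazard hypothesized above (the types and names here are simplified stand-ins, not the actual ProcessGroupNCCL or flight recorder code): if the recorder keeps non-owning references to a work's events and the work is never enqueued into workMetaList, the events die with the work and a later dump reads dangling pointers.

  #include <iostream>
  #include <memory>
  #include <vector>

  // Stand-ins for the real types; names are illustrative only.
  struct Event { int id = 0; };  // wraps a CUDA event in the real code

  struct Work {
    std::shared_ptr<Event> start_event = std::make_shared<Event>();
    std::shared_ptr<Event> end_event = std::make_shared<Event>();
  };

  // The recorder entry "steals" raw pointers without taking ownership.
  struct Entry {
    Event* start;
    Event* end;
  };

  int main() {
    std::vector<Entry> recorder;
    {
      Work w;                                   // created in pointToPoint
      recorder.push_back({w.start_event.get(),  // recorded into the flight recorder
                          w.end_event.get()});
      // Coalescing is active, so the work is never put into workMetaList...
    }                                           // ...and it is destructed here.
    // A later dump_entries() dereferences dangling pointers: UB / crash.
    std::cout << recorder[0].start->id << "\n";
  }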

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @shuqiangzhang

@jbschlosser added the oncall: distributed label Feb 13, 2024
@kwen2501
Contributor

IIRC, the work object would be eventually enqueued in endCoalescing, as one whole work.

@wconstab
Contributor Author
wconstab commented Feb 13, 2024

endCoalescing actually creates its own work object and enqueues that. I'm not sure what happens to the original work created inside the pointToPoint call?

I added log messages inside pointToPoint() and endCoalescing() on the two if/else branches where it either enqueues the work or decrements the pending CUDA event queries.

This is an excerpt of my log from the unit test. (It continues longer with the same pattern; I never see a work getting enqueued.)

[rank0]:[W ProcessGroupNCCL.cpp:2674] pointToPoint created a work                                                                  
[rank0]:[W ProcessGroupNCCL.cpp:2800] pointToPoint skipped enqueueing a work                                                       
[rank0]:[W ProcessGroupNCCL.cpp:2387] endCoalescing created a new work                                                             
[rank0]:[W ProcessGroupNCCL.cpp:2423] endCoalescing didn't enqueue its work    

@shuqiangzhang
Contributor
shuqiangzhang commented Feb 13, 2024

Check the comment in endCoalescing:

// TODO: it seems we never enqueue work for single send/recv or batch P2P,

It only enqueues if (coalescing_state_ & CoalColl) is true.

@wconstab
Contributor Author

It only enqueues if (coalescing_state_ & CoalColl) is true.

I think that's only bug 1 of 2.

Bug 2 is that we create two separate works. For the first work, we log it to the flight recorder but then let it expire. The flight recorder steals references to the events and expects the work to stay alive. That's bad.

@kwen2501
Contributor

I'm not sure what happens to the original work created inside the pointToPoint call?

We do not create a work object for each send/recv op inside a batch_isend_irecv and enqueue them; the whole batch is treated as one work.

@wconstab
Contributor Author

According to my logs, for my case we do:

[rank0]:[W ProcessGroupNCCL.cpp:2674] pointToPoint created a work                                                                  
[rank0]:[W ProcessGroupNCCL.cpp:2800] pointToPoint skipped enqueueing a work                                                       
[rank0]:[W ProcessGroupNCCL.cpp:2387] endCoalescing created a new work                                                             
[rank0]:[W ProcessGroupNCCL.cpp:2423] endCoalescing didn't enqueue its work

@kwen2501
Contributor
kwen2501 commented Feb 13, 2024

I think this line can/should be removed from endCoalescing:

if ((coalescing_state_ & CoalColl) &&

It may have been wrongly copy-pasted from somewhere else.

@wconstab
Contributor Author

Could you state the design intent for how batch_isend_irecv is supposed to work? (Should it even enter the pointToPoint function? If so, what should/shouldn't happen at that time, and then what should happen in endCoalescing?)

@kwen2501
Contributor
kwen2501 commented Feb 13, 2024

Should it even enter the pointToPoint function?

I am not a big fan of this helper, to be honest. In #119421, I propose that both the collective() helper and this one should be gone, and that we stop using the lambda style.

But let's say we enter this helper; what happens?

For a batch_isend_irecv call, this helper makes a call to ncclSend or ncclRecv for the respective ops inside the batch. But ncclSend or ncclRecv would not actually enqueue the CUDA kernel, so no CUDA event recording is valid here. That's why we don't create a Work in the pointToPoint helper.

So when can we do the recording? After ncclGroupEnd is called.
That's why the recording should be done in endCoalescing.
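
To make the sequencing concrete, here is an illustrative sketch (not the actual ProcessGroupNCCL code; the helper name is made up, and it assumes an already initialized communicator, stream, buffers, and event) of why per-op event recording inside the group captures nothing:

  #include <cuda_runtime.h>
  #include <nccl.h>

  // Between ncclGroupStart() and ncclGroupEnd(), ncclSend/ncclRecv only queue
  // the ops; the fused kernel is launched on `stream` when the group ends, so
  // a CUDA event recorded per op inside the group would observe nothing useful.
  void batchedSendRecv(float* send_buf, float* recv_buf, size_t count,
                       int send_peer, int recv_peer, ncclComm_t comm,
                       cudaStream_t stream, cudaEvent_t end_event) {
    ncclGroupStart();
    ncclSend(send_buf, count, ncclFloat, send_peer, comm, stream);  // queued only
    ncclRecv(recv_buf, count, ncclFloat, recv_peer, comm, stream);  // queued only
    ncclGroupEnd();                      // the batched work is enqueued here
    cudaEventRecord(end_event, stream);  // meaningful only after the group ends
  }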

@wconstab
Contributor Author

We currently call initWork from inside pointToPoint, and initWork always records to the flight recorder.

It's a convenient invariant that we record and enqueue every work we create. (If we have to have exceptions to this rule, then we have to be more careful not to miss any works in the recording.)

Do we actually need to create the work object in pointToPoint, or can we kill that path off? (Maybe we do have to create it when we are not coalescing, but don't want to create it when we are? In that case it can get complicated.)

@kwen2501
Contributor
kwen2501 commented Feb 14, 2024

Agree. Either would work:

  • Skip initWork in pointToPoint, or
  • Move flight recorder's recording to workEnqueue.

Solution 1 may require slightly more work than solution 2.
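
A minimal sketch of what option 2 might look like, using simplified stand-in types (not the actual ProcessGroupNCCL or flight recorder implementation):

  #include <deque>
  #include <memory>
  #include <vector>

  // Simplified stand-ins; names are illustrative, not the real PyTorch types.
  struct Event {};  // wraps a CUDA event in the real code

  struct Work {
    std::shared_ptr<Event> start_event = std::make_shared<Event>();
    std::shared_ptr<Event> end_event = std::make_shared<Event>();
  };

  struct FlightRecorder {
    // Non-owning references, mirroring the hazard described in this issue.
    struct Entry { const Event* start; const Event* end; };
    std::vector<Entry> entries;
    void record(const Work& w) {
      entries.push_back({w.start_event.get(), w.end_event.get()});
    }
  };

  struct ProcessGroup {
    FlightRecorder recorder;
    std::deque<std::shared_ptr<Work>> workMetaList;

    // Option 2: record at enqueue time, so every recorded entry corresponds to
    // a work that is kept alive in workMetaList.
    void workEnqueue(std::shared_ptr<Work> w) {
      recorder.record(*w);
      workMetaList.push_back(std::move(w));
    }
  };

Because recording happens only at enqueue time, every recorded entry has an owner in workMetaList while the recorder may still need its events; this matches the direction of the commit referenced below (record every work we enqueue, rather than every work we create).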

@kwen2501
Contributor

Actually, what would work best depends on what the flight recorder wants to record: does it want to record individual send/recv sizes, or would it suffice to record a "lump sum"? From a debugging perspective, individual send/recv sizes may be more informative. But then you would need to decouple the flight recorder from Work.

@wconstab
Contributor Author

I think for the flight recorder it is best to record something close to what executes as one 'kernel'. For individual send/recv sizes, we can add that metadata into the description of that particular kernel. (We could further abuse profilingTitle, or come up with a better way.)

But then I think we have to decouple the flight recorder from initWork. However, if 'the right thing' is to not create a work inside pointToPoint, then we don't have to decouple the flight recorder from initWork, and we can just do the harder but more correct thing of not creating a pointless work there.

@shuqiangzhang
Contributor
shuqiangzhang commented Feb 14, 2024

IMHO, it is better to always stay consistent with the one-to-one mapping of (seq_ <-> flight recorder entry <-> enqueued work).

wconstab added a commit that referenced this issue Feb 23, 2024
…ecv dump_entries combo"

RE #119758

Fixes the above issue by:

// Record every work that we enqueue, rather than every work we create.
// - we do not currently enqueue every created work, see coalescing in pointToPoint
// - but it is UNSAFE to steal start/end event refs from works that may go out of scope,
//   and enqueueing in workMetaList is the mechanism by which we ensure they stay in scope
//   long enough for flight recorder to finish using them
