-
Notifications
You must be signed in to change notification settings - Fork 24.3k
Flight Recorder dump_entries() segfaults when used with coalesced operations #119758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
IIRC, the work object would be eventually enqueued in |
endCoalescing actually creates its own work object and enqueues that. I'm not sure what happens to the original work created inside the PointToPoint call? I added log messages inside pointToPoint() and endCoalescing on the 2 if/else branches where it either enqueue's work or does some cuda stream decr. This is an excerpt of my log from the unit test. (it continues longer, with the same pattern, never see work getting enqueued)
|
Check the comment in endCoalescing:
it only enqueues for if (coalescing_state_ & CoalColl) is true |
i think that's only bug 1 of 2. bug 2 is that we create 2 separate works. For the first work, we log it to the flight recorder, but then let it expire. Flight recorder steals references to the events and expects the work to stay alive. thats bad. |
We do not create work object for each send/recv op inside a batch_isend_irecv and enqueue them. The whole batch is treated as one work. |
According to my logs for my case, we do?
|
I think this line can be / should be removed from
It may be wrongly copy&paste from somewhere else. |
could you state the design intent for how batch_isend_irecv is supposed to work? (should it even enter pointToPoint function? if so, what should / shouldn't happen at that time, and then what should happen in endCoalescing?) |
I am not a big fan of this helper tbh. In #119421, I propose that both the But let say we enter this helper, what happens? To a So when can we do the recording? After |
we currently call it's a convenient invariant that we record and enqueue every work we create. (if we have to have exceptions to this rule, then we have to be more careful to not miss any works from recording). Do we actually need to create the work obj in pointToPoint? or can we kill that path off? (maybe we do have to create it in the case we do not use coalescing, but we don't want to create it if we do use coalescing? in that case it can be complicated..) |
Agree. Either would work:
Solution 1 may require slightly more work than solution 2. |
Actually, what would work best depends on what flight recorder wants to record: |
I think for the flight recorder it is best to record something close to what executes as one 'kernel'. For individual send/recv sizes, we can add that metadata all into the description of that particular kernel. (we could further abuse profilingTitle, or come up with a better way). But we have to decouple flight recorder from 'initWork' then i think. However, if 'the right thing' is to not create a work inside pointToPoint, then we don't have to decouple flight recorder from initWork and we can just do the harder but more correct thing of not creating a pointless work there. |
IMHO, better to be always consistent with the one to one mapping of (seq_ <-> flight recorder entry <->work enqueued) |
… combo" RE #119758 cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
…ecv dump_entries combo" RE #119758 cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
…ecv dump_entries combo" RE #119758 cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
… combo" RE #119758 cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
…ecv dump_entries combo" RE #119758 Fixes the above issue by // Record every work that we enqueue, rather than every work we create. // - we do not currently enqueue every created work, see coalescing in pointToPoint // - but it is UNSAFE to steal start/end event refs from works that may go out of scope, // and enqueueing in workMetaList is the mechanism by which we ensure they stay in scope // long enough for flight recorder to finish using them cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
… combo" RE #119758 Fixes the above issue by // Record every work that we enqueue, rather than every work we create. // - we do not currently enqueue every created work, see coalescing in pointToPoint // - but it is UNSAFE to steal start/end event refs from works that may go out of scope, // and enqueueing in workMetaList is the mechanism by which we ensure they stay in scope // long enough for flight recorder to finish using them cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225 [ghstack-poisoned]
Uh oh!
There was an error while loading. Please reload this page.
First observed when using batched_isend_irecv with send/recv P2pOps in PipelineParallel on
whc/pp
branch (torchtrain). Isolated a minimal repro (#119757)I suspect this issue applies to all collectives/point2point when the following conditions are true
We record the work into the flight recorder, which includes refs to cuda events
(always true, more or less)
We don't call workEnqueue because coalescing is active or something
(I don't understand this logic as well)
My hypothesis is that we are losing the work object (it's getting destructed) in the path we don't put it in the workMetaList. I would be surprised by this bc I assume something has to keep it alive for coalescing.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @shuqiangzhang
The text was updated successfully, but these errors were encountered: