[torch][ao] Do not crash numerics debugger if the shape of the tensors do not match by dulinriley · Pull Request #149330 · pytorch/pytorch · GitHub

[torch][ao] Do not crash numerics debugger if the shape of the tensors do not match #149330


Open · wants to merge 1 commit into main

Conversation

dulinriley
Contributor
@dulinriley dulinriley commented Mar 17, 2025

Summary: Occasionally we see the loss function crash because the shapes of the `a` and `b` tensors are different. This diff avoids crashing in such scenarios and lets the comparison work for the other nodes where the shapes do match.

Test Plan: - CI

Reviewed By: jerryzh168
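The guard described in the summary can be sketched roughly as follows. This is an illustrative sketch, not the actual torch.ao numeric debugger code: the helper name `maybe_compare` and the cosine-similarity metric are assumptions for the example.

```python
import logging

import torch
import torch.nn.functional as F

logger = logging.getLogger(__name__)


def maybe_compare(a, b):
    """Return a similarity score for two tensors, or None when the pair
    cannot be compared (different types or different shapes).

    Illustrative sketch only; not the actual PyTorch internals.
    """
    if (
        not isinstance(a, torch.Tensor)
        or not isinstance(b, torch.Tensor)
        or a.shape != b.shape
    ):
        # Log only the shapes (not the full tensors, which would produce
        # a very large message) and skip, so comparisons for other nodes
        # with matching shapes can still run.
        logger.warning(
            "Cannot compare tensors with different shapes: actual=%s vs ref=%s",
            getattr(a, "shape", type(a)),
            getattr(b, "shape", type(b)),
        )
        return None
    # Cosine similarity of the flattened tensors, as one example metric.
    return F.cosine_similarity(
        a.flatten().float(), b.flatten().float(), dim=0
    ).item()
```

With this shape guard in place, a mis-placed debug handle produces a warning and a `None` result for that one node instead of an exception that aborts the whole comparison pass.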

pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149330

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit c755508 with merge base 5e9f792 (image):

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66245053

@dulinriley dulinriley changed the title [torch][ao] Do no crash numerics debugger if the shape of the tensors do not match [torch][ao] Do not crash numerics debugger if the shape of the tensors do not match Mar 17, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 17, 2025
@dulinriley
Contributor Author

This is a redo of #141257 but with a different author who has permissions

dulinriley added a commit to dulinriley/pytorch that referenced this pull request Mar 17, 2025
…s do not match (pytorch#149330)

Summary:

Occasionally we see the loss function crash because the shapes of the `a` and `b` tensors are different. This diff avoids crashing in such scenarios and lets the comparison work for the other nodes where the shapes do match.

Test Plan: - CI

Reviewed By: jerryzh168

Differential Revision: D66245053
@dulinriley
Contributor Author

@pytorchbot merge

pytorch-bot bot commented Mar 17, 2025

This PR needs to be approved by an authorized maintainer before merge.

@dulinriley dulinriley self-assigned this Mar 17, 2025
@dulinriley dulinriley requested a review from jerryzh168 March 17, 2025 22:23
@dulinriley
Contributor Author

@jerryzh168 can you stamp this? I need an authorized reviewer

Contributor
@jerryzh168 jerryzh168 left a comment


what are those cases? can you add a test btw?

@dulinriley
Contributor Author

> what are those cases? can you add a test btw?

This wasn't my diff; I commandeered it from someone who didn't have permissions to land, and it was previously accepted internally.

This was mostly to avoid a very large exception message caused by printing full tensors instead of their shapes.

@dulinriley
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@jerryzh168
Contributor

> what are those cases? can you add a test btw?

> This wasn't my diff; I commandeered it from someone who didn't have permissions to land, and it was previously accepted internally.

> This was mostly to avoid a very large exception message caused by printing full tensors instead of their shapes.

I see; then should there be a raise Exception under the if statement?

```python
"Cannot compare tensors with different shapes: actual=%s vs ref=%s",
a.shape,
b.shape,
)
```
Contributor


Should we raise an exception here instead of ignoring the error?

Contributor Author
@dulinriley dulinriley Mar 18, 2025


I don't think so, because this type of error would typically indicate the debug handle was placed on the wrong nodes to compare before and after. Debug handles should always produce tensors of the same shape.

But debug handle errors should not prevent reporting of other accuracy errors, which an exception would do
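The rationale above can be illustrated with a small hypothetical report loop (the names `build_report` and the shape tuples are illustrative, not the PyTorch API): raising on the first shape mismatch would hide the accuracy results for every later node, while skipping only drops the one bad pair.

```python
def build_report(pairs):
    """Hypothetical sketch: given (node_name, actual_shape, ref_shape)
    tuples, skip shape-mismatched pairs instead of raising, so the
    remaining nodes still get compared."""
    report = {}
    for name, a_shape, b_shape in pairs:
        if a_shape != b_shape:
            # A mismatch here usually means the debug handle was placed
            # on the wrong node; skip it rather than abort the report.
            continue
        report[name] = "compared"
    return report


pairs = [
    ("conv1", (1, 8, 4, 4), (1, 8, 4, 4)),
    ("bad_handle", (1, 8, 4, 4), (1, 16, 2, 2)),  # mis-placed debug handle
    ("linear", (1, 10), (1, 10)),
]
# "bad_handle" is skipped; "conv1" and "linear" are still reported.
```

Raising inside the loop would instead stop at "bad_handle" and leave "linear" unreported.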

Comment on lines +340 to +342
```python
if not isinstance(a, torch.Tensor)
or not isinstance(b, torch.Tensor)
or a.shape == b.shape
```
Contributor


Also, with this some of the mismatches will be ignored; that's probably not what we want, I think.

Contributor Author


We want to ignore mismatches if they were comparing things of different types, or when the tensors have different shapes, because at that point it's a problem with the debug handles.

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 17, 2025