8000 Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly by iupaikov-amd · Pull Request #153245 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly #153245

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

iupaikov-amd
Copy link
Contributor
@iupaikov-amd iupaikov-amd commented May 9, 2025

Fixes #153239

Replaced custom decorator with the common one. Although the better way to skip the whole suite would be to add it to skip list in run_test.py

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Copy link
pytorch-bot bot commented May 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153245

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3e9a54f with merge base 86c6f71 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Collaborator
@guangyey guangyey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update.

@guangyey guangyey requested a review from EikanWang May 11, 2025 16:19
@guangyey
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased iupaikov_test_decompose_mem_bound_skip_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_test_decompose_mem_bound_skip_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_test_decompose_mem_bound_skip_fix_upstream branch from 3a84b04 to e40b30c Compare May 11, 2025 16:57
@guangyey
Copy link
Collaborator

I’m concerned that some test cases in TestDecomposeMemMM may fail.

@EikanWang
Copy link
Collaborator

@leslie-fang-intel , I'm not sure if the CPU test case should be part of inductor/test_decompose_mem_bound_mm.py::TestDecomposeMemMM::test_decompose_mm_cpu_m_1_k_64_n_32_should_decompose_False. Could you help double check?

Copy link
Collaborator
@leslie-fang-intel leslie-fang-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to audit the code for other places where skipIfXpu might also be used to decorate the class?

@leslie-fang-intel
Copy link
Collaborator
leslie-fang-intel commented May 12, 2025

@leslie-fang-intel , I'm not sure if the CPU test case should be part of inductor/test_decompose_mem_bound_mm.py::TestDecomposeMemMM::test_decompose_mm_cpu_m_1_k_64_n_32_should_decompose_False. Could you help double check?

Thanks @EikanWang, checked the test case test_decompose_bmm_cpu:

def test_decompose_bmm_cpu(self, b, m, n, k, should_decompose):
after fixing the issue with this PR, and I found it will be triggered on cpu when both cuda and cpu are available in the system. I feel we do need to add more mm or bmm decomposed related test case since I remember the heuristic on cpu and cuda is different. The failure of test_decompose_mm_cpu_m_1_k_64_n_32_should_decompose_False might relate to the heuristic update, will take a look.

@leslie-fang-intel
Copy link
Collaborator

Check current heuristic of mm decomposition in

check_device(mat1, mat2, device="cpu")
and statically_known_true(mat1.shape[0] == 1)
and statically_known_true(mat2.shape[0] <= 128)
and statically_known_true(mat2.shape[1] <= 512)
, for this test case m_1_k_64_n_32, it should be decomposed. I think we can update the test case.

@iupaikov-amd
Copy link
Contributor Author

Do we need to audit the code for other places where skipIfXpu might also be used to decorate the class?

Yes, I haven't done a sweep of the whole repo, but if the same decorator is used on another test class then it should be fixed. I only caught this by accident while working on older release branch because the tests don't even show as skipped if they are broken in such way.

I’m concerned that some test cases in TestDecomposeMemMM may fail.

Yeah, they are going to probably fail for ROCm stack too, but we would like to be able to see the failures in order to fix them. I will have another PR which addresses the problem for our cases, this one only exist to bring attention to this issue.

I propose to merge the PR "as is" and create separate issues for fixing any errors with underlying tests to not bloat the scope of this one.

@EikanWang
Copy link
Collaborator

I have another PR #151420 to ensure that skipIfXPU does not disable test classes for other devices.

@iupaikov-amd , regarding this PR, I suggest disabling this case for CPU first and submitting an issue to track the issue.

@iupaikov-amd
Copy link
Contributor Author

I have another PR #151420 to ensure that skipIfXPU does not disable test classes for other devices.

@iupaikov-amd , regarding this PR, I suggest disabling this case for CPU first and submitting an issue to track the issue.

There is no point in merging mine then if your PR fixes the underlying issue with this decorator. Let's proceed with any solution, I don't have a preference.

@leslie-fang-intel
Copy link
Collaborator

@iupaikov-amd , regarding this PR, I suggest disabling this case for CPU first and submitting an issue to track the issue.

Sure, please assign this issue to me and I will take further look.

@EikanWang
Copy link
Collaborator

@leslie-fang-intel , I have not created an issue yet. Could you help create the issue directly and then assign it to yourselves? After that, pls. disable this test case.

@leslie-fang-intel
Copy link
Collaborator

Sure, #153585 created to track the failure of inductor/test_decompose_mem_bound_mm.py::TestDecomposeMemMM::test_decompose_mm_cpu_m_1_k_64_n_32_should_decompose_False.

Hi @iupaikov-amd would you like to disable this UT in this PR or I can also create another PR to disable it? Please kindly let me know your preference. Then after the landing of this PR, I will fix and re-enable this UT.

@EikanWang
Copy link
Collaborator

@leslie-fang-intel , I just disabled this case and created an issue - #153616.

@iupaikov-amd
Copy link
Contributor Author

Sure, #153585 created to track the failure of inductor/test_decompose_mem_bound_mm.py::TestDecomposeMemMM::test_decompose_mm_cpu_m_1_k_64_n_32_should_decompose_False.

Hi @iupaikov-amd would you like to disable this UT in this PR or I can also create another PR to disable it? Please kindly let me know your preference. Then after the landing of this PR, I will fix and re-enable this UT.

Let's merge this and you can disable it in the other one. Thanks for the quick response!

@iupaikov-amd
Copy link
Contributor Author

@pytorchbot rebase

1 similar comment
@jataylo
Copy link
Collaborator
jataylo commented May 15, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased iupaikov_test_decompose_mem_bound_skip_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_test_decompose_mem_bound_skip_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_test_decompose_mem_bound_skip_fix_upstream branch from e40b30c to 4693026 Compare May 15, 2025 14:37
@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Tried to rebase and push PR #153245, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@EikanWang EikanWang added the keep-going Don't stop on first failure, keep running tests until the end label May 16, 2025
@EikanWang
Copy link
Collaborator

I just labeled this PR with keep-going to expose all failures.

@iupaikov-amd
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased iupaikov_test_decompose_mem_bound_skip_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_test_decompose_mem_bound_skip_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_test_decompose_mem_bound_skip_fix_upstream branch from 4693026 to 327d9f2 Compare May 16, 2025 09:36
@jataylo jataylo added ciflow/rocm Trigger "default" config CI on ROCm ciflow/xpu Run XPU CI tasks labels May 16, 2025
@jeffdaily
Copy link
Collaborator

@pytorchbot rebase

@iupaikov-amd
Copy link
Contributor Author

Skipped failing tests which will need to be looked at later:

#153735
#153736

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased iupaikov_test_decompose_mem_bound_skip_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_test_decompose_mem_bound_skip_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_test_decompose_mem_bound_skip_fix_upstream branch from 327d9f2 to 3e9a54f Compare May 16, 2025 16:11
@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/xpu Run XPU CI tasks labels May 16, 2025
@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/xpu Run XPU CI tasks labels May 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/rocm Trigger "default" config CI on ROCm ciflow/xpu Run XPU CI tasks keep-going Don't stop on first failure, keep running tests until the end module: inductor open source topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XPU skip breaks the test_decompose_mem_bound_mm.py test suite
8 participants
0