[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

Merged

@iupaikov-amd iupaikov-amd commented Feb 5, 2025

The test fails with assertion error "Tensors are not close"

After testing, I can confirm that the failure comes from eager-mode execution specific to Navi32 during the test_sdpa_2 run. I cross-referenced Navi31, Navi32, and MI300: the AOTInductor results are identical on all three architectures, and only the eager-mode result on Navi32 is off, with a 1.5% difference in tensor values from the GPU run. I assume this is caused either by fp16-32-16 conversions in eager mode or by some missing Navi32-specific if-statements.
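To show why a 1.5% deviation would trip a "not close" assertion, here is a minimal sketch of the usual elementwise tolerance check, `|actual - expected| <= atol + rtol * |expected|`. The helper name `is_close` and the tolerance values are assumptions for illustration; the PR does not state which tolerances the test actually uses.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Elementwise closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Assumed tolerances for illustration only; not taken from the test itself.
RTOL = 1e-3
ATOL = 1e-5

expected = 1.0
off_by_1_5_percent = expected * 1.015   # a value 1.5% away from the reference
off_by_0_05_percent = expected * 1.0005  # a value well inside the tolerance

print(is_close(off_by_1_5_percent, expected, RTOL, ATOL))
print(is_close(off_by_0_05_percent, expected, RTOL, ATOL))
```

Under these assumed tolerances a 1.5% relative error is roughly fifteen times larger than the allowed relative error, so the check fails regardless of the absolute tolerance.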

Simple reproducer to check the values for cpu/gpu/eager/aoti runs.
gfx1101_test_sdpa_2_issue_reproducer.txt
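The fp16-32-16 hypothesis in the description above can be illustrated without a GPU. The sketch below is not the attached reproducer and does not claim to show the root cause; it only demonstrates that rounding to fp16 after every step can give a different result from accumulating in a wider type and rounding once. The function names are hypothetical, and Python's `float` (a double) stands in for the wider accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_fp16_accumulate(values):
    """Accumulate in fp16: the running sum is rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def sum_wide_accumulate(values):
    """Accumulate in a wider type and round to fp16 only once at the end."""
    acc = 0.0
    for v in values:
        acc += v
    return to_fp16(acc)


# 4096 copies of the fp16 value nearest to 0.1 (an arbitrary illustrative input).
values = [to_fp16(0.1)] * 4096

narrow = sum_fp16_accumulate(values)
wide = sum_wide_accumulate(values)
print("fp16 accumulation:", narrow)
print("wide accumulation:", wide)
print("relative difference:", abs(narrow - wide) / abs(wide))
```

This input is deliberately extreme so the effect is easy to see. A real kernel that differs only in where it converts between fp16 and fp32 would typically show a much smaller gap, which is the kind of discrepancy the description speculates about.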

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 5, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jataylo
jataylo commented Feb 5, 2025

@iupaikov-amd please add a PR description explaining the justification for the skip.

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 5, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@iupaikov-amd
Author

Added description

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 7, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 8, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 11, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api
rocm-repo-management-api bot commented Feb 18, 2025

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony pruthvistony merged commit 896c789 into rocm6.4_internal_testing Feb 19, 2025
9 of 13 checks passed
@pruthvistony pruthvistony deleted the iupaikov_test_sdpa_2_skip_rocm6.4 branch February 19, 2025 17:23
dnikolaev-amd pushed a commit that referenced this pull request Feb 24, 2025
…nductor for Navi32 (#1882)

The test fails with assertion error "Tensors are not close"

After testing I can confirm that this issue is caused by eager mode
execution specific to navi32 during the test_sdpa_2 run. Made a cross
reference between navi31, navi32 and mi300. AOTInductor results are all
the exact same for all of the archs, only the eager mode fails here for
navi32 with 1.5% difference in tensor values from the gpu run. I assume
that this happens due to fp16-32-16 conversions in eager mode or missing
some if-statements for navi32 specifically.

Simple reproducer to check the values for cpu/gpu/eager/aoti runs.

[gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt)
dnikolaev-amd pushed a commit that referenced this pull request Apr 17, 2025
…nductor for Navi32 (#1882)

The test fails with assertion error "Tensors are not close"

After testing I can confirm that this issue is caused by eager mode
execution specific to navi32 during the test_sdpa_2 run. Made a cross
reference between navi31, navi32 and mi300. AOTInductor results are all
the exact same for all of the archs, only the eager mode fails here for
navi32 with 1.5% difference in tensor values from the gpu run. I assume
that this happens due to fp16-32-16 conversions in eager mode or missing
some if-statements for navi32 specifically.

Simple reproducer to check the values for cpu/gpu/eager/aoti runs.

[gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt)

(cherry picked from commit 896c789)
dnikolaev-amd pushed a commit that referenced this pull request Apr 24, 2025
=================================================

Temporarily skip test_conv3d_64bit_indexing

- Rocblas API support is requested
- SWDEV-383635 & sub task - SWDEV-390218

Skip ddp apply_optim_in_bwd tests for gloo (#1302)

To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834

Add skipIfRocmArch decorator for Navi skips (#1356)

Converted NAVI check as a function (#1364)

* Moved NAVI check to the test file

* Revised NAVI check as a function

[Navi] [Inductor] Unskip Navi inductor UTs (#1514)

Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590

Bad import in test_torchinductor and skip torchvision related UT (#1374)

skip test_inductor_freezing failing UTs (#1375)

Skip test_mm_triton_kernel_benchmark (#1376)

* Running triton kernel on ROCM only has one GB/s metric reported

* Update test_kernel_benchmark.py

skip vmapvjpvjp_linalg_householder_product_cuda_float32 (#1420)

skipIfRocm needs msg parameter

[NO CP] Updated changes to skip few UTs

Imported skipIfRocm in certain test suites (#1577)

Fixes SWDEV-472397

Added functions imports (#1521)

Fixes
inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda

Enable test_public_api_surface (#1601)

Fixes SWDEV-462410.

Enable this unit test since PyTorch issue
pytorch#104012 has been closed. This
unit test runs fine on MI100/MI300 and upstream.

(cherry picked from commit 0001d4ab5070635cfecc146ee299bbb9fa45ca67)

[rocm6.3_internal_testing] Fixed error string assertion in test_invalid_devices (#1607)

Fixes pytorch#8974

(cherry picked from commit a688e0a)
(cherry picked from commit b966e44)

[rocm6.4_internal_testing] Skip non_standard_bool_values tests (#1880)

Fixes SWDEV-509757

(cherry picked from commit 80b4c41)

[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 (#1882)

The test fails with assertion error "Tensors are not close"

After testing I can confirm that this issue is caused by eager mode
execution specific to navi32 during the test_sdpa_2 run. Made a cross
reference between navi31, navi32 and mi300. AOTInductor results are all
the exact same for all of the archs, only the eager mode fails here for
navi32 with 1.5% difference in tensor values from the gpu run. I assume
that this happens due to fp16-32-16 conversions in eager mode or missing
some if-statements for navi32 specifically.

Simple reproducer to check the values for cpu/gpu/eager/aoti runs.

[gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt)

(cherry picked from commit 896c789)

Fixed rocm skip import issue (#1949)

skip_if_rocm does not exist in
torch/testing/_internal/common_distributed.py. Use skipIfRocm from
torch/testing/_internal/common_utils.py instead.

(cherry picked from commit cfb673e)

Skip certain unit tests on NAVI (#1950)

This PR is to skip certain unit tests on NAVI only.
Fixes SWDEV-509011 - test_sac_ilp.py::TestSACILP::test_sac_ilp_case1
Fixes SWDEV-509311 -
test_max_autotune.py::TestMaxAutotune::test_non_contiguous_input_addmm
Fixes SWDEV-510738
test_fsdp_sharded_grad_scaler.py::TestShardedGradScalerParityWithDDP::test_sharded_grad_scaler_found_inf

(cherry picked from commit e86291a)
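The architecture-specific skips listed in the commit message above (for example the `skipIfRocmArch` decorator and this PR's Navi32 skip) follow a common pattern. The sketch below is a standalone illustration built on `unittest` only, not the decorator from the PyTorch test utilities. The name `skip_if_arch`, its signature, and the `NAVI32_ARCHS` tuple are hypothetical; `gfx1101` is taken from the reproducer's file name.

```python
import functools
import unittest

# Hypothetical arch list; "gfx1101" comes from the reproducer file name in this PR.
NAVI32_ARCHS = ("gfx1101",)


def skip_if_arch(archs, current_arch, reason):
    """Return a decorator that skips the test when current_arch matches any entry.

    current_arch is passed in explicitly so the sketch runs without a GPU;
    a real test suite would query the device for its architecture name.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if any(arch in current_arch for arch in archs):
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleTests(unittest.TestCase):
    # Pretend the suite is running on a gfx1101 device: this test is skipped.
    @skip_if_arch(NAVI32_ARCHS, "gfx1101", "eager-mode mismatch on Navi32")
    def test_sdpa_like(self):
        self.assertTrue(True)
```

Raising `unittest.SkipTest` inside the wrapper defers the decision to run time, which suits checks that depend on the device the suite is actually running on.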
@iupaikov-amd
Author

!cherry-pick --onto release/2.6

@okakarpa
Collaborator
okakarpa commented May 6, 2025

Created branch autogenerated/release/2.6_cherry-pick_pr-1882 and #2092. It contains a merge conflict. Please resolve it

jithunnair-amd pushed a commit that referenced this pull request May 6, 2025
…pped sdpa_2 test in test_aot_inductor for Navi32 (#2092)

Cherry-pick of #1882 
Need to resolve conflicts

---------

Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>