[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

iupaikov-amd · 2025-02-05T17:28:52Z

The test fails with assertion error "Tensors are not close"

After testing I can confirm that this issue is caused by eager mode execution specific to navi32 during the test_sdpa_2 run. Made a cross reference between navi31, navi32 and mi300. AOTInductor results are all the exact same for all of the archs, only the eager mode fails here for navi32 with 1.5% difference in tensor values from the gpu run. I assume that this happens due to fp16-32-16 conversions in eager mode or missing some if-statements for navi32 specifically.

Simple reproducer to check the values for cpu/gpu/eager/aoti runs.
gfx1101_test_sdpa_2_issue_reproducer.txt

rocm-repo-management-api · 2025-02-05T17:46:12Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

jataylo · 2025-02-05T19:34:21Z

@iupaikov-amd please add PR description explaining the justification for skipping

rocm-repo-management-api · 2025-02-05T19:45:37Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

iupaikov-amd · 2025-02-06T14:54:34Z

Added description

rocm-repo-management-api · 2025-02-07T17:40:56Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api · 2025-02-08T00:10:48Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api · 2025-02-11T19:10:52Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api · 2025-02-18T18:25:40Z

Jenkins build for 5d647c36630d8d201cfe8a29820943bc4c2191a2 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

…nductor for Navi32 (#1882) The test fails with assertion error "Tensors are not close" After testing I can confirm that this issue is caused by eager mode execution specific to navi32 during the test_sdpa_2 run. Made a cross reference between navi31, navi32 and mi300. AOTInd 8000 uctor results are all the exact same for all of the archs, only the eager mode fails here for navi32 with 1.5% difference in tensor values from the gpu run. I assume that this happens due to fp16-32-16 conversions in eager mode or missing some if-statements for navi32 specifically. Simple reproducer to check the values for cpu/gpu/eager/aoti runs. [gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt)

…nductor for Navi32 (#1882) The test fails with assertion error "Tensors are not close" After testing I can confirm that this issue is caused by eager mode execution specific to navi32 during the test_sdpa_2 run. Made a cross reference between navi31, navi32 and mi300. AOTInductor results are all the exact same for all of the archs, only the eager mode fails here for navi32 with 1.5% difference in tensor values from the gpu run. I assume that this happens due to fp16-32-16 conversions in eager mode or missing some if-statements for navi32 specifically. Simple reproducer to check the values for cpu/gpu/eager/aoti runs. [gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt) (cherry picked from commit 896c789)

================================================= Temporarily skip test_conv3d_64bit_indexing - Rocblas API support is requested - SWDEV-383635 & sub task - SWDEV-390218 Skip ddp apply_optim_in_bwd tests for gloo (#1302) To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834 Add skipIfRocmArch decorator for Navi skips (#1356) Converted NAVI check as a function (#1364) * Moved NAVI check to the test file * Revised NAVI check as a function [Navi] [Inductor] Unskip Navi inductor UTs (#1514) Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590 Bad import in test_torchinductor and skip torchvision related UT (#1374) skip test_inductor_freezing failing UTs (#1375) Skip test_mm_triton_kernel_benchmark (#1376) * Running triton kernel on ROCM only has one GB/s metric reported * Update test_kernel_benchmark.py skip vmapvjpvjp_linalg_householder_product_cuda_float32 (#1420) skipIfRocm needs msg parameter [NO CP] Updated changes to skip few UTs Imported skipIfRocm in certain test suites (#1577) Fixes SWDEV-472397 Added functions imports (#1521) Fixes inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda Enable test_public_api_surface (#1601) Fixes SWDEV-462410. Enable this unit test since PyTorch issue pytorch#104012 has been closed. This unit test runs fine on MI100/MI300 and upstream. (cherry picked from commit 0001d4ab5070635cfecc146ee299bbb9fa45ca67) [rocm6.3_internal_testing] Fixed error string assertion in test_invalid_devices (#1607) Fixes pytorch#8974 (cherry picked from commit a688e0a) (cherry picked from commit b966e44) [rocm6.4_internal_testing] Skip non_standard_bool_values tests (#1880) Fixes SWDEV-509757 (cherry picked from commit 80b4c41) [rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 (#1882) The test fails with assertion error "Tensors are not close" After testing I can confirm that this issue is caused by eager mode execution specific to navi32 during the test_sdpa_2 run. Made a cross reference between navi31, navi32 and mi300. AOTInductor results are all the exact same for all of the archs, only the eager mode fails here for navi32 with 1.5% difference in tensor values from the gpu run. I assume that this happens due to fp16-32-16 conversions in eager mode or missing some if-statements for navi32 specifically. Simple reproducer to check the values for cpu/gpu/eager/aoti runs. [gfx1101_test_sdpa_2_issue_reproducer.txt](https://github.com/user-attachments/files/18676367/gfx1101_test_sdpa_2_issue_reproducer.txt) (cherry picked from commit 896c789) Fixed rocm skip import issue (#1949) skip_if_rocm does not exist in torch/testing/_internal/common_distributed.py. Use skipIfRocm from torch/testing/_internal/common_utils.py instead. (cherry picked from commit cfb673e) Skip certain unit tests on NAVI (#1950) This PR is to skip certain unit tests on NAVI only. Fixes SWDEV-509011 - test_sac_ilp.py::TestSACILP::test_sac_ilp_case1 Fixes SWDEV-509311 - test_max_autotune.py::TestMaxAutotune::test_non_contiguous_input_addmm Fixes SWDEV-510738 test_fsdp_sharded_grad_scaler.py::TestShardedGradScalerParityWithDDP::test_sharded_grad_scaler_found_inf (cherry picked from commit e86291a)

iupaikov-amd · 2025-05-06T14:30:57Z

!cherry-pick --onto release/2.6

okakarpa · 2025-05-06T14:32:54Z

Created branch autogenerated/release/2.6_cherry-pick_pr-1882 and #2092. It contains a merge conflict. Please resolve it

…pped sdpa_2 test in test_aot_inductor for Navi32 (#2092) Cherry-pick of #1882 Need to resolve conflicts --------- Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>

Skipped sdpa_2 test in test_aot_inductor

5d647c3

iupaikov-amd requested review from pruthvistony, jithunnair-amd and jataylo February 5, 2025 17:28

pruthvistony approved these changes Feb 19, 2025

View reviewed changes

pruthvistony merged commit 896c789 into rocm6.4_internal_testing Feb 19, 2025
9 of 13 checks passed

pruthvistony deleted the iupaikov_test_sdpa_2_skip_rocm6.4 branch February 19, 2025 17:23

okakarpa mentioned this pull request May 6, 2025

[AUTOGENERATED] [release/2.6] [rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #2092

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

[rocm6.4_internal_testing] [NAVI32] Skipped sdpa_2 test in test_aot_inductor for Navi32 #1882

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!