[ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor by hongxiayang · Pull Request #144007 · pytorch/pytorch · GitHub

Merged

Conversation

hongxiayang (Collaborator) commented Dec 31, 2024

Fixes #136291

This PR fixes the "invalid configuration argument" error that occurs on ROCm when torch.layer_norm is called on a large input tensor.

 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
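For context, a minimal reproducer along the lines of the one in issue #136291 (this sketch is illustrative, not copied verbatim from the issue):

```python
import torch
import torch.nn.functional as F

# Shape from issue #136291; layer_norm over the last dimension.
x = torch.randn(16, 3000, 3000, 16, device="cuda", requires_grad=True)
y = F.layer_norm(x, normalized_shape=(16,))
# Before this fix, the kernel launch on ROCm failed with
# "HIP error: invalid configuration argument" for inputs this large.
y.sum().backward()
```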

After investigation, I found the reason this error happens: the AMD compute language runtime checks whether gridDim.x * blockDim.x is greater than std::numeric_limits<uint32_t>::max(). If it is, the launch errors out with the "invalid configuration argument" message.
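To see why this input trips that check, here is the back-of-the-envelope arithmetic, assuming one block per normalized row and an illustrative block size of 256 threads (the actual launch parameters of vectorized_layer_norm_kernel may differ):

```python
# Rows being normalized for input [16, 3000, 3000, 16] with
# normalized_shape=(16,): all leading dimensions are flattened.
M = 16 * 3000 * 3000      # 144,000,000 rows
block_x = 256             # illustrative threads per block
grid_x = M                # assume one block per row

print(grid_x * block_x)   # 36,864,000,000
print(2**32 - 1)          # 4,294,967,295 -> the launch is rejected
```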

The fix is to split the whole task into several chunks so that no single kernel launch triggers the failure condition, as sketched below. This ensures correctness and completeness given the current kernel implementation logic of vectorized_layer_norm_kernel.
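A sketch of the chunking idea, written in Python for clarity (the actual fix lives in the HIP/CUDA launch code; launch_kernel and row_offset are hypothetical names):

```python
UINT32_MAX = 2**32 - 1

def launch_in_chunks(num_rows, block_x, launch_kernel):
    # Largest grid size per launch such that gridDim.x * blockDim.x
    # stays within uint32 range, avoiding the runtime check above.
    max_blocks_per_launch = UINT32_MAX // block_x
    start = 0
    while start < num_rows:
        blocks = min(max_blocks_per_launch, num_rows - start)
        # Each chunk processes rows [start, start + blocks).
        launch_kernel(grid_x=blocks, block_x=block_x, row_offset=start)
        start += blocks
```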

Also added a large-tensor layer_norm unit test, test_layer_norm_large_tensor, with the same shape [16, 3000, 3000, 16] as the one used in issue #136291, so that the test can check the expected output values and ensure correctness.
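For illustration, a hypothetical check in the spirit of that test (the actual test_layer_norm_large_tensor in the PR may be written differently):

```python
import torch
import torch.nn.functional as F

x = torch.ones(16, 3000, 3000, 16, device="cuda")
y = F.layer_norm(x, normalized_shape=(16,))
# A constant input has zero variance per row, so layer_norm maps it
# to all zeros; that is cheap to verify even at this size.
assert torch.equal(y, torch.zeros_like(y))
```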

Future work may include performance optimization of layer_norm and integration of CK layer_norm.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @naromero77amd

On Dec 31, 2024, pytorch-bot added labels: ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for Pytorch), release notes: cuda (release notes category).
pytorch-bot bot commented Dec 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144007

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b67eb8b with merge base 2966fb3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eqy (Collaborator) left a comment

Please add a test for this case

hongxiayang (Collaborator, Author) commented Dec 31, 2024

> Please add a test for this case

Will do. Added a large tensor test.

hongxiayang (Collaborator, Author) commented

@eqy @malfet: All checks are green now. Could you help merge this PR? Thanks.

eqy (Collaborator) commented Jan 7, 2025

@pytorchmergebot merge

On Jan 7, 2025, pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request).
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, release notes: cuda
Development

Successfully merging this pull request may close these issues.

On AMD GPUs (ROCm 5.7-6.2), cannot backpropagate loss tensor containing more than 2e8 elements