Cuda reduce in a consistent direction by bunelr · Pull Request #1542 · pytorch/pytorch · GitHub

Cuda reduce in a consistent direction #1542


Closed
wants to merge 2 commits

Conversation

@bunelr (Contributor) commented May 12, 2017

The bug was discovered by @danielamassiceti.

Repro script:

import torch
x = torch.zeros(10,1).cuda()
print(torch.max(x, 0))

y = torch.zeros(10,1)
print(torch.max(y, 0))


x = torch.zeros(10).cuda()
print(torch.max(x, 0))

y = torch.zeros(10)
print(torch.max(y, 0))

With the current master, the indices returned as the max for each of these prints are:
-> 9, 0, 0, 0

So the CUDA implementation has a problem.
If you look here or here, you can see that the binaryOp function is called with the new value (the one later in the tensor) as the first argument and the running max as the second argument.

The comparison function defaults to the first argument in case of equality, so here it would default to the new value. This is different from what is implemented on the CPU (see for example here).
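To make the tie-breaking concrete, here is a minimal sketch in plain Python (not the actual THC code; binary_op and the (value, index) pairs are just stand-ins) showing how a rule that keeps the first argument on ties picks a different index depending on the call order:

def binary_op(a, b):
    # (value, index) pairs; the first argument wins when the values are equal
    return a if a[0] >= b[0] else b

values = [0.0] * 10  # all elements equal, as in the repro script

# GPU-style call order: new element first, running max second -> the last index survives
running = (values[0], 0)
for i in range(1, len(values)):
    running = binary_op((values[i], i), running)
print(running)  # (0.0, 9)

# CPU-style call order: running max first, new element second -> the first index survives
running = (values[0], 0)
for i in range(1, len(values)):
    running = binary_op(running, (values[i], i))
print(running)  # (0.0, 0)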

This leads to the fix inside the struct. However, if we apply these fixes alone, the indices returned by the repro script become:
-> 0, 0, 7, 0

which is still wrong.

The reason is that when reducing over the per-thread results in the kernel, the arguments are passed with the element that comes earlier in the array as the first argument and the element that comes later as the second argument.

This leads to the fix in kernelTransformReduceInnermostDimIndex.

An alternative fix would be to keep everything that I've changed as it was and swap the order of the arguments here and here. This would probably be cleaner.
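As a rough sketch of the convention that fix converges on (hypothetical chunking in plain Python, not the real kernel code): the comparison keeps the first argument on ties, and both the per-element step and the combine over per-thread partial results pass the earlier candidate first, so the smallest index survives at every stage:

def binary_op(a, b):
    # The first argument wins when the values are equal
    return a if a[0] >= b[0] else b

values = [0.0] * 10
chunk_size = 3  # hypothetical chunking standing in for the threads of a block

# Stage 1: each "thread" scans its chunk, always passing the running (earlier) pair first
partials = []
for start in range(0, len(values), chunk_size):
    running = (values[start], start)
    for i in range(start + 1, min(start + chunk_size, len(values))):
        running = binary_op(running, (values[i], i))
    partials.append(running)

# Stage 2: combine the partial results in the same (earlier, later) argument order
result = partials[0]
for p in partials[1:]:
    result = binary_op(result, p)

print(result)  # (0.0, 0) -- the earliest index, matching the CPU behavior

If either stage flips its argument order relative to the tie-breaking rule, an index from a later position can survive instead, which is exactly the inconsistency described above.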

Let me know what you think. Should I add a test for this?

EDIT: I actually went ahead and did the alternate fix. I find it cleaner.

@soumith (Member) commented May 12, 2017

It's not really a bug, but I'll take any attempt at making the behavior more consistent with the CPU.
However, be aware that your fix only makes things consistent between CPU and GPU in certain small cases. CPU and GPU will still give different indices in many other situations.

@bunelr (Contributor, Author) commented May 12, 2017

Yes, I see that the docs for max make no guarantees about which index should be returned when several elements have the same value, so I personally don't rely on this.

Do you think it would be realistic to get a consistent implementation between CPU and GPU or is there some practical reason why this would be difficult?

If you think this is feasible, I may take a pass at it (after the NIPS deadline though 😄).

@soumith (Member) commented May 13, 2017

Do you think it would be realistic to get a consistent implementation between CPU and GPU or is there some practical reason why this would be difficult?

Not without taking a pretty big performance hit. You would have to sequentialize the reduction, and that means less parallelism.

@apaszke (Contributor) left a comment


I don't think we should guarantee that CPU and GPU implementations return the same indices, but it seems the right thing to do as long as we don't have to sacrifice performance (as in this case).

@soumith (Member) commented May 15, 2017

merged into master now.

@soumith soumith closed this May 15, 2017
houseroad added a commit to houseroad/pytorch that referenced this pull request Dec 4, 2018
…013547

Summary:
Previous import was 6b34743d2e361bbc0acb29dd73536478cb92562e

Included changes:
- **[4280470](onnx/onnx@4280470)**: Changes done internally at Facebook (pytorch#1668) <Lu Fang>
- **[f85221f](onnx/onnx@f85221f)**: Fuse MatMul and Add into Gemm (pytorch#1542) <vloncar>
- **[022230e](onnx/onnx@022230e)**: Replace np.long by np.int64 (pytorch#1664) <G. Ramalingam>
- **[0ab3c95](onnx/onnx@0ab3c95)**: Infer shape from data in Constant nodes (pytorch#1667) <Shinichiro Hamaji>

Differential Revision: D13330082

fbshipit-source-id: 8bc0362533482e0edc5438642151a46eca67f18f
facebook-github-bot pushed a commit that referenced this pull request Dec 5, 2018
…013547 (#14777)

Summary:
Pull Request resolved: #14777

Previous import was 6b34743d2e361bbc0acb29dd73536478cb92562e

Included changes:
- **[4280470](onnx/onnx@4280470)**: Changes done internally at Facebook (#1668) <Lu Fang>
- **[f85221f](onnx/onnx@f85221f)**: Fuse MatMul and Add into Gemm (#1542) <vloncar>
- **[022230e](onnx/onnx@022230e)**: Replace np.long by np.int64 (#1664) <G. Ramalingam>
- **[0ab3c95](onnx/onnx@0ab3c95)**: Infer shape from data in Constant nodes (#1667) <Shinichiro Hamaji>

Reviewed By: bddppq

Differential Revision: D13330082

fbshipit-source-id: 13cf328626cf872d0983bbd2154d95c45da70f1c
zasdfgbnm pushed a commit to zasdfgbnm/pytorch that referenced this pull request Mar 31, 2022
To highlight the impact of the change, renamed `IterDomain::clone()` to `IterDomain::cloneWithoutRFactor()`.
petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 23, 2024
…ronment variables (pytorch#1542)

Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly.

This is a cherry-picked version of pytorch#133866.

Tested with `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py develop --user` and `python -c 'import torch'`.
akashveramd pushed a commit to akashveramd/pytorch that referenced this pull request Apr 9, 2025
* Introduce ck_host library and gemm_softmax_gemm.

* Minor refactor.

* Add descriptor to gemm_softmax_gemm.

* Bugfix.

* Revert ck_host library.

* fix clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>