Cuda reduce in a consistent direction #1542
Conversation
It's not really a bug, but I'll take any attempt at a consistency fix that brings the GPU behavior closer to the CPU.
Yes, I see that the doc for max makes no guarantees about which index should be returned when several elements have the same value, so I personally don't rely on this. Do you think it would be realistic to get a consistent implementation between CPU and GPU, or is there some practical reason why this would be difficult? If you think this is feasible, I may take a pass at it (after the NIPS deadline though 😄).
Not without taking a pretty big performance hit. You would have to serialize the reduction, and that means less parallelism.
I don't think we should guarantee that CPU and GPU implementations return the same indices, but it seems the right thing to do as long as we don't have to sacrifice performance (as in this case).
Merged into master now.
The bug was discovered by @danielamassiceti.
Repro script:
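(The original script was not preserved in this excerpt; the snippet below is only an assumed reconstruction of the kind of repro that exposes the issue. The tensor shape and names are illustrative, and the indices quoted further down come from the original script, not from this sketch.)

```python
import torch

# Assumed reconstruction, not the original repro: reduce over a dimension
# in which several elements tie for the maximum and compare the index
# returned on CPU vs. CUDA.
x = torch.zeros(10)                 # every element ties for the maximum

cpu_val, cpu_idx = x.max(0)
gpu_val, gpu_idx = x.cuda().max(0)

print(cpu_idx.item())               # CPU picks one of the tied positions
print(gpu_idx.item())               # pre-fix CUDA may pick a different one
```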
With the current master, the max indices obtained for each print are:
-> 9, 0, 0, 0
So the CUDA implementation has a problem.
If you look here or here, you can see that the binaryOp function is called with the new value (the element later in the tensor) as its first argument and the running max as its second argument.
The checks in the comparison function default to the first argument in case of equality, so here that means defaulting to the new value. This is different from what is implemented on the CPU (see for example here).
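To make this concrete, here is a toy Python model of the element-wise step (the combine function and tuple layout are illustrative, not the actual THC code): with a comparator that keeps its first argument on ties and the new element passed first, the last of a run of equal values wins.

```python
def combine(a, b):
    # Pairs are (value, index). A strictly greater value wins; on a tie
    # the FIRST argument is kept, mirroring the comparator described above.
    return a if a[0] >= b[0] else b

values = [0.0] * 10                  # every element ties for the maximum
acc = (values[0], 0)
for i in range(1, len(values)):
    # CUDA-style call order: new element first, running max second.
    acc = combine((values[i], i), acc)
print(acc)                           # (0.0, 9): the LAST tied index wins
```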
This leads to the fix inside the struct. However, with just that fix applied, the indices returned by the repro for each print become:
-> 0, 0, 7, 0
which is still wrong.
The reason is that when the kernel reduces over the per-thread results, the element that comes earlier in the array is passed as the first argument and the one that comes later as the second argument.
This leads to the fix in kernelTransformReduceInnermostDimIndex.
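Here is a toy model of why the struct fix alone is not enough (again illustrative Python, not the actual kernel): after that fix the comparator keeps its second argument on ties, which is right for the element-wise stage but wrong for the per-thread stage, where the earlier chunk is passed first.

```python
def combine_keep_second_on_tie(a, b):
    # Pairs are (value, index). After the struct fix, the SECOND argument
    # is kept on a tie (the running max in the element-wise stage).
    return b if b[0] >= a[0] else a

values = [0.0] * 8                    # all elements tie
chunks = [(0, 4), (4, 8)]             # two "threads", each owning a chunk

# Stage 1: element-wise reduction inside each thread, called as
# combine(new_element, running_max) -> each thread keeps its earliest index.
partials = []
for lo, hi in chunks:
    acc = (values[lo], lo)
    for i in range(lo + 1, hi):
        acc = combine_keep_second_on_tie((values[i], i), acc)
    partials.append(acc)
print(partials)                       # [(0.0, 0), (0.0, 4)]

# Stage 2: combining the per-thread results, called as
# combine(earlier_chunk, later_chunk) -> the LATER chunk wins the tie.
result = partials[0]
for p in partials[1:]:
    result = combine_keep_second_on_tie(result, p)
print(result)                         # (0.0, 4): still not the earliest index
```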
An alternative fix would be to leave everything I changed as it was and instead swap the order of the arguments here and here. That would probably be cleaner.
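In the same toy model (still illustrative, not the real code), the alternative amounts to keeping the original comparator, which keeps its first argument on ties, and having every call site pass the running result / earlier chunk first, so both stages agree.

```python
def combine_keep_first_on_tie(a, b):
    # Original comparator: strictly greater value wins; on a tie the
    # FIRST argument is kept.
    return a if a[0] >= b[0] else b

values = [0.0] * 8                    # all elements tie
acc = (values[0], 0)
for i in range(1, len(values)):
    # Swapped call order: running result first, new element second.
    acc = combine_keep_first_on_tie(acc, (values[i], i))
print(acc)                            # (0.0, 0): the earliest tied index is kept
```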
Let me know what you think. Should I add a test for this?
EDIT: I actually went ahead and did the alternate fix. I find it cleaner.