Cuda reduce in a consistent direction #1542
Conversation
It's not really a bug, but I'll take any attempt at a consistency fix that brings the GPU behavior closer to the CPU.
Yes, I see that the doc for max makes no guarantees about which index should be returned when several elements have the same value, so I personally don't rely on this. Do you think it would be realistic to get a consistent implementation between CPU and GPU, or is there some practical reason why this would be difficult? If you think this is feasible, I may take a pass at it (after the NIPS deadline though 😄).
Not without taking a pretty big performance hit. You would have to serialize the reduction, and that means less parallelism.
I don't think we should guarantee that CPU and GPU implementations return the same indices, but it seems the right thing to do as long as we don't have to sacrifice performance (as in this case).
Merged into master now.
The bug was discovered by @danielamassiceti.
Repro script:
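(The original script was not preserved in this excerpt; the snippet below is only an assumed reconstruction of the kind of repro that exposes the issue. The tensor shape and names are illustrative, and the indices quoted further down come from the original script, not from this sketch.)

```python
import torch

# Assumed reconstruction, not the original repro: reduce over a dimension
# in which several elements tie for the maximum and compare the index
# returned on CPU vs. CUDA.
x = torch.zeros(10)                 # every element ties for the maximum

cpu_val, cpu_idx = x.max(0)
gpu_val, gpu_idx = x.cuda().max(0)

print(cpu_idx.item())               # CPU picks one of the tied positions
print(gpu_idx.item())               # pre-fix CUDA may pick a different one
```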
With the current master, the max indices obtained for each print are:
-> 9, 0, 0, 0
So the CUDA implementation has a problem.
If you look here or here, you can see that the binaryOp function is called with the new value (the element later in the tensor) as its first argument and the running max as its second argument.
The checks in the comparison function default to the first argument in case of equality, so here that means defaulting to the new value. This is different from what is implemented on the CPU (see for example here).
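To make this concrete, here is a toy Python model of the element-wise step (the combine function and tuple layout are illustrative, not the actual THC code): with a comparator that keeps its first argument on ties and the new element passed first, the last of a run of equal values wins.

```python
def combine(a, b):
    # Pairs are (value, index). A strictly greater value wins; on a tie
    # the FIRST argument is kept, mirroring the comparator described above.
    return a if a[0] >= b[0] else b

values = [0.0] * 10                  # every element ties for the maximum
acc = (values[0], 0)
for i in range(1, len(values)):
    # CUDA-style call order: new element first, running max second.
    acc = combine((values[i], i), acc)
print(acc)                           # (0.0, 9): the LAST tied index wins
```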
This leads to the fix inside the struct. However, with just that fix applied, the indices returned by the repro for each print become:
-> 0, 0, 7, 0
which is still wrong.
The reason is that when the kernel reduces over the per-thread results, the element that comes earlier in the array is passed as the first argument and the one that comes later as the second argument.
This leads to the fix in kernelTransformReduceInnermostDimIndex.
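Here is a toy model of why the struct fix alone is not enough (again illustrative Python, not the actual kernel): after that fix the comparator keeps its second argument on ties, which is right for the element-wise stage but wrong for the per-thread stage, where the earlier chunk is passed first.

```python
def combine_keep_second_on_tie(a, b):
    # Pairs are (value, index). After the struct fix, the SECOND argument
    # is kept on a tie (the running max in the element-wise stage).
    return b if b[0] >= a[0] else a

values = [0.0] * 8                    # all elements tie
chunks = [(0, 4), (4, 8)]             # two "threads", each owning a chunk

# Stage 1: element-wise reduction inside each thread, called as
# combine(new_element, running_max) -> each thread keeps its earliest index.
partials = []
for lo, hi in chunks:
    acc = (values[lo], lo)
    for i in range(lo + 1, hi):
        acc = combine_keep_second_on_tie((values[i], i), acc)
    partials.append(acc)
print(partials)                       # [(0.0, 0), (0.0, 4)]

# Stage 2: combining the per-thread results, called as
# combine(earlier_chunk, later_chunk) -> the LATER chunk wins the tie.
result = partials[0]
for p in partials[1:]:
    result = combine_keep_second_on_tie(result, p)
print(result)                         # (0.0, 4): still not the earliest index
```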
An alternative fix would be to leave everything I changed as it was and instead swap the order of the arguments here and here. That would probably be cleaner.
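In the same toy model (still illustrative, not the real code), the alternative amounts to keeping the original comparator, which keeps its first argument on ties, and having every call site pass the running result / earlier chunk first, so both stages agree.

```python
def combine_keep_first_on_tie(a, b):
    # Original comparator: strictly greater value wins; on a tie the
    # FIRST argument is kept.
    return a if a[0] >= b[0] else b

values = [0.0] * 8                    # all elements tie
acc = (values[0], 0)
for i in range(1, len(values)):
    # Swapped call order: running result first, new element second.
    acc = combine_keep_first_on_tie(acc, (values[i], i))
print(acc)                            # (0.0, 0): the earliest tied index is kept
```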
Let me know what you think. Should I add a test for this?
EDIT: I actually went ahead and did the alternate fix. I find it cleaner.