-
-
Notifications
You must be signed in to change notification settings - Fork 935
Use custom less instead of specializing thrust #8446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cupy/cuda/cupy_thrust.cu
Outdated
#endif | ||
bool thrust::less<float>::operator() ( | ||
const float& lhs, const float& rhs) const { | ||
struct thrust::less<float> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: I would think that we'd consider adding user-defined specializations for thrust::less<float>
to still be invalid.
For example, I believe in thrust::sort
we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).
If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.
TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less
directly and just add a custom comparison type (like cupy::less
) and pass that into the thrust::sort
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: I think the easiest solution here is to change this code from using and specializing
thrust::less
directly and just add a custom comparison type (likecupy::less
) and pass that into thethrust::sort
call.
Yes this sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I suspected, calling thrust::sort(..., thrust::less<float>())
with a custom specialization of thrust::less<float>
will never invoke that specialization. It dispatches to the RadixSort code path.
So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.
https://godbolt.org/z/scezav1Yv
Same story for thrust::stable_sort
: https://godbolt.org/z/n56EPbo3Y
TL;DR: This is all the more reason that specializing thrust::less
is incorrect and cupy needs to define it's own custom cupy::less
operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.
This is actually not true. Let's discuss this example offline tomorrow when Bernhard is up:
>>> a = cp.random.random((3, 5), dtype=cp.float32)
>>> cp.sort(a, axis=-1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: I would think that we'd consider adding user-defined specializations for
thrust::less<float>
to still be invalid.
Yes we do. But continuing to provide a specialization for thrust::less
seemed the least different change and it would at least work with the planned changes in CCCL.
For example, I believe in
thrust::sort
we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.
This is what's probably happening here.
TL;DR: I think the easiest solution here is to change this code from using and specializing
thrust::less
directly and just add a custom comparison type (likecupy::less
) and pass that into thethrust::sort
call.
I had to do a more complicated version, because we only want a custom less function object for the cases where the comparison semantics are changed. For integers, we would still want to use thrust::less
to benefit, e.g., from radix sort.
Thanks, @bernhardmgruber! Appreciate for the help!
Not sure what kind of errors that you ran into, care to share the full output? This is how we can quickly spin up a working environment as of today (including creating a working CUDA env): conda create -n cupy_cuda122_dev -y python=3.12 "numpy<2" "cython<3" fastrlock git pytest cuda cuda-version=12.2
conda activate cupy_cuda122_dev
git clone --recursive https://github.com/cupy/cupy.git
cd cupy
export CUPY_NVCC_GENERATE_CODE="arch=compute_86,code=sm_86;" # update to limit the build for your GPUs; a list of arch can be supplied with semi-colons
export CUPY_NUM_BUILD_JOBS=16
git submodule update --init --recursive
pip install --no-deps --no-build-isolation -v -e .
CUDA_PATH=$CONDA_PREFIX/targets/x86_64-linux/ pytest tests/cupy_tests/sorting_tests/ -x # -x to fail on first error |
07ec664
to
71d712b
Compare
I found out in the meantime that |
/test full |
This comment has been minimized.
This comment has been minimized.
I tried running that to setup an environment
but I get an error message that I could setup the packages manually outside conda. However, running the tests like: pytest tests/cupy_tests/sorting_tests/ -x starts executing, but crashes my display driver after several seconds :S I am investigating. |
543f47b
to
d15a26e
Compare
This pull request is now in conflicts. Could you fix it @bernhardmgruber? 🙏 |
It seems there was a conflict after merging #8412. Could you resolve conflicts @bernhardmgruber?
Several CuPy tests (with |
68ece90
to
dd145ae
Compare
I rebased my changes and noticed that I had to readd the custom comparison operator for I can run some more tests next week. |
dd145ae
to
323802d
Compare
It seems NaNs order correctly with CTK 12.2 in the absence of custom comparison operators when running With CTK 12.4 I got some weird errors about failing compilation here and there, so my tests may have been wrong. I would propose to let the CI be the arbiter of truth now. I don't trust my local setup :D |
/test mini |
Seeing a bunch of test failures on CI. Here is one example: ____________________________ TestLexsort.test_nan2 _____________________________
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:844: in test_func
impl(*args, **kw)
args = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)
dtype = 'f'
dtypes = 'efdFD'
impl = <function TestLexsort.test_nan2 at 0x7f855ebc0ae0>
kw = {'dtype': <class 'numpy.float32'>}
name = 'dtype'
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:363: in test_func
check_func(cupy_r, numpy_r)
accept_error = False
args = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)
check_func = <function numpy_cupy_array_equal.<locals>.check_func at 0x7f855ebc0680>
check_sparse_format = True
contiguous_check = False
cupy_error = None
cupy_numpy_result_ndarrays = [(array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]), array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]))]
cupy_r = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])
cupy_result = (array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]),)
impl = <function TestLexsort.test_nan2 at 0x7f855ebc09a0>
kw = {'dtype': <class 'numpy.float32'>}
mask = None
masks = [None]
name = 'xp'
numpy_error = None
numpy_r = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])
numpy_result = (array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]),)
scipy_name = None
skip = False
sp_name = None
type_check = True
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:667: in check_func
_array.assert_array_equal(x, y, err_msg, verbose, strides_check)
err_msg = ''
strides_check = False
verbose = True
x = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])
y = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])
/root/.local/lib/python3.11/site-packages/cupy/testing/_array.py:95: in assert_array_equal
numpy.testing.assert_array_equal(
err_msg = ''
kwargs = {}
strides_check = False
verbose = True
x = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])
y = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|
323802d
to
6d68ccf
Compare
/test mini |
Alright, I readded the specializations for
My understanding is the following:
We have two options:
What would you prefer? |
^ @kmaehashi what do you think? |
If we are unsure how to proceed here, would it make sense to defer this for the next release (IOW a later 13.x or simply 14.0.0) so we have more time to evaluate options (and possibly improve on them)? |
Sorry for dropping the ball, I think the performance of FP sorting should be prioritized over working with NaNs "correctly" (aligning with NumPy's specification), but would love to have inputs from other forks @takagi @asi1024 @leofang.
I agree this is not a blocker for the next release. |
6d68ccf
to
19f1fcb
Compare
The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior. For sorting floating-points, thrust::less is used directly, since then thrust uses radix sort, which sorts NaNs to the back.
19f1fcb
to
4141f4a
Compare
I implemented approach 2. now. You should see the same performance as before. |
/test full |
I agree with the current change that delegates floating points sort to thrust that uses radix sort for that. |
CIs are congested to prepare for the release this week. Let me cancel CI runs for this one 🙇 I will re-kick the CI afterwards. |
Sorry for my late reply, slowly catching up after returning from time-off...
IIRC during past array API meetings @seberg had noted that the NaN-sorting behavior must be consistent between
It seems the NaN discussion is now moot? 🙂 |
/test full |
That said, one questions I raised (internally) is still unresolved: That Thrust uses radix sort under the hood is an implementation detail as far as CuPy is concerned. How do we proactively ensure that future Thrust updates would not break CuPy? This is a run-time behavior that needs an actual execution to confirm. |
One thing was about
|
The sorting order should be the same as before.
I see three options:
Mind that I cannot promise much time to work on any of these, unfortunately :S |
xref: data-apis/array-api#722
I think 3 (merging as-is) is the way to go for now, and let's track 1 as a long-term solution. Thanks, Bernhard! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate for the help, @bernhardmgruber, LGTM!
/test full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Use custom less instead of specializing thrust
Thanks @bernhardmgruber! |
The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior.
I could not figure out how to run the unit tests covering the C++ part from your documentation.
pip install -e .
ran successfully, butpython -m pytest
shows some errors due to missing packages that I failed to install. I also have to admit, that my Python skills a horribly limited :SBesides, there are some further improvements that could be made tocupy_thrust.cu
, which depend on your minimum C++ standard and CCCL/rocthrust version. E.g.make_zip_iterator
does not requiremake_tuple
anymore.As a drive-by improvement, I dropped
make_tuple
when callingmake_zip_iterator
, which is no longer required.