Use custom less instead of specializing thrust #8446

bernhardmgruber · 2024-07-31T13:31:15Z

The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior.

I could not figure out how to run the unit tests covering the C++ part from your documentation. pip install -e . ran successfully, but python -m pytest shows some errors due to missing packages that I failed to install. I also have to admit, that my Python skills a horribly limited :S

Besides, there are some further improvements that could be made to cupy_thrust.cu, which depend on your minimum C++ standard and CCCL/rocthrust version. E.g. make_zip_iterator does not require make_tuple anymore.
As a drive-by improvement, I dropped make_tuple when calling make_zip_iterator, which is no longer required.

jrhemstad · 2024-07-31T17:18:19Z

cupy/cuda/cupy_thrust.cu

-#endif
-bool thrust::less<float>::operator() (
-    const float& lhs, const float& rhs) const {
+struct thrust::less<float> {


important: I would think that we'd consider adding user-defined specializations for thrust::less<float> to still be invalid.

For example, I believe in thrust::sort we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).

If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

Yes this sounds good to me.

As I suspected, calling thrust::sort(..., thrust::less<float>()) with a custom specialization of thrust::less<float> will never invoke that specialization. It dispatches to the RadixSort code path.

So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.

https://godbolt.org/z/scezav1Yv

Same story for thrust::stable_sort: https://godbolt.org/z/n56EPbo3Y

TL;DR: This is all the more reason that specializing thrust::less is incorrect and cupy needs to define it's own custom cupy::less operator.

So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.

This is actually not true. Let's discuss this example offline tomorrow when Bernhard is up:

>>> a = cp.random.random((3, 5), dtype=cp.float32) >>> cp.sort(a, axis=-1)

important: I would think that we'd consider adding user-defined specializations for thrust::less<float> to still be invalid.

Yes we do. But continuing to provide a specialization for thrust::less seemed the least different change and it would at least work with the planned changes in CCCL.

For example, I believe in thrust::sort we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).

If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.

This is what's probably happening here.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

I had to do a more complicated version, because we only want a custom less function object for the cases where the comparison semantics are changed. For integers, we would still want to use thrust::less to benefit, e.g., from radix sort.

leofang · 2024-07-31T19:37:46Z

Thanks, @bernhardmgruber! Appreciate for the help!

I could not figure out how to run the unit tests covering the C++ part from your documentation. pip install -e . ran successfully, but python -m pytest shows some errors due to missing packages that I failed to install. I also have to admit, that my Python skills a horribly limited :S

Not sure what kind of errors that you ran into, care to share the full output?

This is how we can quickly spin up a working environment as of today (including creating a working CUDA env):

conda create -n cupy_cuda122_dev -y python=3.12 "numpy<2" "cython<3" fastrlock git pytest cuda cuda-version=12.2
conda activate cupy_cuda122_dev
git clone --recursive https://github.com/cupy/cupy.git
cd cupy
export CUPY_NVCC_GENERATE_CODE="arch=compute_86,code=sm_86;"  # update to limit the build for your GPUs; a list of arch can be supplied with semi-colons
export CUPY_NUM_BUILD_JOBS=16
git submodule update --init --recursive
pip install --no-deps --no-build-isolation -v -e .
CUDA_PATH=$CONDA_PREFIX/targets/x86_64-linux/ pytest tests/cupy_tests/sorting_tests/ -x  # -x to fail on first error

bernhardmgruber · 2024-07-31T23:13:05Z

This is how we can quickly spin up a working environment as of today (including creating a working CUDA env):

...

I found out in the meantime that pip install -e . compiles the code, so that helped! If I have to track down any CI errors, I will certainly be glad to have your instructions!

takagi · 2024-08-01T04:50:52Z

/test full

bernhardmgruber · 2024-08-01T14:23:32Z

I tried running that to setup an environment

conda create -n cupy_cuda122_dev -y python=3.12 "numpy<2" "cython<3" fastrlock git pytest cuda cuda-version=12.2

but I get an error message that nothing provides __win, which sounds weird given I am running on Ubuntu 24.04.

I could setup the packages manually outside conda. However, running the tests like:

pytest tests/cupy_tests/sorting_tests/ -x

starts executing, but crashes my display driver after several seconds :S I am investigating.

mergify · 2024-08-07T06:00:39Z

This pull request is now in conflicts. Could you fix it @bernhardmgruber? 🙏

kmaehashi · 2024-08-08T07:04:18Z

It seems there was a conflict after merging #8412. Could you resolve conflicts @bernhardmgruber?

I could setup the packages manually outside conda. However, running the tests like:
pytest tests/cupy_tests/sorting_tests/ -x
starts executing, but crashes my display driver after several seconds :S I am investigating.

Several CuPy tests (with @testing.slow annotation) try to allocate few GB memory, so it may be the cause of the crash. Please try skipping them by python -m "not slow" tests/cupy_tests/sorting_tests/ -x.

bernhardmgruber · 2024-08-09T16:06:51Z

I rebased my changes and noticed that I had to readd the custom comparison operator for float and double. It seems that CUB's radix sort, after the CCCL update you did with #8412, no longer produces the desired order for NaNs.

I can run some more tests next week.

bernhardmgruber · 2024-08-12T08:20:55Z

I rebased my changes and noticed that I had to readd the custom comparison operator for float and double. It seems that CUB's radix sort, after the CCCL update you did with #8412, no longer produces the desired order for NaNs.

It seems NaNs order correctly with CTK 12.2 in the absence of custom comparison operators when running pytest tests/cupy_tests/sorting_tests/test_sort.py -k "not slow".

With CTK 12.4 I got some weird errors about failing compilation here and there, so my tests may have been wrong.

I would propose to let the CI be the arbiter of truth now. I don't trust my local setup :D

jakirkham · 2024-08-12T21:23:55Z

/test mini

jakirkham · 2024-08-13T04:13:00Z

Seeing a bunch of test failures on CI. Here is one example:

____________________________ TestLexsort.test_nan2 _____________________________
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:844: in test_func	
    impl(*args, **kw)	
        args       = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)	
        dtype      = 'f'	
        dtypes     = 'efdFD'	
        impl       = <function TestLexsort.test_nan2 at 0x7f855ebc0ae0>	
        kw         = {'dtype': <class 'numpy.float32'>}	
        name       = 'dtype'	
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:363: in test_func	
    check_func(cupy_r, numpy_r)	
        accept_error = False	
        args       = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)	
        check_func = <function numpy_cupy_array_equal.<locals>.check_func at 0x7f855ebc0680>	
        check_sparse_format = True	
        contiguous_check = False	
        cupy_error = None	
        cupy_numpy_result_ndarrays = [(array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]), array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]))]	
        cupy_r     = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        cupy_result = (array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]),)	
        impl       = <function TestLexsort.test_nan2 at 0x7f855ebc09a0>	
        kw         = {'dtype': <class 'numpy.float32'>}	
        mask       = None	
        masks      = [None]	
        name       = 'xp'	
        numpy_error = None	
        numpy_r    = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
        numpy_result = (array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]),)	
        scipy_name = None	
        skip       = False	
        sp_name    = None	
        type_check = True	
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:667: in check_func	
    _array.assert_array_equal(x, y, err_msg, verbose, strides_check)	
        err_msg    = ''	
        strides_check = False	
        verbose    = True	
        x          = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        y          = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
/root/.local/lib/python3.11/site-packages/cupy/testing/_array.py:95: in assert_array_equal	
    numpy.testing.assert_array_equal(	
        err_msg    = ''	
        kwargs     = {}	
        strides_check = False	
        verbose    = True	
        x          = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        y          = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

kmaehashi · 2024-08-13T07:06:40Z

cupy_tests/sorting_tests/test_sort.py::TestLexsort::test_nan2 and cupy_tests/sorting_tests/test_sort.py::TestLexsort::test_nan3 look failing with all of CUDA 11.2/11.8/12.0/12.5.

jakirkham · 2024-08-13T07:39:33Z

/test mini

bernhardmgruber · 2024-08-13T11:49:51Z

Alright, I readded the specializations for float and double. The one failing test seems unrelated now:

FAILED cupyx_tests/scipy_tests/interpolate_tests/test_bspline.py::TestBSpline::test_bspline_derivative_jumps[_param_4_{c=[[-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1]], extrapolate=False}]

My understanding is the following:

we would not need the specializations for float and double when radix sort is used, because it already orderes NaNs to the back. This probably holds true for half as well (not tested). So this PR may be a pessizimation for those types now, which will now use merge sort.
we do need the custom comparison for elem_less used by _lexsort, because it actually compares using the functor.

We have two options:

Merge the current implementation (potentially pessimizing floating-point sorting). @leofang mentioned to me somewhere that he considers using CUB directly at some point, so then you would regain the perf of radix sort.
Implement additional handling whether the comparison is needed for radix sort or for elem_less. Retains the current perf, but implicitly relies on the assumption that radix sort will order NaNs implicitly.

What would you prefer?

jakirkham · 2024-08-14T05:04:18Z

^ @kmaehashi what do you think?

jakirkham · 2024-08-19T05:14:48Z

If we are unsure how to proceed here, would it make sense to defer this for the next release (IOW a later 13.x or simply 14.0.0) so we have more time to evaluate options (and possibly improve on them)?

kmaehashi · 2024-08-19T05:51:31Z

Sorry for dropping the ball, I think the performance of FP sorting should be prioritized over working with NaNs "correctly" (aligning with NumPy's specification), but would love to have inputs from other forks @takagi @asi1024 @leofang.

would it make sense to defer this for the next release

I agree this is not a blocker for the next release.

The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior. For sorting floating-points, thrust::less is used directly, since then thrust uses radix sort, which sorts NaNs to the back.

bernhardmgruber · 2024-08-19T13:58:20Z

I implemented approach 2. now. You should see the same performance as before.

/test full

takagi · 2024-08-20T05:49:34Z

I agree with the current change that delegates floating points sort to thrust that uses radix sort for that.

kmaehashi · 2024-08-20T08:08:54Z

CIs are congested to prepare for the release this week. Let me cancel CI runs for this one 🙇 I will re-kick the CI afterwards.

leofang · 2024-08-26T12:36:31Z

Sorry for my late reply, slowly catching up after returning from time-off...

I think the performance of FP sorting should be prioritized over working with NaNs "correctly" (aligning with NumPy's specification),

IIRC during past array API meetings @seberg had noted that the NaN-sorting behavior must be consistent between sort & argsort, and choosing to sort them toward the end (i.e. following NumPy) is one such consistent choice but not the only one. I believe the note here is up-to-date: https://data-apis.org/array-api/2023.12/API_specification/sorting_functions.html

I implemented approach 2. now. You should see the same performance as before.

It seems the NaN discussion is now moot? 🙂

leofang · 2024-08-26T12:37:22Z

/test full

leofang · 2024-08-26T12:41:53Z

That said, one questions I raised (internally) is still unresolved: That Thrust uses radix sort under the hood is an implementation detail as far as CuPy is concerned. How do we proactively ensure that future Thrust updates would not break CuPy? This is a run-time behavior that needs an actual execution to confirm.

seberg · 2024-08-26T13:16:52Z

One thing was about top_k as a future function, which gives the K-largest or K-smallest values. If using a (partial) sort, that sort would be nice to formalize to one of:

undefined (always an OK choice if documented)
sort to end (ignore NaNs as much as possible, like a dataframe library would do it)
sort to front (equivalent to propagating NaNs)
Sorting to end is consistent with top_k not propagating NaNs (dataframe).

bernhardmgruber · 2024-08-28T13:53:29Z

It seems the NaN discussion is now moot? 🙂

The sorting order should be the same as before.

That Thrust uses radix sort under the hood is an implementation detail as far as CuPy is concerned. How do we proactively ensure that future Thrust updates would not break CuPy?

I see three options:

Program against the CUB API directly, because then the sorting algorithm is your choice. That's a lot of restructuring, but probably better long-term anyway, since you can also use radix sort for your tuples using the right adaptors.
Statically assert that the Thrust CUDA/CUB backend dispatches to radix sort: static_assert(thrust::cuda_cub::__smart_sort::can_use_primitive_sort<...>). However, this trait is internal. We could start a discussion of exposing it. This is the easy, but hacky route.
Merge the PR as.

Mind that I cannot promise much time to work on any of these, unfortunately :S

leofang · 2024-09-08T19:34:00Z

One thing was about top_k as a future function, which gives the K-largest or K-smallest values.

xref: data-apis/array-api#722

I see three options:

I think 3 (merging as-is) is the way to go for now, and let's track 1 as a long-term solution. Thanks, Bernhard!

leofang

Appreciate for the help, @bernhardmgruber, LGTM!

takagi · 2024-10-04T05:19:46Z

/test full

takagi

LGTM!

Use custom less instead of specializing thrust

takagi · 2024-10-09T07:51:24Z

Thanks @bernhardmgruber!

bernhardmgruber marked this pull request as ready for review July 31, 2024 13:57

bernhardmgruber mentioned this pull request Jul 31, 2024

[BUG]: Latest CCCL breaks CuPy NVIDIA/cccl#2109

Closed

1 task

jrhemstad reviewed Jul 31, 2024

View reviewed changes

bernhardmgruber force-pushed the thrust_less branch from 07ec664 to 71d712b Compare July 31, 2024 23:01

bernhardmgruber changed the title ~~Specialize thrust::less instead of its operator()~~ Use custom less instead of specializing thrust Jul 31, 2024

takagi self-assigned this Aug 1, 2024

takagi added cat:enhancement Improvements to existing features prio:medium to-be-backported Pull-requests to be backported to stable branch labels Aug 1, 2024

takagi added this to the v14.0.0a1 milestone Aug 1, 2024

This comment has been minimized.

Sign in to view

bernhardmgruber force-pushed the thrust_less branch 3 times, most recently from 543f47b to d15a26e Compare August 1, 2024 20:22

bernhardmgruber force-pushed the thrust_less branch 2 times, most recently from 68ece90 to dd145ae Compare August 9, 2024 14:52

bernhardmgruber force-pushed the thrust_less branch from dd145ae to 323802d Compare August 12, 2024 08:16

bernhardmgruber force-pushed the thrust_less branch from 323802d to 6d68ccf Compare August 13, 2024 07:21

kmaehashi removed this from the v14.0.0a1 milestone Aug 19, 2024

bernhardmgruber force-pushed the thrust_less branch from 6d68ccf to 19f1fcb Compare August 19, 2024 13:33

bernhardmgruber force-pushed the thrust_less branch from 19f1fcb to 4141f4a Compare August 19, 2024 13:56

Copy link

Contributor

takagi commented Aug 20, 2024

/test full

leofang approved these changes Sep 8, 2024

View reviewed changes

takagi added this to the v14.0.0a1 milestone Oct 4, 2024

takagi approved these changes Oct 9, 2024

View reviewed changes

takagi merged commit d1c4af0 into cupy:main Oct 9, 2024
61 checks passed

chainer-ci pushed a commit to chainer-ci/cupy that referenced this pull request Oct 9, 2024

Merge pull request cupy#8446 from bernhardmgruber/thrust_less

8147548

Use custom less instead of specializing thrust

chainer-ci mentioned this pull request Oct 9, 2024

[backport] Use custom less instead of specializing thrust #8653

Merged

bernhardmgruber deleted the thrust_less branch October 10, 2024 12:04

bernhardmgruber mentioned this pull request Dec 19, 2024

Use a custom Min/Max instead of specializing CUB #8846

Merged

kmaehashi mentioned this pull request Feb 26, 2025

[v13] Bump version to v13.4.0 #8993

Merged

kmaehashi mentioned this pull request Jun 26, 2025

[v13] Bump version to v13.5.0 #9212

Merged

Uh oh!

Use custom less instead of specializing thrust #8446

Use custom less instead of specializing thrust #8446

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!