8000 Use custom less instead of specializing thrust by bernhardmgruber · Pull Request #8446 · cupy/cupy · GitHub
[go: up one dir, main page]

Skip to content

Use custom less instead of specializing thrust #8446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 9, 2024

Conversation

bernhardmgruber
Copy link
Contributor
@bernhardmgruber bernhardmgruber commented Jul 31, 2024

The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior.

I could not figure out how to run the unit tests covering the C++ part from your documentation. pip install -e . ran successfully, but python -m pytest shows some errors due to missing packages that I failed to install. I also have to admit, that my Python skills a horribly limited :S

Besides, there are some further improvements that could be made to cupy_thrust.cu, which depend on your minimum C++ standard and CCCL/rocthrust version. E.g. make_zip_iterator does not require make_tuple anymore.
As a drive-by improvement, I dropped make_tuple when calling make_zip_iterator, which is no longer required.

@bernhardmgruber bernhardmgruber marked this pull request as ready for review July 31, 2024 13:57
#endif
bool thrust::less<float>::operator() (
const float& lhs, const float& rhs) const {
struct thrust::less<float> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: I would think that we'd consider adding user-defined specializations for thrust::less<float> to still be invalid.

For example, I believe in thrust::sort we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).

If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

Yes this sounds good to me.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I suspected, calling thrust::sort(..., thrust::less<float>()) with a custom specialization of thrust::less<float> will never invoke that specialization. It dispatches to the RadixSort code path.

So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.

https://godbolt.org/z/scezav1Yv

Same story for thrust::stable_sort: https://godbolt.org/z/n56EPbo3Y

TL;DR: This is all the more reason that specializing thrust::less is incorrect and cupy needs to define it's own custom cupy::less operator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cupy is just getting lucky that CUB's RadixSort happens to order NaNs in a way that matches whatever their tests expect.

This is actually not true. Let's discuss this example offline tomorrow when Bernhard is up:

>>> a = cp.random.random((3, 5), dtype=cp.float32)
>>> cp.sort(a, axis=-1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: I would think that we'd consider adding user-defined specializations for thrust::less<float> to still be invalid.

Yes we do. But continuing to provide a specialization for thrust::less seemed the least different change and it would at least work with the planned changes in CCCL.

For example, I believe in thrust::sort we dispatch to a merge sort vs radix sort implementation based on the operator. If the operator is a well known operator for a well known type (like thrust::less).

If a user provides their own specialization of that operator, then that dispatch would no longer be valid because the radix sort code path wouldn't actually use their operator specialization.

This is what's probably happening here.

TL;DR: I think the easiest solution here is to change this code from using and specializing thrust::less directly and just add a custom comparison type (like cupy::less) and pass that into the thrust::sort call.

I had to do a more complicated version, because we only want a custom less function object for the cases where the comparison semantics are changed. For integers, we would still want to use thrust::less to benefit, e.g., from radix sort.

@leofang
Copy link
Member
leofang commented Jul 31, 2024

Thanks, @bernhardmgruber! Appreciate for the help!

I could not figure out how to run the unit tests covering the C++ part from your documentation. pip install -e . ran successfully, but python -m pytest shows some errors due to missing packages that I failed to install. I also have to admit, that my Python skills a horribly limited :S

Not sure what kind of errors that you ran into, care to share the full output?

This is how we can quickly spin up a working environment as of today (including creating a working CUDA env):

conda create -n cupy_cuda122_dev -y python=3.12 "numpy<2" "cython<3" fastrlock git pytest cuda cuda-version=12.2
conda activate cupy_cuda122_dev
git clone --recursive https://github.com/cupy/cupy.git
cd cupy
export CUPY_NVCC_GENERATE_CODE="arch=compute_86,code=sm_86;"  # update to limit the build for your GPUs; a list of arch can be supplied with semi-colons
export CUPY_NUM_BUILD_JOBS=16
git submodule update --init --recursive
pip install --no-deps --no-build-isolation -v -e .
CUDA_PATH=$CONDA_PREFIX/targets/x86_64-linux/ pytest tests/cupy_tests/sorting_tests/ -x  # -x to fail on first error

@bernhardmgruber
Copy link
Contributor Author
bernhardmgruber commented Jul 31, 2024

This is how we can quickly spin up a working environment as of today (including creating a working CUDA env):

...

I found out in the meantime that pip install -e . compiles the code, so that helped! If I have to track down any CI errors, I will certainly be glad to have your instructions!

@bernhardmgruber bernhardmgruber changed the title Specialize thrust::less instead of its operator() Use custom less instead of specializing thrust Jul 31, 2024
@takagi takagi self-assigned this Aug 1, 2024
@takagi takagi added cat:enhancement Improvements to existing features prio:medium to-be-backported Pull-requests to be backported to stable branch labels Aug 1, 2024
@takagi
Copy link
Contributor
takagi commented Aug 1, 2024

/test full

@takagi takagi added this to the v14.0.0a1 milestone Aug 1, 2024
@leofang

This comment has been minimized.

@bernhardmgruber
Copy link
Contributor Author

I tried running that to setup an environment

conda create -n cupy_cuda122_dev -y python=3.12 "numpy<2" "cython<3" fastrlock git pytest cuda cuda-version=12.2

but I get an error message that nothing provides __win, which sounds weird given I am running on Ubuntu 24.04.

I could setup the packages manually outside conda. However, running the tests like:

pytest tests/cupy_tests/sorting_tests/ -x

starts executing, but crashes my display driver after several seconds :S I am investigating.

@bernhardmgruber bernhardmgruber force-pushed the thrust_less branch 3 times, most recently from 543f47b to d15a26e Compare August 1, 2024 20:22
Copy link
Contributor
mergify bot commented Aug 7, 2024

This pull request is now in conflicts. Could you fix it @bernhardmgruber? 🙏

@kmaehashi
Copy link
Member

It seems there was a conflict after merging #8412. Could you resolve conflicts @bernhardmgruber?

I could setup the packages manually outside conda. However, running the tests like:

pytest tests/cupy_tests/sorting_tests/ -x

starts executing, but crashes my display driver after several seconds :S I am investigating.

Several CuPy tests (with @testing.slow annotation) try to allocate few GB memory, so it may be the cause of the crash. Please try skipping them by python -m "not slow" tests/cupy_tests/sorting_tests/ -x.

@bernhardmgruber bernhardmgruber force-pushed the thrust_less branch 2 times, most recently from 68ece90 to dd145ae Compare August 9, 2024 14:52
@bernhardmgruber
Copy link
Contributor Author

I rebased my changes and noticed that I had to readd the custom comparison operator for float and double. It seems that CUB's radix sort, after the CCCL update you did with #8412, no longer produces the desired order for NaNs.

I can run some more tests next week.

@bernhardmgruber
Copy link
Contributor Author

I rebased my changes and noticed that I had to readd the custom comparison operator for float and double. It seems that CUB's radix sort, after the CCCL update you did with #8412, no longer produces the desired order for NaNs.

It seems NaNs order correctly with CTK 12.2 in the absence of custom comparison operators when running pytest tests/cupy_tests/sorting_tests/test_sort.py -k "not slow".

With CTK 12.4 I got some weird errors about failing compilation here and there, so my tests may have been wrong.

I would propose to let the CI be the arbiter of truth now. I don't trust my local setup :D

@jakirkham
Copy link
Member

/test mini

@jakirkham
Copy link
Member

Seeing a bunch of test failures on CI. Here is one example:

____________________________ TestLexsort.test_nan2 _____________________________
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:844: in test_func	
    impl(*args, **kw)	
        args       = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)	
        dtype      = 'f'	
        dtypes     = 'efdFD'	
        impl       = <function TestLexsort.test_nan2 at 0x7f855ebc0ae0>	
        kw         = {'dtype': <class 'numpy.float32'>}	
        name       = 'dtype'	
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:363: in test_func	
    check_func(cupy_r, numpy_r)	
        accept_error = False	
        args       = (<cupy_tests.sorting_tests.test_sort.TestLexsort testMethod=test_nan2>,)	
        check_func = <function numpy_cupy_array_equal.<locals>.check_func at 0x7f855ebc0680>	
        check_sparse_format = True	
        contiguous_check = False	
        cupy_error = None	
        cupy_numpy_result_ndarrays = [(array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]), array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]))]	
        cupy_r     = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        cupy_result = (array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7]),)	
        impl       = <function TestLexsort.test_nan2 at 0x7f855ebc09a0>	
        kw         = {'dtype': <class 'numpy.float32'>}	
        mask       = None	
        masks      = [None]	
        name       = 'xp'	
        numpy_error = None	
        numpy_r    = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
        numpy_result = (array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2]),)	
        scipy_name = None	
        skip       = False	
        sp_name    = None	
        type_check = True	
/root/.local/lib/python3.11/site-packages/cupy/testing/_loops.py:667: in check_func	
    _array.assert_array_equal(x, y, err_msg, verbose, strides_check)	
        err_msg    = ''	
        strides_check = False	
        verbose    = True	
        x          = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        y          = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
/root/.local/lib/python3.11/site-packages/cupy/testing/_array.py:95: in assert_array_equal	
    numpy.testing.assert_array_equal(	
        err_msg    = ''	
        kwargs     = {}	
        strides_check = False	
        verbose    = True	
        x          = array([4, 8, 0, 7, 9, 3, 2, 6, 8, 7])	
        y          = array([6, 4, 5, 1, 8, 0, 7, 9, 3, 2])	
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

@kmaehashi
Copy link
Member

cupy_tests/sorting_tests/test_sort.py::TestLexsort::test_nan2 and cupy_tests/sorting_tests/test_sort.py::TestLexsort::test_nan3 look failing with all of CUDA 11.2/11.8/12.0/12.5.

@jakirkham
Copy link
Member

/test mini

@bernhardmgruber
Copy link
Contributor Author

Alright, I readded the specializations for float and double. The one failing test seems unrelated now:

FAILED cupyx_tests/scipy_tests/interpolate_tests/test_bspline.py::TestBSpline::test_bspline_derivative_jumps[_param_4_{c=[[-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1], [-1, 2, 0, -1]], extrapolate=False}]

My understanding is the following:

  • we would not need the specializations for float and double when radix sort is used, because it already orderes NaNs to the back. This probably holds true for half as well (not tested). So this PR may be a pessizimation for those types now, which will now use merge sort.
  • we do need the custom comparison for elem_less used by _lexsort, because it actually compares using the functor.

We have two options:

  1. Merge the current implementation (potentially pessimizing floating-point sorting). @leofang mentioned to me somewhere that he considers using CUB directly at some point, so then you would regain the perf of radix sort.
  2. Implement additional handling whether the comparison is needed for radix sort or for elem_less. Retains the current perf, but implicitly relies on the assumption that radix sort will order NaNs implicitly.

What would you prefer?

@jakirkham
Copy link
Member

^ @kmaehashi what do you think?

@jakirkham
Copy link
Member
jakirkham commented Aug 19, 2024

If we are unsure how to proceed here, would it make sense to defer this for the next release (IOW a later 13.x or simply 14.0.0) so we have more time to evaluate options (and possibly improve on them)?

@kmaehashi
Copy link
Member

Sorry for dropping the ball, I think the performance of FP sorting should be prioritized over working with NaNs "correctly" (aligning with NumPy's specification), but would love to have inputs from other forks @takagi @asi1024 @leofang.

would it make sense to defer this for the next release

I agree this is not a blocker for the next release.

@kmaehashi kmaehashi removed this from the v14.0.0a1 milestone Aug 19, 2024
The current way of specializing thrust::less::operator() is uncommon in C++ and not forward compatible to recent and future changes to CCCL. The proper way is to use a custom comparison object with the desired behavior.

For sorting floating-points, thrust::less is used directly, since then thrust uses radix sort, which sorts NaNs to the back.
@bernhardmgruber
Copy link
Contributor Author

I implemented approach 2. now. You should see the same performance as before.

@takagi
Copy link
Contributor
takagi commented Aug 20, 2024

/test full

@takagi
Copy link
Contributor
takagi commented Aug 20, 2024

I agree with the current change that delegates floating points sort to thrust that uses radix sort for that.

@kmaehashi
Copy link
Member

CIs are congested to prepare for the release this week. Let me cancel CI runs for this one 🙇 I will re-kick the CI afterwards.

@leofang
Copy link
Member
leofang commented Aug 26, 2024

Sorry for my late reply, slowly catching up after returning from time-off...

I think the performance of FP sorting should be prioritized over working with NaNs "correctly" (aligning with NumPy's specification),

IIRC during past array API meetings @seberg had noted that the NaN-sorting behavior must be consistent between sort & argsort, and choosing to sort them toward the end (i.e. following NumPy) is one such consistent choice but not the only one. I believe the note here is up-to-date: https://data-apis.org/array-api/2023.12/API_specification/sorting_functions.html

I implemented approach 2. now. You should see the same performance as before.

It seems the NaN discussion is now moot? 🙂

@leofang
Copy link
Member
leofang commented Aug 26, 2024

/test full

@leofang
Copy link
Member
leofang commented Aug 26, 2024

That said, one questions I raised (internally) is still unresolved: That Thrust uses radix sort under the hood is an implementation detail as far as CuPy is concerned. How do we proactively ensure that future Thrust updates would not break CuPy? This is a run-time behavior that needs an actual execution to confirm.

@seberg
Copy link
Contributor
seberg commented Aug 26, 2024

One thing was about top_k as a future function, which gives the K-largest or K-smallest values. If using a (partial) sort, that sort would be nice to formalize to one of:

  • undefined (always an OK choice if documented)
  • sort to end (ignore NaNs as much as possible, like a dataframe library would do it)
  • sort to front (equivalent to propagating NaNs)
    Sorting to end is consistent with top_k not propagating NaNs (dataframe).

@bernhardmgruber
Copy link
Contributor Author

It seems the NaN discussion is now moot? 🙂

The sorting order should be the same as before.

That Thrust uses radix sort under the hood is an implementation detail as far as CuPy is concerned. How do we proactively ensure that future Thrust updates would not break CuPy?

I see three options:

  1. Program against the CUB API directly, because then the sorting algorithm is your choice. That's a lot of restructuring, but probably better long-term anyway, since you can also use radix sort for your tuples using the right adaptors.
  2. Statically assert that the Thrust CUDA/CUB backend dispatches to radix sort: static_assert(thrust::cuda_cub::__smart_sort::can_use_primitive_sort<...>). However, this trait is internal. We could start a discussion of exposing it. This is the easy, but hacky route.
  3. Merge the PR as.

Mind that I cannot promise much time to work on any of these, unfortunately :S

@leofang
Copy link
Member
leofang commented Sep 8, 2024

One thing was about top_k as a future function, which gives the K-largest or K-smallest values.

xref: data-apis/array-api#722

I see three options:

I think 3 (merging as-is) is the way to go for now, and let's track 1 as a long-term solution. Thanks, Bernhard!

Copy link
Member
@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate for the help, @bernhardmgruber, LGTM!

@takagi
Copy link
Contributor
takagi commented Oct 4, 2024

/test full

@takagi takagi added this to the v14.0.0a1 milestone Oct 4, 2024
Copy link
Contributor
@takagi takagi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@takagi takagi merged commit d1c4af0 into cupy:main Oct 9, 2024
61 checks passed
chainer-ci pushed a commit to chainer-ci/cupy that referenced this pull request Oct 9, 2024
Use custom less instead of specializing thrust
@takagi
Copy link
Contributor
takagi commented Oct 9, 2024

Thanks @bernhardmgruber!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:enhancement Improvements to existing features prio:medium to-be-backported Pull-requests to be backported to stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0