ENH: Fast paths for richcompare #17970


Closed · ganesh-k13 wants to merge 13 commits from the enh_14415_fast_compare branch

Conversation

@ganesh-k13 (Member) commented Dec 9, 2020

Changes:

  1. Added a check to see if the RHS is a Python number, to prevent the creation of an array.
  2. Added a benchmark (sketched below).
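For context, the benchmark times scalar == Python-number comparisons across dtypes. A minimal sketch of what such an asv benchmark could look like; the class and method names come from the output below, while the body and iteration count are assumptions:

    import numpy as np

    class CustomScalarEqual:
        # Parametrized over the dtypes listed in the results below.
        params = [np.int8, np.int16, np.int32, np.int64,
                  np.uint8, np.uint16, np.uint32, np.uint64,
                  np.float32, np.float64]
        param_names = ["dtype"]

        def setup(self, dtype):
            self.x = dtype(1)

        def time_scalar_equality(self, dtype):
            # Repeatedly compare a numpy scalar against a Python int.
            for _ in range(10000):
                self.x == 1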

Benchmarks:
                                                                                                                   
********************************************************************************                                                                          
WARNING: you have uncommitted changes --- these will NOT be benchmarked!                                                                                  
********************************************************************************                                                                          
· Creating environments                                                      
· Discovering benchmarks                                                     
·· Uninstalling from virtualenv-py3.8-Cython                                 
·· Building 933620b4  for virtualenv-py3.8-Cython..................................                                                                                                                                                                                                          
·· Installing 933620b4  into virtualenv-py3.8-Cython.                                                                             
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)                                                                                  
[  0.00%] · For numpy commit 4d290795  (round 1/2):                                                                                               
[  0.00%] ·· Building for virtualenv-py3.8-Cython.                                                                                                                                                                                                                                                                   
[  0.00%] ·· Benchmarking virtualenv-py3.8-Cython                                                                                                         
[ 25.00%] ··· Running (bench_ufunc.CustomScalarEqual.time_scalar_equality--).                                                                             
[ 25.00%] · For numpy commit 933620b4  (round 1/2):                                                                               
[ 25.00%] ·· Building for virtualenv-py3.8-Cython.                                                                                                        
[ 25.00%] ·· Benchmarking virtualenv-py3.8-Cython                                                                                                         
[ 50.00%] ··· Running (bench_ufunc.CustomScalarEqual.time_scalar_equality--).                                                                             
[ 50.00%] · For numpy commit 933620b4  (round 2/2):                                                                               
[ 50.00%] ·· Benchmarking virtualenv-py3.8-Cython                                                                                                         
[ 75.00%] ··· bench_ufunc.CustomScalarEqual.time_scalar_equality                                                                                                              ok                                                                                                                                     
[ 75.00%] ··· =============== ==========                                     
                   dtype                                                                                                                                  
              --------------- ----------                                     
                 numpy.int8    439±6ms                                       
                numpy.int16    445±8ms                                       
                numpy.int32    435±10ms                                      
                numpy.int64    147±2ms                                       
                numpy.uint8    427±10ms                                      
                numpy.uint16   444±8ms                                       
                numpy.uint32   437±10ms                                      
                numpy.uint64   447±4ms                                       
               numpy.float32   471±10ms                                      
               numpy.float64   180±4ms                                                                                                                    
              =============== ==========                                                                                                                  
                                                                             
[ 75.00%] · For numpy commit 4d290795  (round 2/2):                                                                                               
[ 75.00%] ·· Building for virtualenv-py3.8-Cython.                                 
[ 75.00%] ·· Benchmarking virtualenv-py3.8-Cython                                                                                                         
[100.00%] ··· bench_ufunc.CustomScalarEqual.time_scalar_equality                                                                                                              ok                                                                                                                                     
[100.00%] ··· =============== ============                                   
                   dtype                                                     
              --------------- ------------                                   
                 numpy.int8     1.35±0s                                      
                numpy.int16    1.34±0.01s                                    
                numpy.int32    1.34±0.02s                                    
                numpy.int64     147±3ms                                      
                numpy.uint8     1.34±0s                                      
                numpy.uint16    1.36±0s                                      
                numpy.uint32    1.35±0s                                      
                numpy.uint64   1.54±0.01s                                    
               numpy.float32    1.54±0s                                      
               numpy.float64    184±4ms                                      
              =============== ============                                   

       before           after         ratio                                  
     [4d290795]       [933620b4]                                             
                                                                                                                          
-      1.34±0.01s          445±8ms     0.33  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                               
-         1.36±0s          444±8ms     0.33  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                              
-      1.34±0.02s         435±10ms     0.33  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                               
-         1.35±0s          439±6ms     0.33  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                                
-         1.35±0s         437±10ms     0.32  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                              
-         1.34±0s         427±10ms     0.32  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                               
-         1.54±0s         471±10ms     0.31  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                             
-      1.54±0.01s          447±4ms     0.29  bench_ufunc.CustomScalarEqual.time_scalar_equality()                                                                                                                                                                                              

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.                                  
PERFORMANCE INCREASED.                                                       

resolves: #14415

cc: @seberg , @eric-wieser , @mattip

@seberg (Member) commented Dec 9, 2020

Thanks, I will have to think about it more. I do guess that it might be nice to intercept some of this even earlier, but I do not mind injecting a fast-path there if it is practical and doesn't add much complexity.

One thing to think about: .item() has two awkward cases:

  1. Low precision floating point numbers are upcast to float64, I expect that is fine, but did not think it through.
  2. High precision floating point numbers (longdouble, clongdouble) are unchanged, that means the fast-path is definitely broken for them. Maybe that is the test-failure, maybe not.

(To be fair, the second part is already broken in some similar paths, but this would break it more/worse.)
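A quick illustration of the two cases (a sketch; longdouble's .item() behavior is as described above):

    import numpy as np

    # Case 1: low-precision floats come back as a Python float (a C double):
    print(type(np.float16(1.5).item()))     # <class 'float'>
    print(type(np.float32(1.5).item()))     # <class 'float'>

    # Case 2: longdouble is returned unchanged, so an .item()-based
    # fast path never reaches a native Python type here:
    print(type(np.longdouble(1.5).item()))  # <class 'numpy.longdouble'>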

@ganesh-k13 (Member, Author)

Thanks for the review @seberg , will test it out 👍

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request Dec 10, 2020
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from fcba612 to 3d1f4cd Compare December 10, 2020 10:16
@ganesh-k13 (Member, Author) commented Dec 10, 2020

I tried something like this:

* #Name = Long*11, Float*3,,,#
* #fastpath = 1*14, 0*3#

.....

    case -2:
        /* use ufunc */
        if (PyErr_Occurred()) {
            return NULL;
        }
#if @fastpath@
        if (PyArray_IsPythonNumber(other)) {
            /* Fast path: compare via the equivalent Python scalar. */
            PyObject *scalar, *result;
            scalar = Py@Name@_From@Name@(arg1);
            if (scalar == NULL) {
                return NULL;
            }
            result = PyObject_RichCompare(scalar, other, cmp_op);
            Py_DECREF(scalar);
            if (result == NULL) {
                return NULL;
            }
            out = PyObject_IsTrue(result);
            Py_DECREF(result);
            if (out == -1) {
                return NULL;
            }
            goto done;
        }
#endif
        return PyGenericArrType_Type.tp_richcompare(self, other, cmp_op);

But it fails for np.int64(-1) == 18446744073709551615 because arg1 becomes -1. Even if I replace the code with .item I get the same; is there anything I can do differently here?

[EDIT]
This is where the code above goes in (I had pasted power code before, my bad):

    case -2:
        /* use ufunc */
        if (PyErr_Occurred()) {
            return NULL;
        }
        return PyGenericArrType_Type.tp_richcompare(self, other, cmp_op);
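The failure boils down to the signedness of the C-level value: the unsigned maximum and -1 share one bit pattern. A sketch at the Python level (behavior as of this discussion; newer NumPy deprecates out-of-bounds construction such as np.uint64(-1)):

    import numpy as np

    u = np.uint64(18446744073709551615)  # all 64 bits set, i.e. uint64(-1)
    print(int(u))                        # 18446744073709551615

    # Reinterpreting the same bits as signed gives -1, which is what a
    # fast path built on Py@Name@_From@Name@(arg1) with a C long sees:
    print(u.astype(np.int64))            # -1
    print(-1 == 2**64 - 1)               # False, the fast path's wrong answer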

@seberg (Member) commented Dec 10, 2020

Ah, right, I am not sure you need the * #fastpath = 1*14, 0*3# addition, because longdouble takes the -3 case?

Why are you converting the obj to the scalar instead of the other way around? Your old code with item = gentype_generic_method(self, NULL, NULL, "item"); made more sense to me (albeit it should be replaced with the correct C-side incarnation).

It all looks weird, but when other is an arbitrary Python object, and the numpy scalar is not longdouble/clongdouble, we actually currently already do:

  1. Call ufuncs
  2. Convert both into an array
    • The second object will be an object dtype array
  3. Cast the original scalar (now an array) to object so that it has the same type as the other
    • This cast, for the main builtin numerical types, actually does arr.item() and not just arr[()].

So in the end we already operate on Python object types, just in a very slow and convoluted way.
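In rough Python pseudocode, the current path does something like this (a sketch of the steps above, not the actual C control flow):

    import numpy as np

    def slow_richcompare(scalar, other):
        # 1. The comparison ufunc is called, which
        # 2. converts both operands to arrays; the arbitrary Python
        #    object becomes a 0-d object-dtype array,
        a = np.asarray(scalar)
        b = np.asarray(other, dtype=object)
        # 3. and the scalar's array is cast to object dtype to match;
        #    for the builtin numeric types this cast effectively does
        #    a.item() rather than a[()].
        a = a.astype(object)
        # The comparison therefore happens between Python objects anyway.
        return a == b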

@ganesh-k13 (Member, Author) commented Dec 10, 2020

Oh I see, yeah, I missed the _convert2_to_ctypes logic. gentype_generic_method was static to the file, hence I needed the conversion to a scalar to call its richcompare; hence the obj-to-scalar direction.

Given that the original commit is showing significant gains in performance, I'll dig in a bit to see where we are slowing down.

albeit it should be replaced with the correct C-side incarnation

Yeah, getting the item the C way, probably by moving the previous code up a function call, should do it. Will do so after reading a bit.

@seberg (Member) commented Dec 10, 2020

Ohh, I think you are testing mixed types? I.e. one of them is an arbitrary type, the other is always a default integer (probably C long?)... I.e. int8 + int64 is super slow, but int64 + int8 is fine.

IIRC those paths were messed up, because we did not have a correct "hierarchy" in the code. I tried to clean it up, but it turned into a pretty big PR and I never opened it; I should probably reactivate it. The problem is that what we should be doing is:

if other.dtype.type_num > self.dtype.type_num:
    return NotImplemented

(At least that seemed like the easiest solution.) Right now I think what happens is that if the first dtype is the smaller type, we will always end up in the super-slow path. With the return NotImplemented fix, we at least only end up in the super-slow path if (a + b).dtype is not in (a.dtype, b.dtype), which is much better.
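Concretely, the condition can be read off the promotion rules (a sketch using np.promote_types):

    import numpy as np

    # Promotion keeps one of the input dtypes, so with the
    # NotImplemented fix a single deferral suffices:
    print(np.promote_types(np.int8, np.int64))    # int64

    # Here the promoted dtype is neither input, so the super-slow
    # path would still be taken:
    print(np.promote_types(np.int64, np.uint64))  # float64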

More complex or generic schemes seemed like over-engineering to me, as they would have to duplicate a lot of ufunc logic which is likely not going to end up very light-weight in any case.

I would love it if you can find a more minimal fix, but in many ways I think this code requires an overhaul :(. Which means rewriting a bunch of it.

@ganesh-k13 (Member, Author)

I started with your branch and resolved the conflicts (not logical conflicts, as it does not compile yet :)) here: ganesh-k13#1.

I'll be happy to resume work on it and finish it if you feel it helps this part of the code.

@seberg (Member) commented Dec 10, 2020

I feel it would help a lot, but I don't quite remember the state :). This code hasn't been touched in more than a decade, at least not seriously. It is super convoluted, and I at least think my branch made the logic quite a bit easier to grasp (and probably actually fixes a few bugs).

@ganesh-k13 (Member, Author)

Ohh I see. Cool yeah, I'll work on that branch. Thanks.

@ganesh-k13 ganesh-k13 marked this pull request as draft December 10, 2020 17:00
@ganesh-k13 (Member, Author)

Hey @seberg, I leveraged some stuff from the branch and added a fast path in the -1 case, where it's safe to cast. Let me know your thoughts on this.

@ganesh-k13 (Member, Author)
       before           after         ratio
     [4d290795]       [3115c165]
     <master>         <enh_14415_fast_compare>
-      23.7±0.1μs      11.7±0.04μs     0.49  bench_scalar.ScalarMath.time_compare('int32')
-      23.9±0.3μs       11.7±0.2μs     0.49  bench_scalar.ScalarMath.time_compare('int16')
-      25.5±0.3μs       12.4±0.2μs     0.49  bench_scalar.ScalarMath.time_compare('complex64')
-     25.5±0.05μs      11.9±0.09μs     0.47  bench_scalar.ScalarMath.time_compare('float32')
-      26.1±0.4μs      11.9±0.05μs     0.45  bench_scalar.ScalarMath.time_compare('float16')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request Dec 16, 2020
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch 2 times, most recently from 51aeb7c to 52e8085 Compare December 16, 2020 16:48
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from d945e26 to dc18734 Compare December 21, 2020 04:33
@ganesh-k13 (Member, Author)

Took some time to repro the issue; we hit a NULL pointer dereference (Py_DECREF on NULL) in complex number cases. Will fix it.

@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from 3055f37 to e9a69eb Compare December 25, 2020 18:13
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from e9a69eb to 08dccbb Compare January 11, 2021 12:47
@ganesh-k13 (Member, Author)

Hi @seberg, sorry for the long pause in the PR.
I noticed the following test case causing a crash in a specific environment; I think handling the return value of self_descr for overflow should take care of it.
Now to the overflow:

    types = np.sctypes['uint']
    for T in types:
        assert_equal(iinfo(T).max, T(-1))

Is it defined behaviour to do np.uint(-1)?

@charris (Member) commented Jan 11, 2021

Is it defined behaviour to do np.uint(-1)?

I think so, although it may not be formalized. C allows it, but maybe we should make an explicit choice.

@seberg (Member) commented Jan 11, 2021

Let's say that it was once explicitly implemented that uint(-1) will work, so for now I will say yes. For comparisons it would be best if this can't happen (the only problematic thing is probably uint64 == int64 using float and losing precision, though).

It is something that we could think about changing, but I am not sure it is worth the trouble, and how annoying the workaround is (since using -1 for maxint is not uncommon, even if technically undefined by the C standard).
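The precision-loss case mentioned above, sketched in Python (results as under the promotion rules at the time, where uint64 with int64 promotes to float64):

    import numpy as np

    print(np.promote_types(np.uint64, np.int64))    # float64

    # float64 cannot represent every 64-bit integer, so nearby values
    # could compare equal after promotion:
    print(np.uint64(2**63) == np.int64(2**63 - 1))  # True, due to rounding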

BUG: handled ret and overflow

BUG: Fixed crash in do_richcompare_on_scalars | Added complex check on fastpath

BUG: Added scalar check for self

MAINT: Refactor PyLong descr creation

MAINT: Moved header location

MAINT: Added XDECREF | Refactor complex equality check

BUG: Fixed error in complex check logic
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from 2d87920 to 13e9484 Compare February 10, 2021 09:03
@ganesh-k13 (Member, Author)

Updated bench for changes:

       before           after         ratio
     [f6a71fea]       [13e94849]
     <master>         <enh_14415_fast_compare>
+         455±2ns         545±60ns     1.20  bench_scalar.ScalarMath.time_abs('longfloat')
+         490±7ns        585±100ns     1.19  bench_scalar.ScalarMath.time_addition('int64')
+         539±6ns         579±20ns     1.07  bench_scalar.ScalarMath.time_addition('complex256')
+       538±0.6ns         571±20ns     1.06  bench_scalar.ScalarMath.time_addition('longfloat')
+         485±1ns          513±6ns     1.06  bench_scalar.ScalarMath.time_multiplication('float64')
+         483±5ns         511±20ns     1.06  bench_scalar.ScalarMath.time_addition('float64')
+       456±0.7ns          482±3ns     1.06  bench_scalar.ScalarMath.time_abs('complex128')
+         489±6ns          515±5ns     1.05  bench_scalar.ScalarMath.time_multiplication('int64')
-      82.9±0.6μs       54.8±0.5μs     0.66  bench_scalar.ScalarMath.time_compare_types('float64')
-         128±4μs         75.4±2μs     0.59  bench_scalar.ScalarMath.time_compare_types('int64')
-         152±7μs         76.3±2μs     0.50  bench_scalar.ScalarMath.time_compare_types('int32')
-         117±8μs       58.7±0.2μs     0.50  bench_scalar.ScalarMath.time_compare_types('complex64')
-         161±4μs         77.5±1μs     0.48  bench_scalar.ScalarMath.time_compare_types('int16')
-         129±2μs       59.1±0.4μs     0.46  bench_scalar.ScalarMath.time_compare_types('float32')
-     19.6±0.05μs       8.61±0.1μs     0.44  bench_scalar.ScalarMath.time_compare('int32')
-      20.0±0.4μs       8.36±0.2μs     0.42  bench_scalar.ScalarMath.time_compare('int16')
-      21.7±0.5μs      8.94±0.03μs     0.41  bench_scalar.ScalarMath.time_compare('complex64')
-      21.8±0.1μs       8.73±0.2μs     0.40  bench_scalar.ScalarMath.time_compare('float32')

@seberg seberg self-requested a review February 10, 2021 15:34
    is_complex_operands = (PyTypeNum_ISCOMPLEX(self_descr->type_num) ||
                           PyTypeNum_ISCOMPLEX(other_descr->type_num));
    is_equality_operator = (cmp_op != Py_EQ && cmp_op != Py_NE);
    is_python_scalar = PyDescr_PythonCRepresentable(self_descr) ||
@ganesh-k13 (Member, Author):

I think less confusing names would be is_not_* and PyDescr_Not*; will fix. The aim is basically to check whether the value can be represented as a native C type that Python allows.

Member:

No need for the "C" probably, we are aiming for converting (safely) to a native Python type?

@ganesh-k13 (Member, Author):

Yes indeed; also, the PyArray_Is* functions that are already present do not have a proper check for native Python types.

@ganesh-k13 (Member, Author) commented Feb 10, 2021:

The problem being that, if we cannot represent the value as a PyObject (via a C-Python API), the richcompare seems to cause problems, long double being one of the victims.
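At the Python level, the distinction being drawn is roughly the following (a sketch; PyDescr_PythonCRepresentable is this PR's own helper, and the last result depends on the platform's long double):

    import numpy as np

    # Values of these types round-trip through a native Python scalar:
    print(type(np.int32(7).item()))      # <class 'int'>
    print(type(np.float64(1.5).item()))  # <class 'float'>

    # longdouble may carry more precision than a Python float, so there
    # is no lossless native representation to fast-path through:
    x = np.longdouble(1) / np.longdouble(3)
    print(float(x) == x)                 # False where long double > double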

@ganesh-k13 (Member, Author) commented Feb 10, 2021

Also, I found a SIGABRT in the bench where I added a check for a type compared with all other dtypes here, but no crash in the UTs. In particular, it means our tests never compare a clongdouble with an int. I feel we must add this test somewhere. Any good location to add this?

EDIT: I fixed this case by adding the 1 or 0 check for the return of descr_from_basic_scalar in this PR.
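A minimal sketch of such a regression test (name, placement, and values are assumptions, not the PR's actual test):

    import numpy as np

    def test_clongdouble_int_richcompare():
        # Compare an extended-precision complex scalar against a plain
        # Python int, on both sides of the operator.
        c = np.clongdouble(2 + 0j)
        assert c == 2
        assert 2 == c
        assert np.clongdouble(1j) != 2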

@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from 13e9484 to ae8d903 Compare February 12, 2021 06:04
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from 9bcb421 to 5a6cc59 Compare February 15, 2021 13:19
@ganesh-k13 ganesh-k13 force-pushed the enh_14415_fast_compare branch from 5a6cc59 to 3b6ace3 Compare February 16, 2021 04:06
@ganesh-k13 (Member, Author)

Hey @seberg, let me know any other changes needed here. Thanks.

@ganesh-k13 (Member, Author)

I think the main pending items from my side are the tests, overflow, and a few other combinations; will add them.

Base automatically changed from master to main March 4, 2021 02:05
@seberg (Member) commented Mar 7, 2021

Thanks @ganesh-k13, I will prioritize this a bit soon. gh-18548 also makes this a bit timely; although it is probably somewhat orthogonal, it struggles with similar problems...

@seberg (Member) commented Mar 10, 2021

For myself, maybe it does not matter: but if I get it right, doing this would basically cement the behaviour in gh-10322 for scalars. I really hope we can get away with fixing gh-10322 though, because if we can't, the whole "weak python scalar" idea might break down. And if it breaks down, I am left in shambles of confusion, because there is no likely chance of a consistent future :(.

@ganesh-k13 (Member, Author)

Yeah Sebastian, I liked this idea from that discussion:

It does seem to me array scalars should always be treated like arrays.

Though this PR does seem to give a massive boost to scalars, it does feel like a patch-up rather than something actually intended by design. Let's have a wider discussion on the future of scalars to decide a more solid, or rather uniform, way of handling them.

@seberg (Member) commented Nov 3, 2022

Going to close, since the merge conflicts will be large now. There are still things that this tries to speed up that have not been sped up in the meantime (e.g. 1j == np.float64(1)), but a few other paths were improved anyway.
Hope you don't mind; if you do, just reopen :).

@seberg seberg closed this Nov 3, 2022

Successfully merging this pull request may close these issues.

Comparisons between numpy scalars and python scalars are very slow
3 participants