BUG: handle introsort depth limit properly by juliantaylor · Pull Request #7871 · numpy/numpy · GitHub

BUG: handle introsort depth limit properly #7871


Merged: 2 commits merged from introsort2 into numpy:master on Jul 29, 2016

Conversation

juliantaylor (Contributor)

Each partition needs its own depth limit, else it grows faster than
logarithmic ...

@juliantaylor force-pushed the introsort2 branch 5 times, most recently from 7142aa5 to c173946 on July 26, 2016 19:32
@charris (Member) commented Jul 27, 2016

I've taken a closer look at this and I'm not convinced that it is correct. Note that the largest partition gets pushed on the stack so that the stack never grows rapidly; in fact it grows fastest if each partitioning exactly divides the working partition into two equal pieces. I think what you want to check is whether it is growing too slowly, which implies that the partitions on the stack are larger than desired. What am I missing?

Note that the stack was managed that way so it never got bigger than log2(n) recursions.
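For readers following along, here is a minimal standalone sketch of the scheme described above (hypothetical names and a trivial Lomuto partition on int, not NumPy's templated source): the larger half of every split is deferred onto an explicit stack and work continues on the smaller half, which is what keeps the number of pending partitions bounded by about log2(n).

#include <stddef.h>

#define QS_STACK 128        /* 64 pairs of bounds, plenty for 64-bit sizes */
#define SMALL_CUTOFF 16     /* below this, insertion sort is faster */

static void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi)
{
    for (ptrdiff_t i = lo + 1; i <= hi; i++) {
        int v = a[i];
        ptrdiff_t j = i - 1;
        while (j >= lo && a[j] > v) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = v;
    }
}

void quicksort_sketch(int *a, ptrdiff_t n)
{
    ptrdiff_t stack[QS_STACK];
    ptrdiff_t *sptr = stack;
    ptrdiff_t lo = 0, hi = n - 1;

    for (;;) {
        while (hi - lo > SMALL_CUTOFF) {
            /* trivial last-element pivot; the real code uses median-of-3 */
            int pivot = a[hi];
            ptrdiff_t p = lo;
            for (ptrdiff_t j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    int t = a[p]; a[p] = a[j]; a[j] = t;
                    p++;
                }
            }
            int t = a[p]; a[p] = a[hi]; a[hi] = t;

            if (p - lo < hi - p) {    /* right half is larger: defer it */
                *sptr++ = p + 1;
                *sptr++ = hi;
                hi = p - 1;           /* keep working on the smaller left half */
            }
            else {                    /* left half is larger: defer it */
                *sptr++ = lo;
                *sptr++ = p - 1;
                lo = p + 1;           /* keep working on the smaller right half */
            }
        }
        insertion_sort(a, lo, hi);

        if (sptr == stack) {
            break;                    /* nothing deferred: done */
        }
        hi = *(--sptr);
        lo = *(--sptr);
    }
}

Each push at least halves the working range, so a 64-pair stack covers any array a 64-bit index can address; this is the log2(n) bound referred to above.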

@charris (Member) commented Jul 27, 2016

Or you could push the smallest partition on the stack instead, making it grow too fast; no need for another stack in that case. If you want to duplicate the introsort pseudocode, just push the first partition on the stack along with the depth, no need for an extra stack. There are a lot of ways to do it...

@charris (Member) commented Jul 27, 2016

If you don't push the largest partition on the stack you will want a bigger stack. Using our own stack rather than straight recursion adds speed, so we should keep it, but it complicates introsort. Note that if you push the smallest partitions, a lot of them will already be sorted one-element partitions by the time you hit maximum depth.

@charris (Member) commented Jul 27, 2016

I think the thing to do is ditch the new stack, decrement the depth after the partition, and put the depth on the stack along with the largest partition, then reset the depth from the stack when the stack is popped. The depth popped will not always be the same as the actual stack depth, which is as it should be as it will vary with the pattern of pushes and pops.

EDIT: I'd also move the heapsort check inside the main while loop so that insertion sort gets called by preference.

@juliantaylor (Contributor, PR author)

It should be sufficient to only increase the depth counter for the large partition, as the small partition is always less than 50% of the data and thus not degraded.

@juliantaylor (Contributor, PR author)

To clarify: properly mirroring the recursive introsort would mean checking inside the while loop and incrementing the local cdepth counter on each iteration. The while loop is the tail-call-optimized small-partition path of the recursive quicksort, so it is bounded to log(N) stack space.
But it should not be necessary to count the small-partition steps, as we only want to count the maximum recursion depth, which is determined by the large-partition part.
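The recursive shape being mirrored looks roughly like the following sketch (hypothetical helper names; partition, heapsort_range and insertion_sort are assumed, e.g. as in the earlier sketch): each call carries its own remaining depth budget, the larger half goes into the recursive call, and the smaller half is the tail-call-optimized while loop.

#include <stddef.h>

#define SMALL_CUTOFF 16

/* assumed helpers (hypothetical, not NumPy's identifiers) */
ptrdiff_t partition(int *a, ptrdiff_t lo, ptrdiff_t hi);
void heapsort_range(int *a, ptrdiff_t lo, ptrdiff_t hi);
void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi);

static void introsort_rec(int *a, ptrdiff_t lo, ptrdiff_t hi, int depth_limit)
{
    while (hi - lo > SMALL_CUTOFF) {
        if (depth_limit <= 0) {
            /* partitioning degraded: finish this range in O(k log k) */
            heapsort_range(a, lo, hi);
            return;
        }
        depth_limit--;
        ptrdiff_t p = partition(a, lo, hi);

        if (p - lo < hi - p) {
            /* right half is larger: recurse on it, loop on the left */
            introsort_rec(a, p + 1, hi, depth_limit);
            hi = p - 1;
        }
        else {
            /* left half is larger: recurse on it, loop on the right */
            introsort_rec(a, lo, p - 1, depth_limit);
            lo = p + 1;
        }
    }
    insertion_sort(a, lo, hi);
}

void introsort(int *a, ptrdiff_t n)
{
    int limit = 0;
    for (ptrdiff_t m = n; m > 1; m >>= 1) {
        limit += 2;          /* depth limit of roughly 2*log2(n) */
    }
    introsort_rec(a, 0, n - 1, limit);
}

Counting the budget down inside the loop is the "proper mirroring" mentioned above; the PR only counts the large-partition steps, which is enough to bound the recursion depth.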

@pv (Member) commented Jul 27, 2016 via email

@charris (Member) commented Jul 27, 2016

Yes, the depth counter only matters when the depth is popped off the stack, which is why it should be decremented before pushing onto the stack. If the check is inside the loop, insertion sort will be called before heapsort except possibly after a stack pop. And please put everything on one stack; using a separate integer stack doesn't save much room.

EDIT: i.e., using the original introsort's decrementing depth.

@charris (Member) commented Jul 27, 2016

Maybe only check the depth immediately after the stack pop.

@juliantaylor (Contributor, PR author)

Hm, the depth check could indeed be put after the stack pop. But it is a rather minor optimization; the check is only true in rare, exceptional cases in the first place, so it shouldn't make much difference whether insertion sort or heapsort gets called on the small remaining data.

@charris (Member) commented Jul 27, 2016

> it shouldn't make much difference whether insertion sort or heapsort gets called on the small remaining data.

Agree, it would probably never be noticed.

@charris (Member) commented Jul 27, 2016

It does have the advantage that there is no goto and the while loop always runs to completion.

@juliantaylor (Contributor, PR author)

You still need the goto, as you have to skip the sorting of the already-heapsorted array and get to the next stack pop.
I also noticed the initialization of the depth stack is unnecessary; I'll update that later.
I don't like merging into one stack very much; it would then contain values interpreted as pointers and values interpreted as integers, which, while technically the same, does not really make the code nicer.

@charris (Member) commented Jul 27, 2016

OK, so in the original, just

            if (pi - pl < pr - pi) {
                *sptr++ = pi + 1;
                *sptr++ = pr;
                pr = pi - 1;
            }
            else {
                *sptr++ = pl;
                *sptr++ = pi - 1;
                pl = pi + 1;
            }
            *sptr++ = --depth;

and add

        if (sptr == stack) {
            break;
        }
        pr = *(--sptr);
        pl = *(--sptr);
        depth = *(--sptr);

Update the comment before the stack push and increase the stack size by 50, and I think that does it.

@charris (Member) commented Jul 27, 2016

> it would then contain values interpreted as pointers and values interpreted as integers

Oh, right. I missed that, forgot we were pushing pointers instead of indices. Could have done indices originally, I guess, but the pointers were a bit faster...

@charris (Member) commented Jul 27, 2016

Another option is to call heapsort rather than pushing the partition on the stack.

@juliantaylor (Contributor, PR author)

I have removed the initialization of the depth stack and added some new benchmarks.
I did not move the heapsort bail-out; having a sort after the stack-management code does not match the flow of the function.
Avoiding a small heapsort in a very rare case is also not worth adding a (cheap) check to all cases.

@@ -107,6 +111,7 @@ quicksort_@suff@(void *start, npy_intp num, void *NOT_USED)
*sptr++ = pi - 1;
pl = pi + 1;
}
*psdepth++ = cdepth + 1;
Member

I think this should be *psdepth++ = ++cdepth;. cdepth should be increasing up the stack.

Contributor Author

It's cdepth + 1; the other variables are also updated without in-place operators, so this is consistent.

Member

I'm pretty sure it is wrong. Every push on the stack should increment; depth has nothing to do with the actual stack depth here.

Member

Just as every recursion in the introsort pseudocode decrements depth.

Contributor Author

Every stack push does increase the depth; we are only skipping the increase for the shorter partition, as it is not necessary to keep track of it there.

The depth is never decreased; the only thing that happens is a switch to a different depth in a different "recursive" call tree.

Contributor Author (juliantaylor, Jul 27, 2016)

Doh, right, I moved this variable into a higher scope ...
It does not really matter for performance, as we don't really care about the absolute depth of the stack; we just want to cut off at some point.
But it is relevant for avoiding overflow of the stack, which is bounded to 100 elements and could overflow for arrays larger than 2^50 elements, so I'll change it.
Sorry for being so dense here.

Contributor Author

Hm, the overflow can still happen, as we aren't checking the depth in the small partition, so let's just increase the stack from 100 to 128 (as each level adds two pointers).
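For scale, the arithmetic behind the 100 -> 128 bump: each deferred partition costs two pointers on the stack, and because every pushed partition is at most half the one it was split from, an n-element array needs at most about log2(n) pending partitions at once. A 100-entry stack therefore holds 50 pairs and is only safe up to roughly 2^50 elements, while 128 entries hold 64 pairs, enough for any 64-bit npy_intp.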

Member

Sounds good.

Member

Still need the ++ though.

Contributor Author

Yes, I updated the code; it still has the goto though.
I'll sleep on it; let's see how I like it tomorrow :)

@juliantaylor (Contributor, PR author)

I'll switch it to counting the depth down from the limit; I initially counted up because I misguidedly thought I had to initialize the depth stack (memset(0) vs. a for loop).
Then the pseudocode of the recursive implementation matches the iterative variant, so I'll add that as a comment, which should help clear up how this works.
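Counting down from the limit, the iterative variant being converged on here looks roughly like this (a sketch with hypothetical names and the same assumed helpers as above, not the merged NumPy source): the remaining budget is pushed next to each deferred larger partition and restored when that partition is popped, and a popped partition whose budget is exhausted is handed to heapsort.

#include <stddef.h>

#define QS_STACK 128
#define SMALL_CUTOFF 16

/* assumed helpers (hypothetical, not NumPy's identifiers) */
ptrdiff_t partition(int *a, ptrdiff_t lo, ptrdiff_t hi);
void heapsort_range(int *a, ptrdiff_t lo, ptrdiff_t hi);
void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi);

void introsort_iter(int *a, ptrdiff_t n)
{
    ptrdiff_t stack[QS_STACK];       /* two entries per deferred partition */
    ptrdiff_t *sptr = stack;
    int depth[QS_STACK / 2];         /* one budget value per deferred partition */
    int *psdepth = depth;
    ptrdiff_t lo = 0, hi = n - 1;
    int cdepth = 0;

    for (ptrdiff_t m = n; m > 1; m >>= 1) {
        cdepth += 2;                 /* depth limit of roughly 2*log2(n) */
    }

    for (;;) {
        if (cdepth < 0) {
            /* this branch has used up its budget: degrade to heapsort */
            heapsort_range(a, lo, hi);
            goto stack_pop;
        }
        while (hi - lo > SMALL_CUTOFF) {
            ptrdiff_t p = partition(a, lo, hi);
            if (p - lo < hi - p) {   /* defer the larger right half */
                *sptr++ = p + 1;
                *sptr++ = hi;
                hi = p - 1;
            }
            else {                   /* defer the larger left half */
                *sptr++ = lo;
                *sptr++ = p - 1;
                lo = p + 1;
            }
            *psdepth++ = --cdepth;   /* the deferred partition is one level deeper */
        }
        insertion_sort(a, lo, hi);

stack_pop:
        if (sptr == stack) {
            break;                   /* no deferred partitions left */
        }
        hi = *(--sptr);
        lo = *(--sptr);
        cdepth = *(--psdepth);       /* resume that branch's remaining budget */
    }
}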

@charris (Member) commented Jul 28, 2016

I'm thinking that instead of npy_get_msb, make the stack of size_t type, initialize with double the count, and shift right on each pass instead of decrementing. That solves the overflow problem and is also more flexible.

npy_intp depth_limit = npy_get_msb(num) * 2;
int depth[PYA_QS_STACK];
int * psdepth = depth;
int cdepth = NPY_SORT_MIN(PYA_QS_STACK, npy_get_msb(num) * 2);
Member

I don't think * 2 is needed here; just adjust the cutoff point. Then you don't need the NPY_SORT_MIN either, especially as it is now 64. I suppose you could also use NPY_BITSOF_INTP.

Member

NVM. However, the stack can't fill up, so just

int cdepth = npy_get_msb(num) * 2;

Contributor Author

The factor of 2 is just an arbitrary increase (which should be documented) to give the algorithm some slack on non-ideal data. The only thing that needs to be done is to prevent the depth from becoming almost linear in the number of elements, so as long as the limit is constant * log2(n) we keep our worst-case bound.
But the min can now be removed.

Member

Yep, cdepth can be anything that serves as a reasonable cutoff. It is independent of the actual stack depth, as there may be jumps in its value greater than one along the stack: a partition gets popped and gets replaced by a partition with cdepth decremented. Continuous bad partitioning actually leads to the smallest stack.
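Spelling the bound out: npy_get_msb(num) is essentially floor(log2(num)), so the limit is about 2*log2(n). Each quicksort level costs O(n) work in total, so the levels above the cutoff contribute at most O(n log n), and every partition that exhausts its budget is finished by heapsort, which is O(k log k) on its k elements and O(n log n) summed over disjoint partitions. Any constant factor on log2(n) gives the same asymptotic bound, so the 2 is purely slack for imperfect pivots.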

@charris (Member) commented Jul 28, 2016

Looks generally good, just a few comments for consideration.

@@ -33,10 +53,11 @@
 #include <stdlib.h>

 #define NOT_USED NPY_UNUSED(unused)
-#define PYA_QS_STACK 100
+#define PYA_QS_STACK 128
Member

Maybe

#define PYA_QS_STACK (2 * NPY_BITSOF_INTP)

Or just use NPY_BITSOF_INTP directly below.

@juliantaylor (Contributor, PR author)

Hm, shifting the count is a neat way of writing it, but I think it makes it less obvious what we are doing. Performance-wise it shouldn't matter; the msb is cheap (and could be further optimized using __builtin_clz).
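For reference, the __builtin_clz route would look something like this (GCC/Clang builtin; the helper name is hypothetical and this is not NumPy's npy_get_msb):

#include <limits.h>

/* Index of the most significant set bit.  The caller must ensure v != 0,
 * since __builtin_clzll(0) is undefined. */
static inline int msb_clz(unsigned long long v)
{
    return (int)(sizeof(v) * CHAR_BIT - 1) - __builtin_clzll(v);
}

/* e.g. msb_clz(1) == 0, msb_clz(1024) == 10, msb_clz(1025) == 10 */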

@charris (Member) commented Jul 29, 2016

> Hm, shifting the count is a neat way of writing it,

On second thought I decided it was a bad idea ;) It can't get the factor of two increase in bits.

the larger partition needs its own depth limit, else it grows faster than
logarithmic.
Also increase the stack size to hold up to 64 instead of 50 pointer
pairs, else it might overflow on arrays larger than 2^50 elements.

benchmark for random, all-equal elements, many-equal elements, and the
median-of-3 worst case
@@ -69,10 +94,12 @@ quicksort_@suff@(void *start, npy_intp num, void *NOT_USED)
@type@ *stack[PYA_QS_STACK];
@type@ **sptr = stack;
@type@ *pm, *pi, *pj, *pk;
npy_intp depth_limit = npy_get_msb(num) * 2;
int depth[PYA_QS_STACK];
Member

Well, this is large by a factor of 2, but it certainly doesn't hurt.

@charris merged commit 9506037 into numpy:master on Jul 29, 2016
@charris (Member) commented Jul 29, 2016

Thanks Julian.

@juliantaylor deleted the introsort2 branch on July 30, 2016 19:29