BUG: handle introsort depth limit properly by juliantaylor · Pull Request #7871 · numpy/numpy · GitHub

BUG: handle introsort depth limit properly #7871


Merged: 2 commits merged from introsort2 into numpy:master on Jul 29, 2016

Conversation

juliantaylor (Contributor)

Each partition needs its own depth limit, else it grows faster than
logarithmic ...

@juliantaylor force-pushed the introsort2 branch 5 times, most recently from 7142aa5 to c173946 on July 26, 2016 19:32
@charris (Member) commented Jul 27, 2016

I've taken a closer look at this and I'm not convinced that it is correct. Note that the largest partition gets pushed on the stack so that the stack never grows rapidly; in fact it grows fastest if each partitioning exactly divides the working partition into two equal pieces. I think what you want to check is whether it is growing too slowly, which implies that the partitions on the stack are larger than desired. What am I missing?

Note that the stack was managed that way so it never got bigger than log2(n) recursions.
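For readers following along, here is a minimal standalone sketch of the scheme described above (hypothetical names and a trivial Lomuto partition on int, not NumPy's templated source): the larger half of every split is deferred onto an explicit stack and work continues on the smaller half, which is what keeps the number of pending partitions bounded by about log2(n).

#include <stddef.h>

#define QS_STACK 128        /* 64 pairs of bounds, plenty for 64-bit sizes */
#define SMALL_CUTOFF 16     /* below this, insertion sort is faster */

static void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi)
{
    for (ptrdiff_t i = lo + 1; i <= hi; i++) {
        int v = a[i];
        ptrdiff_t j = i - 1;
        while (j >= lo && a[j] > v) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = v;
    }
}

void quicksort_sketch(int *a, ptrdiff_t n)
{
    ptrdiff_t stack[QS_STACK];
    ptrdiff_t *sptr = stack;
    ptrdiff_t lo = 0, hi = n - 1;

    for (;;) {
        while (hi - lo > SMALL_CUTOFF) {
            /* trivial last-element pivot; the real code uses median-of-3 */
            int pivot = a[hi];
            ptrdiff_t p = lo;
            for (ptrdiff_t j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    int t = a[p]; a[p] = a[j]; a[j] = t;
                    p++;
                }
            }
            int t = a[p]; a[p] = a[hi]; a[hi] = t;

            if (p - lo < hi - p) {    /* right half is larger: defer it */
                *sptr++ = p + 1;
                *sptr++ = hi;
                hi = p - 1;           /* keep working on the smaller left half */
            }
            else {                    /* left half is larger: defer it */
                *sptr++ = lo;
                *sptr++ = p - 1;
                lo = p + 1;           /* keep working on the smaller right half */
            }
        }
        insertion_sort(a, lo, hi);

        if (sptr == stack) {
            break;                    /* nothing deferred: done */
        }
        hi = *(--sptr);
        lo = *(--sptr);
    }
}

Each push at least halves the working range, so a 64-pair stack covers any array a 64-bit index can address; this is the log2(n) bound referred to above.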

@charris (Member) commented Jul 27, 2016

Or you could push the smallest partition on the stack instead, making it grow too fast; no need for another stack in that case. If you want to duplicate the introsort pseudocode, just push the first partition on the stack along with the depth, no need for an extra stack. There are a lot of ways to do it...

@charris (Member) commented Jul 27, 2016

If you don't push the largest partition on the stack you will want a bigger stack. Using our own stack rather than straight recursion adds speed, so we should keep it, but it complicates introsort. Note that if you push the smallest partitions, a lot of them will already be sorted one-element partitions by the time you hit maximum depth.

@charris (Member) commented Jul 27, 2016

I think the thing to do is ditch the new stack, decrement the depth after the partition, and put the depth on the stack along with the largest partition, then reset the depth from the stack when the stack is popped. The depth popped will not always be the same as the actual stack depth, which is as it should be as it will vary with the pattern of pushes and pops.

EDIT: I'd also move the heapsort check inside the main while loop so that insertion sort gets called by preference.

@juliantaylor (Contributor, PR author)

It should be sufficient to only increase the depth counter for the large partition, as the small partition is always less than 50% of the data and thus not degraded.

@juliantaylor (Contributor, PR author)

To clarify: properly mirroring the recursive introsort would mean checking inside the while loop and incrementing the local cdepth counter on each iteration. The while loop is the tail-call-optimized small-partition path of the recursive quicksort, so it is bounded to log(N) stack space.
But it should not be necessary to count the small-partition steps, as we only want to count the maximum recursion depth, which is determined by the large-partition part.
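The recursive shape being mirrored looks roughly like the following sketch (hypothetical helper names; partition, heapsort_range and insertion_sort are assumed, e.g. as in the earlier sketch): each call carries its own remaining depth budget, the larger half goes into the recursive call, and the smaller half is the tail-call-optimized while loop.

#include <stddef.h>

#define SMALL_CUTOFF 16

/* assumed helpers (hypothetical, not NumPy's identifiers) */
ptrdiff_t partition(int *a, ptrdiff_t lo, ptrdiff_t hi);
void heapsort_range(int *a, ptrdiff_t lo, ptrdiff_t hi);
void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi);

static void introsort_rec(int *a, ptrdiff_t lo, ptrdiff_t hi, int depth_limit)
{
    while (hi - lo > SMALL_CUTOFF) {
        if (depth_limit <= 0) {
            /* partitioning degraded: finish this range in O(k log k) */
            heapsort_range(a, lo, hi);
            return;
        }
        depth_limit--;
        ptrdiff_t p = partition(a, lo, hi);

        if (p - lo < hi - p) {
            /* right half is larger: recurse on it, loop on the left */
            introsort_rec(a, p + 1, hi, depth_limit);
            hi = p - 1;
        }
        else {
            /* left half is larger: recurse on it, loop on the right */
            introsort_rec(a, lo, p - 1, depth_limit);
            lo = p + 1;
        }
    }
    insertion_sort(a, lo, hi);
}

void introsort(int *a, ptrdiff_t n)
{
    int limit = 0;
    for (ptrdiff_t m = n; m > 1; m >>= 1) {
        limit += 2;          /* depth limit of roughly 2*log2(n) */
    }
    introsort_rec(a, 0, n - 1, limit);
}

Counting the budget down inside the loop is the "proper mirroring" mentioned above; the PR only counts the large-partition steps, which is enough to bound the recursion depth.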

@pv (Member) commented Jul 27, 2016 via email

@charris (Member) commented Jul 27, 2016

Yes, the depth counter only matters when the depth is popped off the stack, which is why it should be decremented before pushing onto the stack. If the check is inside the loop, insertion sort will be called before heapsort except possibly after a stack pop. And please put everything on one stack; using a separate integer stack doesn't save much room.

EDIT: i.e., using the original introsort's decrementing depth.

@charris (Member) commented Jul 27, 2016

Maybe only check the depth immediately after the stack pop.

@juliantaylor (Contributor, PR author)

Hm, the depth check could indeed be put after the stack pop. But it is a rather minor optimization; the check is only true in rare, exceptional cases in the first place, so it shouldn't make much difference whether insertion sort or heapsort gets called on the small remaining data.

@charris (Member) commented Jul 27, 2016

> it shouldn't make much difference whether insertion sort or heapsort gets called on the small remaining data.

Agree, it would probably never be noticed.

@charris (Member) commented Jul 27, 2016

It does have the advantage that there is no goto and the while loop always runs to completion.

@juliantaylor (Contributor, PR author)

You still need the goto, as you have to skip the sorting of the already-heapsorted array and get to the next stack pop.
I also noticed the initialization of the depth stack is unnecessary; I'll update that later.
I don't like merging into one stack very much; it would then contain values interpreted as pointers and values interpreted as integers, which, while technically the same, does not really make the code nicer.

@charris (Member) commented Jul 27, 2016

OK, so in the original, just

            if (pi - pl < pr - pi) {
                *sptr++ = pi + 1;
                *sptr++ = pr;
                pr = pi - 1;
            }
            else {
                *sptr++ = pl;
                *sptr++ = pi - 1;
                pl = pi + 1;
            }
            *sptr++ = --depth;

and add

        if (sptr == stack) {
            break;
        }
        pr = *(--sptr);
        pl = *(--sptr);
        depth = *(--sptr);

Update the comment before the stack push and increase the stack size by 50, and I think that does it.

@charris (Member) commented Jul 27, 2016

> it would then contain values interpreted as pointers and values interpreted as integers

Oh, right. I missed that, forgot we were pushing pointers instead of indices. Could have done indices originally, I guess, but the pointers were a bit faster...

@charris (Member) commented Jul 27, 2016

Another option is to call heapsort rather than pushing the partition on the stack.

@juliantaylor (Contributor, PR author)

I have removed the initialization of the depth stack and added some new benchmarks.
I did not move the heapsort bail-out; having a sort after the stack-management code does not match the flow of the function.
Avoiding a small heapsort in a very rare case is also not worth adding a (cheap) check to all cases.

@@ -107,6 +111,7 @@ quicksort_@suff@(void *start, npy_intp num, void *NOT_USED)
*sptr++ = pi - 1;
pl = pi + 1;
}
*psdepth++ = cdepth + 1;
Member

I think this should be *psdepth++ = ++cdepth;. cdepth should be increasing up the stack.

Contributor Author

It's cdepth + 1; the other variables are also updated without in-place operators, so this is consistent.

Member

I'm pretty sure it is wrong. Every push on the stack should increment; depth has nothing to do with the actual stack depth here.

Member

Just as every recursion in the introsort pseudocode decrements depth.

Contributor Author

Every stack push does increase the depth; we are only skipping the increase for the shorter partition, as it is not necessary to keep track of it there.

The depth is never decreased; the only thing that happens is a switch to a different depth in a different "recursive" call tree.

Contributor Author (juliantaylor, Jul 27, 2016)

Doh, right, I moved this variable into a higher scope ...
It does not really matter for performance, as we don't really care about the absolute depth of the stack; we just want to cut off at some point.
But it is relevant for avoiding overflow of the stack, which is bounded to 100 elements and could overflow for arrays larger than 2^50 elements, so I'll change it.
Sorry for being so dense here.

Contributor Author

Hm, the overflow can still happen, as we aren't checking the depth in the small partition, so let's just increase the stack from 100 to 128 (as each level adds two pointers).
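For scale, the arithmetic behind the 100 -> 128 bump: each deferred partition costs two pointers on the stack, and because every pushed partition is at most half the one it was split from, an n-element array needs at most about log2(n) pending partitions at once. A 100-entry stack therefore holds 50 pairs and is only safe up to roughly 2^50 elements, while 128 entries hold 64 pairs, enough for any 64-bit npy_intp.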

Member

Sounds good.

Member

Still need the ++ though.

Contributor Author

Yes, I updated the code; it still has the goto though.
I'll sleep on it; let's see how I like it tomorrow :)

@juliantaylor (Contributor, PR author)

I'll switch it to counting the depth down from the limit; I initially counted up because I misguidedly thought I had to initialize the depth stack (memset(0) vs. a for loop).
Then the pseudocode of the recursive implementation matches the iterative variant, so I'll add that as a comment, which should help clear up how this works.
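Counting down from the limit, the iterative variant being converged on here looks roughly like this (a sketch with hypothetical names and the same assumed helpers as above, not the merged NumPy source): the remaining budget is pushed next to each deferred larger partition and restored when that partition is popped, and a popped partition whose budget is exhausted is handed to heapsort.

#include <stddef.h>

#define QS_STACK 128
#define SMALL_CUTOFF 16

/* assumed helpers (hypothetical, not NumPy's identifiers) */
ptrdiff_t partition(int *a, ptrdiff_t lo, ptrdiff_t hi);
void heapsort_range(int *a, ptrdiff_t lo, ptrdiff_t hi);
void insertion_sort(int *a, ptrdiff_t lo, ptrdiff_t hi);

void introsort_iter(int *a, ptrdiff_t n)
{
    ptrdiff_t stack[QS_STACK];       /* two entries per deferred partition */
    ptrdiff_t *sptr = stack;
    int depth[QS_STACK / 2];         /* one budget value per deferred partition */
    int *psdepth = depth;
    ptrdiff_t lo = 0, hi = n - 1;
    int cdepth = 0;

    for (ptrdiff_t m = n; m > 1; m >>= 1) {
        cdepth += 2;                 /* depth limit of roughly 2*log2(n) */
    }

    for (;;) {
        if (cdepth < 0) {
            /* this branch has used up its budget: degrade to heapsort */
            heapsort_range(a, lo, hi);
            goto stack_pop;
        }
        while (hi - lo > SMALL_CUTOFF) {
            ptrdiff_t p = partition(a, lo, hi);
            if (p - lo < hi - p) {   /* defer the larger right half */
                *sptr++ = p + 1;
                *sptr++ = hi;
                hi = p - 1;
            }
            else {                   /* defer the larger left half */
                *sptr++ = lo;
                *sptr++ = p - 1;
                lo = p + 1;
            }
            *psdepth++ = --cdepth;   /* the deferred partition is one level deeper */
        }
        insertion_sort(a, lo, hi);

stack_pop:
        if (sptr == stack) {
            break;                   /* no deferred partitions left */
        }
        hi = *(--sptr);
        lo = *(--sptr);
        cdepth = *(--psdepth);       /* resume that branch's remaining budget */
    }
}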

@charris (Member) commented Jul 28, 2016

I'm thinking that instead of npy_get_msb, make the stack of size_t type, initialize with double the count, and shift right on each pass instead of decrementing. That solves the overflow problem and is also more flexible.

npy_intp depth_limit = npy_get_msb(num) * 2;
int depth[PYA_QS_STACK];
int * psdepth = depth;
int cdepth = NPY_SORT_MIN(PYA_QS_STACK, npy_get_msb(num) * 2);
Member

I don't think * 2 is needed here; just adjust the cutoff point. Then you don't need the NPY_SORT_MIN either, especially as it is now 64. I suppose you could also use NPY_BITSOF_INTP.

Member

NVM. However, the stack can't fill up, so just

int cdepth = npy_get_msb(num) * 2;

Contributor Author

The factor of 2 is just an arbitrary increase (which should be documented) to give the algorithm some slack on non-ideal data. The only thing that needs to be done is to prevent the depth from becoming almost linear in the number of elements, so as long as the limit is constant * log2(n) we keep our worst-case bound.
But the min can now be removed.

Member

Yep, cdepth can be anything that serves as a reasonable cutoff. It is independent of the actual stack depth, as there may be jumps in its value greater than one along the stack: a partition gets popped and gets replaced by a partition with cdepth decremented. Continuous bad partitioning actually leads to the smallest stack.
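Spelling the bound out: npy_get_msb(num) is essentially floor(log2(num)), so the limit is about 2*log2(n). Each quicksort level costs O(n) work in total, so the levels above the cutoff contribute at most O(n log n), and every partition that exhausts its budget is finished by heapsort, which is O(k log k) on its k elements and O(n log n) summed over disjoint partitions. Any constant factor on log2(n) gives the same asymptotic bound, so the 2 is purely slack for imperfect pivots.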

@charris (Member) commented Jul 28, 2016

Looks generally good, just a few comments for consideration.

@@ -33,10 +53,11 @@
 #include <stdlib.h>

 #define NOT_USED NPY_UNUSED(unused)
-#define PYA_QS_STACK 100
+#define PYA_QS_STACK 128
Member

Maybe

#define PYA_QS_STACK (2 * NPY_BITSOF_INTP)

Or just use NPY_BITSOF_INTP directly below.

@juliantaylor (Contributor, PR author)

Hm, shifting the count is a neat way of writing it, but I think it makes it less obvious what we are doing. Performance-wise it shouldn't matter; the msb is cheap (and could be further optimized using __builtin_clz).
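For reference, the __builtin_clz route would look something like this (GCC/Clang builtin; the helper name is hypothetical and this is not NumPy's npy_get_msb):

#include <limits.h>

/* Index of the most significant set bit.  The caller must ensure v != 0,
 * since __builtin_clzll(0) is undefined. */
static inline int msb_clz(unsigned long long v)
{
    return (int)(sizeof(v) * CHAR_BIT - 1) - __builtin_clzll(v);
}

/* e.g. msb_clz(1) == 0, msb_clz(1024) == 10, msb_clz(1025) == 10 */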

@charris (Member) commented Jul 29, 2016

> Hm, shifting the count is a neat way of writing it,

On second thought I decided it was a bad idea ;) It can't get the factor of two increase in bits.

the larger partition needs its own depth limit, else it grows faster than
logarithmic.
Also increase the stack size to hold up to 64 instead of 50 pointer
pairs, else it might overflow on arrays larger than 2^50 elements.

benchmark for random, all-equal elements, many-equal elements, and the
median-of-3 worst case
@@ -69,10 +94,12 @@ quicksort_@suff@(void *start, npy_intp num, void *NOT_USED)
@type@ *stack[PYA_QS_STACK];
@type@ **sptr = stack;
@type@ *pm, *pi, *pj, *pk;
npy_intp depth_limit = npy_get_msb(num) * 2;
int depth[PYA_QS_STACK];
Member

Well, this is large by a factor of 2, but it certainly doesn't hurt.

@charris merged commit 9506037 into numpy:master on Jul 29, 2016
@charris (Member) commented Jul 29, 2016

Thanks Julian.

@juliantaylor deleted the introsort2 branch on July 30, 2016 19:29