MAINT: Rewrite Floyd algorithm #13812
Conversation
- use a simple hash table with linear probing
- adjust heuristic
This is to address scipy/scipy#9699. See also #5299. To sample sparse matrices with a given density of non-zero values, scipy calls choice(N, k, replace=False), where N is large and k << N. The best algorithm for this is Floyd's, which is O(k). The current implementation was rarely used because of an incorrect heuristic (it looks like it assumed Floyd's algorithm is O(k**2), whereas it's actually O(k)). I've also replaced the Python set structure with a simple hash table, which has much lower memory usage and is quite a bit faster. Timing tests show that the new version of Floyd sampling beats the Fisher-Yates shuffle even when k ~ N.
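For reference, a minimal Python sketch of Floyd's algorithm (not the Cython code in this PR, which works on a bit generator and a specialized hash table): it does exactly k draws and k membership tests, hence O(k). The output is in selection order, which is not a uniform random permutation of the chosen values; that point comes up later in this thread.

```python
import random

def floyd_sample(n, k, rng=random):
    """Return k distinct integers from range(n) using Floyd's algorithm."""
    selected = set()
    out = []
    for j in range(n - k, n):
        t = rng.randrange(j + 1)   # uniform draw from {0, ..., j}
        if t in selected:          # collision: j itself cannot have been
            t = j                  # picked yet, so take it instead
        selected.add(t)
        out.append(t)
    return out
```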
numpy/random/generator.pyx
Outdated
set_size = <uint64_t>(1.2 * size_i)
mask = _gen_mask(set_size)
set_size = 1 + mask
hash_set = <uint64_t*>malloc(sizeof(uint64_t) * set_size)
The gain from using malloc and free seems pretty small compared to allocating an array with dtype uint64. If you go this way you need to gracefully handle the unlikely event that malloc fails (if you use an array, you get the handling for free). There is also a question as to whether you should be using malloc/free or PyMem_Malloc/PyMem_Free, which I'm not certain about. The latter is faster under some (unknown to me) conditions.
Changes look good.
numpy/random/generator.pyx
Outdated
idx_data[loc] = val = j
idx_set.add(val)
loc += 1
val = random_interval(&self._bitgen, j)
You should be using random_bounded_uint64(&self._bitgen, 0, j, 0, 0) here, which is faster for most j that aren't very close to powers of 2.
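A hedged illustration of why, assuming random_interval uses a mask-and-reject strategy while random_bounded_uint64 can use Lemire-style multiply-and-reject (both helpers below are illustrative Python, not NumPy's C internals): for a bound just above a power of two, the masked version throws away nearly half of its draws, while the multiply version rejects almost nothing.

```python
def masked_bounded(next_u64, rng):
    """Uniform integer on [0, rng] (inclusive) via mask-and-reject."""
    mask = rng
    for shift in (1, 2, 4, 8, 16, 32):   # smear the highest set bit downwards
        mask |= mask >> shift
    while True:
        x = next_u64() & mask
        if x <= rng:                     # rejects up to ~50% of draws
            return x

def lemire_bounded(next_u64, s):
    """Uniform integer on [0, s) via Lemire's multiply-and-reject (s >= 1)."""
    x = next_u64()
    m = x * s                            # full 128-bit product (Python ints)
    low = m & 0xFFFFFFFFFFFFFFFF
    if low < s:
        threshold = (1 << 64) % s        # count of unusable low residues
        while low < threshold:           # rejects at most s / 2**64 of draws
            x = next_u64()
            m = x * s
            low = m & 0xFFFFFFFFFFFFFFFF
    return m >> 64

# Example source of raw 64-bit words:
# import numpy as np
# rng = np.random.default_rng()
# next_u64 = lambda: int(rng.integers(0, 1 << 64, dtype=np.uint64))
```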
For the O(k) claim on set operations, I assumed this is correct: https://wiki.python.org/moin/TimeComplexity
If you have time it would be good to improve the heuristic since you are still switching for k > 0.1N.
If you could show that it is faster for k >= 0.5N, even if it is slower for k > xN with x near 1 (and large N), then it might be possible to return the complement (although I'm not sure how the ordering would work in that case).
How does this compare to #10124 if p is set to 1/pop_size?
For the set complexity, my understanding is that a lookup is O(1) amortized, and worst case O(n) if the hash table behind it gets resized. But we know exactly how big the hash table needs to be, which is what I did here. There is also no need to worry about collisions because we're sampling uniformly from 1..N, so the data is as well behaved as possible.
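A rough Python sketch of that idea (the PR's table lives in a raw uint64 buffer; the class name, the -1 sentinel, and the exact sizing below are illustrative): size the table once as the next power of two above 1.2*k, hash by the value itself, and probe linearly on collisions. Because the table is strictly larger than k, a probe always terminates.

```python
class LinearProbeSet:
    """Fixed-size open-addressing set for non-negative integers."""

    def __init__(self, k):
        size = 1
        while size < 1.2 * k:            # next power of two above 1.2 * k
            size <<= 1
        self.mask = size - 1
        self.table = [-1] * size         # -1 marks an empty slot

    def add(self, val):
        """Insert val; return True if it was not already present."""
        loc = val & self.mask            # identity hash: fine for uniform draws
        while self.table[loc] != -1:
            if self.table[loc] == val:
                return False
            loc = (loc + 1) & self.mask  # linear probing
        self.table[loc] = val
        return True
```

Plugged into the Floyd loop, add() replaces the set membership test and insertion pair, with no per-insert allocation.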
To give an idea, timing results on my machine, for Floyd and for the tail Fisher-Yates shuffle:
So Floyd is actually slightly faster. I'm not quite sure why; maybe _shuffle_raw needs to handle strides, etc., which slows it down.
Given your timings it is tempting to use this in all cases. Plus it reduces a branch. I can see how your specialized hash table is much faster than using a Python set.
This algorithm is O(n) since it needs to draw n exponentials and then find the top k values, which is also O(n) (https://en.wikipedia.org/wiki/Selection_algorithm). So it's worse than Floyd.
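For comparison, a hedged sketch of that O(n) approach with equal weights (an illustration of the idea, not the code in #10124): draw one exponential key per population element and keep the indices of the k smallest.

```python
import numpy as np

def sample_by_exponential_keys(n, k, rng=None):
    """Uniform sample of k distinct indices from range(n); O(n), assumes 1 <= k <= n."""
    rng = np.random.default_rng() if rng is None else rng
    keys = rng.exponential(size=n)            # one draw per population element
    return np.argpartition(keys, k - 1)[:k]   # O(n) selection of the k smallest keys
```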
Can this close #5158?
@mattip This is worth serious consideration for 1.17, when you have time. It is much easier to get this in now than to wait until after 1.17, since it will change the ordering for a given seed/bit generator. This implementation is far superior to the naive version that is currently in Generator.
Needs rebase. Is that test correct? @bashtage Just to be sure, you think this should go in?
It is a definite improvement, and since changing the stream is still difficult (just not impossible), it is better to get it in now rather than later. @thrasibule could you do a simple statistical analysis of the changes to make sure that it appears to have the correct properties? You couldn't, for example, sample …
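A minimal sketch of one such check, assuming the property of interest is that every population element is selected with equal probability regardless of output position (helper names and usage below are illustrative):

```python
import numpy as np

def selection_frequencies(sample_fn, pop_size, size, n_trials=10_000):
    """Empirical probability that each population element appears in a sample."""
    counts = np.zeros(pop_size)
    for _ in range(n_trials):
        counts[sample_fn(pop_size, size)] += 1   # indices within one draw are distinct
    return counts / n_trials

# e.g. with a Generator (call shape is an assumption about the final API):
# rng = np.random.default_rng()
# freqs = selection_frequencies(lambda n, k: rng.choice(n, k, replace=False), 100, 10)
# every entry of freqs should be close to size / pop_size, i.e. 0.1 here
```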
@bashtage something like that?
I have a couple of questions:
Is this a concern? I vaguely recall that application was mentioned at some point with regard to bootstrapping(?). Or maybe that was for selection with replacement.
Is this specific to this implementation? I don't think that is true for the Floyd implementation in master (just checked, it is not, e.g. …). Can this be worked around if a different hash implementation was used (i.e. faster than set() but perhaps slower than your table)? This makes switching between Floyd and tail seem stranger as well.
This is probably vestigial, but one has to be certain.
Just a thought -- could you store a second table that has the order information for each value selected, and then do a quick reorder at the end? So that when n=k, the order in which each value is selected would be preserved in the output. For example, if 0 is selected 10th, then out[9] = 0, while if 0 is selected last, then out[-1] = 0.
It depends on what the use case is. A specific use case could actually be to have the sample in sorted order, so having it shuffled doesn't bring any benefit.
No, the implementation in master already had the same issue. You couldn't see it because the heuristic does a tail shuffle if k=n. When k < n, with Floyd's algorithm, the indices will tend to be lower at the beginning of the sample than at the end. I agree that the change in behavior is strange if we keep both implementations. So we can either keep only Floyd's and specify in the docstring that the sample is not a permutation, or keep both and do a shuffle at the end of Floyd's algorithm (this will still be O(k)).
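In code, the second option is just a Fisher-Yates pass over the k selected values appended to the Floyd loop; a sketch (reusing the floyd_sample helper from the sketch near the top of this thread) showing the total work stays O(k):

```python
import random

def shuffled_floyd_sample(n, k, rng=random):
    out = floyd_sample(n, k, rng)    # O(k) selection (see the earlier sketch)
    for i in range(k - 1, 0, -1):    # Fisher-Yates over only k elements: O(k)
        j = rng.randrange(i + 1)
        out[i], out[j] = out[j], out[i]
    return out
```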
Yup - you are completely correct. I see it now. (I also had the wrong command above.) IMO we need a shuffle at the end, since any users who expected similar behavior w.r.t. order from RandomState could be very surprised. This must have an impact on the break-even point, I suspect. Around 50%?
If you want to consider a plain Floyd with no shuffle then I think this needs to go to the mailing list, since it seems to be more than a refactor.
@bashtage What should we do with this for 1.17?
I am just one user, but I am in favor of including the new algorithm without a shuffle at the end in v1.17. I could really use the fastest … This is the one chance to simplify the API. Generator already is not a drop-in replacement for RandomState because they're not stream compatible. If the change in API is too much to stomach, at least expose the fast path in a separate function so I can call it without the shuffle.
You've already done all the experiments to figure out the heuristic, which is just an extra line when you add in the branch for the keyword-dependent shuffle. Is the code really that much more complex that you'd let performance suffer?
I thought it was more involved, but in pseudo-code I think it would be:
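Something along these lines, perhaps (a sketch with a placeholder 0.5 cutoff, not a threshold anyone has settled on):

```python
def choose_strategy(pop_size, size, shuffle):
    """Hypothetical dispatch between Floyd and the tail (Fisher-Yates) path."""
    if not shuffle:
        return "floyd"                  # no shuffle requested: Floyd always wins
    if size > 0.5 * pop_size:           # placeholder break-even point
        return "tail_shuffle"           # permute the population, take the tail
    return "floyd_then_shuffle"         # Floyd plus an O(size) shuffle of the result
```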
This said, I still think there is a case to be made for limiting the allocation of a large array when it isn't actually required, but then the tuning requires a lot of guess-work. A better solution would probably be to add something to the notes that explains that using …
I was thinking more something along these lines:
Seems like no solution is perfect anyway unless we expose all the knobs. I'll try to send an email to the mailing list summarizing the discussion so far. In any case, whatever gets decided will be strictly faster than before, so it's still a net win, even if not optimal in all situations.
Sent the email to the mailing list.
hash_set[loc] = j
idx_data[j - pop_size_i + size_i] = j
if shuffle:
    self._shuffle_int(size_i, 1, idx_data)
Is there a reason not to reuse the existing shuffle or _shuffle_raw?
I've added _shuffle_int as a specialized version of _shuffle_raw because it's ~30% faster for integers.
Thanks @mattip for the post on the mailing list. I've tried to address all the suggestions so far. I've also experimented with using a bitset instead of a hash table. It makes sense to use from a memory perspective when k >= N/64, and it's also slightly faster because testing for an element is just a bit shift and an AND operation. I've written the code in a different branch, see here. I've refactored the logic of the Floyd algorithm into two functions which I've added as static methods to the Generator class. I'm not sure that's the best place to put them; maybe generator.pxd would be better. Anyway, here are some timings: it's faster in the shuffle=False, k >= N/50 range.
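As an aside on the bitset variant mentioned above, a minimal Python sketch of the membership test (the linked branch works on raw uint64 words in Cython; the class below is just the idea): one bit per population element, so memory is N/8 bytes, and a test is a shift plus an AND.

```python
class BitSet:
    """Membership test over {0, ..., n-1} using one bit per element."""

    def __init__(self, n):
        self.words = [0] * ((n + 63) // 64)   # ceil(n / 64) 64-bit words

    def add(self, val):
        """Set the bit for val; return True if it was not already set."""
        word, bit = val >> 6, val & 63        # val // 64, val % 64
        if self.words[word] & (1 << bit):     # just a shift and an AND
            return False
        self.words[word] |= 1 << bit
        return True
```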
Thanks @thrasibule.