MAINT: Rewrite Floyd algorithm by thrasibule · Pull Request #13812 · numpy/numpy · GitHub
MAINT: Rewrite Floyd algorithm #13812


Merged
merged 13 commits into numpy:master on Jul 13, 2019
Conversation

thrasibule
Contributor
  • use a simple hash table with linear probing
  • adjust heuristic

@thrasibule
Contributor Author

This is to address scipy/scipy#9699. See also #5299. To sample sparse matrices with a given density of non-zero values, scipy calls choice(N, k, replace=False), where N is large and k << N. The best algorithm for this is Floyd's algorithm, which is O(k). The current implementation was rarely used because of an incorrect heuristic (it looks like the heuristic assumed Floyd's algorithm is O(k**2), where it's actually O(k)).

I've also replaced the set structure in Python with a simple hash table, which has much lower memory usage and is quite a bit faster. Timing tests show that the new version of the Floyd sampling beats the Fisher-Yates shuffle even when k ~ N.
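For reference, here is a minimal pure-Python sketch of Floyd's algorithm (the actual change is in Cython with the hash table described above; this simplified version, with the made-up name floyd_sample, just uses a Python set and the stdlib random module):

import random

def floyd_sample(n, k, rng=random):
    # Sample k distinct integers from range(n) in O(k) expected time.
    selected = set()
    sample = []
    for j in range(n - k, n):       # only the last k candidate positions are visited
        t = rng.randrange(j + 1)    # uniform draw from 0..j inclusive
        if t in selected:           # t was already picked earlier, so take j itself;
            t = j                   # j cannot have been picked yet at this step
        selected.add(t)
        sample.append(t)
    return sample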

@charris charris changed the title from Rewrite Floyd algorithm to MAINT: Rewrite Floyd algorithm on Jun 20, 2019
set_size = <uint64_t>(1.2 * size_i)
mask = _gen_mask(set_size)
set_size = 1 + mask
hash_set = <uint64_t*>malloc(sizeof(uint64_t) * set_size)
Contributor

The gain from using malloc and free seems pretty small compared to allocating an array with dtype uint64_t. If you go this way you need to gracefully handle the unlikely event that malloc fails (if you use an array, you get that handling for free). There is also a question as to whether you should be using malloc/free or PyMem_Malloc/PyMem_Free, which I'm not certain about. The latter is faster under some (unknown to me) conditions.

Contributor

Changes look good.

idx_data[loc] = val = j
idx_set.add(val)
loc += 1
val = random_interval(&self._bitgen, j)
Contributor

You should be using random_bounded_uint64(&self._bitgen, 0, j, 0, 0) here which is faster for most j that aren't very close to powers of 2.

@bashtage
Contributor
bashtage commented Jun 21, 2019

The current implementation was rarely used because of an incorrect heuristic (it looks like the heuristic assumed Floyd's algorithm is O(k**2), where it's actually O(k)).

The O(k**2) comes from the fact that a lookup in a Python set is worst case O(n), where n is the number of elements; I don't know exactly where that worst case kicks in. If it hits O(n) well before n ≈ k (in particular, for some value where 1 - n/k > c > 0 for some c > 0), then one would end up doing on the order of c*k lookups at worst-case cost each, hence the O(k**2). This plus timing led to the improved tail shuffle (which is O(pop_size) since it allocates an array 1:pop_size).

I assumed this is correct: https://wiki.python.org/moin/TimeComplexity

@bashtage
Contributor

Timing tests show that the new version of the Floyd sampling beats the Fisher-Yates shuffle even when k~N.

If you have time, it would be good to improve the heuristic, since you are still switching for k > 0.1N.

@bashtage
Contributor

If you could show that it is faster for k >= 0.5N, even if it is slower for k > x% of N for x near 1 (and large N), then it might be possible to return the complement (although I'm not sure how the ordering would work).
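Roughly, the complement idea could look like the sketch below (hypothetical names, and it deliberately sidesteps the ordering question: the result comes out sorted, not permuted):

import numpy as np

def sample_via_complement(n, k, rng=None):
    # When k > n/2 it is cheaper to sample the n - k values to EXCLUDE,
    # then return everything else. The output is in ascending order,
    # which is exactly the ordering caveat mentioned above.
    rng = rng or np.random.default_rng()
    excluded = rng.choice(n, n - k, replace=False)
    keep = np.ones(n, dtype=bool)
    keep[excluded] = False
    return np.flatnonzero(keep)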

@bashtage
Contributor

How does this compare to #10124 if p is set to 1/pop_size?

@thrasibule
Contributor Author

For the set complexity, my understanding is that a lookup is O(1) amortized, and worst case O(n) if the hash table behind it gets resized. But we know exactly how big the hash table needs to be, which is what I did. There is also no need to worry about collisions because we're sampling uniformly from 1..N, so the data is as well behaved as possible.
So theoretically they're both O(k) in time, but Fisher-Yates is O(n) in space and Floyd is O(k). The only reason I still have a heuristic is that when k=n it seems hard to beat Fisher-Yates, which just does n swaps, which is about as fast as it gets.
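To make the hash-table point concrete, here is a rough pure-Python model of the structure (the PR does this in Cython over a raw uint64 buffer; the class and names below are only illustrative):

class FixedHashSet:
    # Open-addressing hash set with linear probing, sized once up front so it
    # never resizes; the table size is a power of two so that "index & mask"
    # replaces the modulo when a probe wraps around.
    EMPTY = -1  # sentinel; keys are assumed to be non-negative integers

    def __init__(self, expected_size):
        size = 1
        while size < 1.2 * expected_size:   # ~20% headroom keeps probe chains short
            size *= 2
        self.mask = size - 1
        self.table = [self.EMPTY] * size

    def add(self, key):
        i = key & self.mask
        while self.table[i] != self.EMPTY:  # linear probing
            if self.table[i] == key:
                return False                # already present
            i = (i + 1) & self.mask         # cheap power-of-two wraparound
        self.table[i] = key
        return True                         # newly inserted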

@thrasibule
Contributor Author

To give an idea, timing results on my machine.
For the rewritten Floyd algorithm:

In [5]: %timeit choice(10_000, 10_000, replace=False)        
101 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit choice(100_000, 100_000, replace=False)    
861 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit choice(1_000_000, 1_000_000, replace=False)                                                 
14.1 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit choice(10_000_000, 10_000_000, replace=False)                                                 
498 ms ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit choice(100_000_000, 100_000_000, replace=False)                                             
6 s ± 53.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit choice(1_000_000_000, 100_000_000, replace=False)                                           
9.01 s ± 70.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit choice(10_000_000_000, 100_000_000, replace=False)                                          
9.99 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For the tail Fisher-Yates shuffle:

In [2]: %timeit choice(10_000, 10_000, replace=False)      
172 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit choice(100_000, 100_000, replace=False)    
1.62 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit choice(1_000_000, 1_000_000, replace=False)                                         
24.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit choice(10_000_000, 10_000_000, replace=False)                                                 
573 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit choice(100_000_000, 100_000_000, replace=False)                                               
6.95 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So Floyd is actually slightly faster. I'm not quite sure why. Maybe _shuffle_raw needs to handle strides, etc... which slows it down.

@bashtage
Contributor

Given your timings it is tempting to use this in all cases. Plus it reduces a branch.

I can see how your specialized hash table is much faster than using a Python set.

@thrasibule
Contributor Author

How does this compare to #10124 if p is set to 1/pop_size?

This algorithm is O(n) since it needs to draw n exponentials and then find the top k values, which is also O(n) (https://en.wikipedia.org/wiki/Selection_algorithm). So it's worse than Floyd.
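Roughly, my understanding of that approach (a sketch only, not the actual code in #10124; the function name is made up) is:

import numpy as np

def exp_key_sample(n, k, weights=None, rng=None):
    # Draw one exponential key per item (O(n)), then take the k smallest keys
    # with an O(n) partial selection; total cost is O(n) no matter how small k is.
    rng = rng or np.random.default_rng()
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    keys = rng.exponential(size=n) / w          # smaller key => more likely to be picked
    return np.argpartition(keys, k - 1)[:k]     # indices of the k smallest keys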

@wkschwartz

Can this close #5158?

@bashtage
Contributor

@mattip This is worth serious consideration for 1.17, when you have time. It is much easier to get this in now than to wait until after 1.17 since it will change the ordering for a given seed/bit generator. This implementation is a far superior one to the naive version that is currently in Generator.

@charris charris added this to the 1.17.0 release milestone Jun 26, 2019
@charris
Member
charris commented Jun 26, 2019

Needs rebase. Is that test correct? @bashtage Just to be sure, you think this should go in?

@bashtage
Contributor

It is a definite improvement, and since changing the stream is still difficult (just not impossible), it is better to get it in now rather than later.

@thrasibule could you do a simple statistical analysis of the changes to make sure that it appears to have the correct properties? You could, for example, sample k w/o replacement from arange(0, 100) many times and then compute the histogram of the empirical frequency of any observation appearing. You could also do more sophisticated tests, but a histogram is simple and would likely show any glaring issues.

@thrasibule
Contributor Author

@bashtage something like that ?

In [1]: import numpy as np                                                     

In [2]: from numpy.random.generator import choice                              

In [3]: nsim = 10_000
   ...: sample_size = 100_000
   ...: for k in [100, 200, 300, 500, 1000, 5000, 10_000]:
   ...:     counts = np.zeros(100_000)
   ...:     for l in range(10_000):
   ...:         counts += np.bincount(choice(100_000, k, replace=False), minlength=100_000)
   ...:     expected_var = nsim * k/sample_size * (1-k/sample_size)
   ...:     print(counts.var(), expected_var)
   ...:
9.9686 9.99
20.0336 19.96
29.98722 29.91
49.85672 49.75
99.60878 99.0
478.04698 475.0
903.26496 900.0

@thrasibule
Contributor Author

I have a couple of questions:

  • Should we keep the heuristic which switches to the tail shuffle, or just use Floyd?
  • I've disabled wraparound (cython.wraparound(False)) in the Cython block that I wrote, which makes sense since I build a power-of-2 sized hash table specifically to have fast wraparound, so there is no need for Cython to do its slow wraparound on top. I noticed the choice method is marked cython.wraparound(True), even though it doesn't look like the code in that method relies on wraparound. Should I just remove that line then?
  • One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order (see the small demo below). Maybe this should be mentioned in the docstring?
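Small demo of the n=k case, using the floyd_sample sketch from the first comment (illustrative, not output of this PR's Cython code, but the behavior is the same):

# At step j the set already contains exactly {0, ..., j-1}, so whether the draw
# collides or not, the value appended is always j: the output comes out sorted.
print(floyd_sample(5, 5))   # always [0, 1, 2, 3, 4]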

@charris
Member
charris commented Jun 28, 2019

One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order. Maybe this should be mentioned in the docstring?

Is this a concern? I vaguely recall that application was mentioned at some point with regard to bootstrapping(?) Or maybe that was for selection with replacement.

@bashtage
Contributor
  • One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order. Maybe this should be mentioned in the docstring?

Is this specific to this implementation? I don't think that is true for the Floyd implementation in master (just checked, it is not, e.g., g.choice(np.arange(100),100)). This is undesirable IMO, and feels a bit like an API change since one really needs to shuffle the choice to get the same API as the current one when n=k.

Can this be worked around if a different hash implementation was used (i.e. faster than set() but perhaps slower than your table)?

This makes switching between Floyd and tail seem stranger as well.

I've disabled wraparound (cython.wraparound(False)) in the Cython block that I wrote, which makes sense since I build a power-of-2 sized hash table specifically to have fast wraparound, so there is no need for Cython to do its slow wraparound on top. I noticed the choice method is marked cython.wraparound(True), even though it doesn't look like the code in that method relies on wraparound. Should I just remove that line then?

This is probably vestigial, but one has to be certain.

@bashtage
Contributor

Just a thought -- could you store a second table that has the order information for each value selected, and then do a quick reorder at the end?

So that when n=k, the order that each value is selected would be preserved in the output. For example, if 0 is selected 10th, then out[9] = 0, while if 0 is selected last, then out[-1] = 0.

@thrasibule
Contributor Author

Is this a concern? I vaguely recall that application was mentioned at some point with regard to bootstrapping(?) Or maybe that was for selection with replacement.

It depends what the use case is. A specific use case could be to have the sample in sorted order actually, so having it shuffled doesn't bring any benefit.

Is this specific to this implementation? I don't think that is true for the Floyd implementation in master (just checked, it is not, e.g., g.choice(np.arange(100),100)). This is undesirable IMO, and feels a bit like an API change since one really needs to shuffle the choice to get the same API as the current one when n=k.

No, the implementation in master already had the same issue. You couldn't see it because the heuristic does a tail shuffle when k=n. When k < n, with Floyd's algorithm, the indices will tend to be lower at the beginning of the sample than at the end.

I agree that the change in behavior is strange if we keep both implementations. So we can either keep just Floyd's and specify in the docstring that the sample is not a permutation, or keep both and do a shuffle at the end of Floyd's algorithm (this will still be O(k)).
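In terms of the sketch from the first comment, the shuffle-at-the-end variant would look roughly like this (still O(k) time and space; illustrative, not the PR's Cython):

import random

def floyd_sample_shuffled(n, k, rng=random):
    sample = floyd_sample(n, k, rng)        # O(k) selection (sketch above)
    # Fisher-Yates over the k selected values restores a uniformly random order
    # among all permutations of the sample, at O(k) extra cost.
    for i in range(k - 1, 0, -1):
        j = rng.randrange(i + 1)
        sample[i], sample[j] = sample[j], sample[i]
    return sample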

@bashtage
Contributor

Yup - you are completely correct. I see it now. (I also had the wrong command above).

IMO we need a shuffle at the end, since any users who expected similar behavior w.r.t. order from RandomState could be very surprised. This must have an impact on the break-even point, I suspect. Around 50%?

@bashtage
Contributor

If you want to consider a plain Floyd with no shuffle then I think this needs to go to the mailing list since it seems to be more than a refactor.

@charris
Member
charris commented Jun 30, 2019

@bashtage What should we do with this for 1.17?

@wkschwartz

I am just one user, but I am in favor of including the new algorithm without a shuffle at the end in v1.17. I could really use the fastest choice I can get my hands on. If I need a random permutation, I’ll call choice followed by permutation.

This is the one chance to simplify the API. Generator already is not a drop-in replacement for RandomState because they're not stream compatible.

If the change in API is too much to stomach, at least expose the fast path in a separate function so I can call it without the shuffle.

@wkschwartz

You’ve already done all the experiments to figure out the heuristic, which is just an extra line when you add in the branch for the keyword-dependent shuffle. Is it really that much more complex of code that you’d let performance suffer?

@bashtage
Contributor
bashtage commented Jul 7, 2019

You’ve already done all the experiments to figure out the heuristic, which is just an extra line when you add in the branch for the keyword-dependent shuffle. Is it really that much more complex of code that you’d let performance suffer?

I thought it was more involved, but in pseudo-code I think it would be:

if pop_size is small or size/pop_size is not too big or not shuffle:
    floyd
    if shuffle:
        do_shuffle
else:
    tail_shuffle

This said, I still think there is a case to be made for limiting the allocation of a large array when it isn't actually required, but then the tuning requires a lot of guesswork. A better solution would probably be to add something to the notes that explains that using shuffle=False minimizes memory use, and that if you need a shuffle you can always do it later. But this is also difficult to fully explain, since it would only be completely accurate in some circumstances.

@thrasibule
Contributor Author

I was thinking more something along these lines:

if shuffle:
    cutoff = 50
else:
    cutoff = 20

if pop_size is small or pop_size // size > cutoff:
    floyd
    if shuffle:
        do_shuffle
else:
    tail_shuffle

Seems like no solution is perfect anyway unless we expose all the knobs. I'll try to send an email to the mailing list summarizing the discussion so far. In any case, whatever gets decided will be strictly faster than before, so it's still a net win, even if not optimal in all situations.

@mattip
Member
mattip commented Jul 10, 2019

Sent the shuffle=True kwarg change to the mailing list

hash_set[loc] = j
idx_data[j - pop_size_i + size_i] = j
if shuffle:
    self._shuffle_int(size_i, 1, idx_data)
Member

Is there a reason not to reuse the existing shuffle or _shuffle_raw ?

Contributor Author

I've added _shuffle_int as a specialized version of _shuffle_raw because it's ~30% faster for integers.

@thrasibule
Contributor Author

Thanks @mattip for the post on the mailing list. I've tried to address all the suggestions so far.

I've also experimented with using a bitset instead of a hash table. It makes sense to use it from a memory perspective when k >= N/64. It's also slightly faster because testing for an element is just a bit shift and an AND operation. I've written the code in a different branch, see here. I've refactored the logic of the Floyd algorithm into 2 functions which I've added as static methods to the Generator class. I'm not sure it's the best place to put them; maybe generator.pxd would be better. Anyway, here are some timings: it's faster in the shuffle=False, k >= N/50 range.

| N, k | HashSet, shuffle=False | BitSet, shuffle=False | HashSet, shuffle=True | BitSet, shuffle=True |
| --- | --- | --- | --- | --- |
| N=1e8, k=1e5 | 1.97 ms ± 12.4 µs | 2.07 ms ± 13.3 µs | 2.79 ms ± 19.6 µs | 3.02 ms ± 35.9 µs |
| N=1e8, k=1e6 | 54.3 ms ± 1.06 ms | 52.8 ms ± 1.71 ms | 67.6 ms ± 639 µs | 65.1 ms ± 1.7 ms |
| N=1e8, k=2e6 | 125 ms ± 1.09 ms | 104 ms ± 1.36 ms | 175 ms ± 2.09 ms | 147 ms ± 1.44 ms |
| N=1e8, k=5e6 | 354 ms ± 6.78 ms | 265 ms ± 4.75 ms | 372 ms ± 2.38 ms | 378 ms ± 9.21 ms |
| N=1e8, k=1e7 | 610 ms ± 8.72 ms | 507 ms ± 6.06 ms | 625 ms ± 9.38 ms | 629 ms ± 6.28 ms |
| N=1e8, k=5e7 | 2.54 s ± 35.6 ms | 2.09 s ± 44.1 ms | 2.53 s ± 11.5 ms | 2.62 s ± 13.3 ms |
| N=1e8, k=1e8 | 4.81 s ± 77 ms | 2.93 s ± 21.2 ms | 4.74 s ± 65.6 ms | 4.85 s ± 49.5 ms |
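For reference, the bitset variant boils down to something like this pure-Python model (the branch linked above does it in Cython over uint64 words; the class and names here are only illustrative):

class BitSet:
    # One bit per candidate value in [0, n): about n/8 bytes of storage, so it
    # uses less memory than the hash table once k >= n/64 or so.
    def __init__(self, n):
        self.words = [0] * ((n + 63) // 64)   # 64-bit words

    def add(self, v):
        self.words[v >> 6] |= 1 << (v & 63)

    def __contains__(self, v):
        # Membership is just a shift, an AND and a bit test.
        return (self.words[v >> 6] >> (v & 63)) & 1 == 1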

@charris charris merged commit 66c6793 into numpy:master Jul 13, 2019
@charris
Member
charris commented Jul 13, 2019

Thanks @thrasibule .
