MAINT: Rewrite Floyd algorithm by thrasibule · Pull Request #13812 · numpy/numpy · GitHub
MAINT: Rewrite Floyd algorithm #13812


Merged
merged 13 commits into numpy:master on Jul 13, 2019
Conversation

thrasibule
Contributor
  • use a simple hash table with linear probing
  • adjust heuristic

@thrasibule
Contributor Author

This is to address scipy/scipy#9699. See also #5299. To sample sparse matrices with a given density of non-zero values, scipy calls choice(N, k, replace=False), where N is large and k << N. The best algorithm for this is Floyd's algorithm, which is O(k). The current implementation was rarely used because of an incorrect heuristic (it looks like the heuristic assumed Floyd's algorithm is O(k**2), where it's actually O(k)).

I've also replaced the set structure in Python with a simple hash table, which has much lower memory usage and is quite a bit faster. Timing tests show that the new version of the Floyd sampling beats the Fisher-Yates shuffle even when k ~ N.
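For reference, here is a minimal pure-Python sketch of Floyd's algorithm (the actual change is in Cython with the hash table described above; this simplified version, with the made-up name floyd_sample, just uses a Python set and the stdlib random module):

import random

def floyd_sample(n, k, rng=random):
    # Sample k distinct integers from range(n) in O(k) expected time.
    selected = set()
    sample = []
    for j in range(n - k, n):       # only the last k candidate positions are visited
        t = rng.randrange(j + 1)    # uniform draw from 0..j inclusive
        if t in selected:           # t was already picked earlier, so take j itself;
            t = j                   # j cannot have been picked yet at this step
        selected.add(t)
        sample.append(t)
    return sample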

@charris charris changed the title from Rewrite Floyd algorithm to MAINT: Rewrite Floyd algorithm on Jun 20, 2019
set_size = <uint64_t>(1.2 * size_i)
mask = _gen_mask(set_size)
set_size = 1 + mask
hash_set = <uint64_t*>malloc(sizeof(uint64_t) * set_size)
Contributor

The gain from using malloc and free seems pretty small compared to allocating an array with dtype uint64_t. If you go this way you need to gracefully handle the unlikely event that malloc fails (if you use an array, you get that handling for free). There is also a question as to whether you should be using malloc/free or PyMem_Malloc/PyMem_Free, which I'm not certain about. The latter is faster under some (unknown to me) conditions.

Contributor

Changes look good.

idx_data[loc] = val = j
idx_set.add(val)
loc += 1
val = random_interval(&self._bitgen, j)
Contributor

You should be using random_bounded_uint64(&self._bitgen, 0, j, 0, 0) here which is faster for most j that aren't very close to powers of 2.

@bashtage
Contributor
bashtage commented Jun 21, 2019

The current implementation was rarely used because of an incorrect heuristic (it looks like the heuristic assumed Floyd's algorithm is O(k**2), where it's actually O(k)).

The O(k**2) comes from the fact that a lookup in a Python set is worst case O(n), where n is the number of elements; I don't know exactly where that worst case kicks in. If it hits O(n) well before n ≈ k (in particular, for some value where 1 - n/k > c > 0 for some c > 0), then one would end up doing on the order of c*k lookups at worst-case cost each, hence the O(k**2). This plus timing led to the improved tail shuffle (which is O(pop_size) since it allocates an array 1:pop_size).

I assumed this is correct: https://wiki.python.org/moin/TimeComplexity

@bashtage
Contributor

Timing tests show that the new version of the Floyd sampling beats the Fisher-Yates shuffle even when k~N.

If you have time, it would be good to improve the heuristic, since you are still switching for k > 0.1N.

@bashtage
Contributor

If you could show that it is faster for k >= 0.5N, even if it is slower for k > x% of N for x near 1 (and large N), then it might be possible to return the complement (although I'm not sure how the ordering would work).
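Roughly, the complement idea could look like the sketch below (hypothetical names, and it deliberately sidesteps the ordering question: the result comes out sorted, not permuted):

import numpy as np

def sample_via_complement(n, k, rng=None):
    # When k > n/2 it is cheaper to sample the n - k values to EXCLUDE,
    # then return everything else. The output is in ascending order,
    # which is exactly the ordering caveat mentioned above.
    rng = rng or np.random.default_rng()
    excluded = rng.choice(n, n - k, replace=False)
    keep = np.ones(n, dtype=bool)
    keep[excluded] = False
    return np.flatnonzero(keep)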

@bashtage
Contributor

How does this compare to #10124 if p is set to 1/pop_size?

@thrasibule
Contributor Author

For the set complexity, my understanding is that a lookup is O(1) amortized, and worst case O(n) if the hash table behind it gets resized. But we know exactly how big the hash table needs to be, which is what I did. There is also no need to worry about collisions because we're sampling uniformly from 1..N, so the data is as well behaved as possible.
So theoretically they're both O(k) in time, but Fisher-Yates is O(n) in space and Floyd is O(k). The only reason I still have a heuristic is that when k=n it seems hard to beat Fisher-Yates, which just does n swaps, which is about as fast as it gets.
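To make the hash-table point concrete, here is a rough pure-Python model of the structure (the PR does this in Cython over a raw uint64 buffer; the class and names below are only illustrative):

class FixedHashSet:
    # Open-addressing hash set with linear probing, sized once up front so it
    # never resizes; the table size is a power of two so that "index & mask"
    # replaces the modulo when a probe wraps around.
    EMPTY = -1  # sentinel; keys are assumed to be non-negative integers

    def __init__(self, expected_size):
        size = 1
        while size < 1.2 * expected_size:   # ~20% headroom keeps probe chains short
            size *= 2
        self.mask = size - 1
        self.table = [self.EMPTY] * size

    def add(self, key):
        i = key & self.mask
        while self.table[i] != self.EMPTY:  # linear probing
            if self.table[i] == key:
                return False                # already present
            i = (i + 1) & self.mask         # cheap power-of-two wraparound
        self.table[i] = key
        return True                         # newly inserted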

@thrasibule
Contributor Author

To give an idea, timing results on my machine.
For the rewritten Floyd algorithm:

In [5]: %timeit choice(10_000, 10_000, replace=False)        
101 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit choice(100_000, 100_000, replace=False)    
861 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit choice(1_000_000, 1_000_000, replace=False)                                                 
14.1 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit choice(10_000_000, 10_000_000, replace=False)                                                 
498 ms ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit choice(100_000_000, 100_000_000, replace=False)                                             
6 s ± 53.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit choice(1_000_000_000, 100_000_000, replace=False)                                           
9.01 s ± 70.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit choice(10_000_000_000, 100_000_000, replace=False)                                          
9.99 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For the tail Fisher-Yates shuffle:

In [2]: %timeit choice(10_000, 10_000, replace=False)      
172 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit choice(100_000, 100_000, replace=False)    
1.62 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit choice(1_000_000, 1_000_000, replace=False)                                         
24.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit choice(10_000_000, 10_000_000, replace=False)                                                 
573 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit choice(100_000_000, 100_000_000, replace=False)                                               
6.95 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So Floyd is actually slightly faster. I'm not quite sure why. Maybe _shuffle_raw needs to handle strides, etc... which slows it down.

@bashtage
Contributor

Given your timings it is tempting to use this in all cases. Plus it reduces a branch.

I can see how your specialized hash table is much faster than using a Python set.

@thrasibule
Contributor Author

How does this compare to #10124 if p is set to 1/pop_size?

This algorithm is O(n) since it needs to draw n exponentials and then find the top k values, which is also O(n) (https://en.wikipedia.org/wiki/Selection_algorithm). So it's worse than Floyd.
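Roughly, my understanding of that approach (a sketch only, not the actual code in #10124; the function name is made up) is:

import numpy as np

def exp_key_sample(n, k, weights=None, rng=None):
    # Draw one exponential key per item (O(n)), then take the k smallest keys
    # with an O(n) partial selection; total cost is O(n) no matter how small k is.
    rng = rng or np.random.default_rng()
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    keys = rng.exponential(size=n) / w          # smaller key => more likely to be picked
    return np.argpartition(keys, k - 1)[:k]     # indices of the k smallest keys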

@wkschwartz

Can this close #5158?

@bashtage
Contributor

@mattip This is worth serious consideration for 1.17, when you have time. It is much easier to get this in now than to wait until after 1.17 since it will change the ordering for a given seed/bit generator. This implementation is a far superior one to the naive version that is currently in Generator.

@charris charris added this to the 1.17.0 release milestone Jun 26, 2019
@charris
Member
charris commented Jun 26, 2019

Needs rebase. Is that test correct? @bashtage Just to be sure, you think this should go in?

@bashtage
Contributor

It is a definite improvement, and since changing the stream is still difficult (just not impossible), it is better to get it in now rather than later.

@thrasibule could you do a simple statistical analysis of the changes to make sure that it appears to have the correct properties? You could, for example, sample k w/o replacement from arange(0, 100) many times and then compute the histogram of the empirical frequency of any observation appearing. You could also do more sophisticated tests, but a histogram is simple and would likely show any glaring issues.

@thrasibule
Contributor Author

@bashtage something like that ?

In [1]: import numpy as np                                                     

In [2]: from numpy.random.generator import choice                              

In [3]: nsim = 10_000
   ...: sample_size = 100_000
   ...: for k in [100, 200, 300, 500, 1000, 5000, 10_000]:
   ...:     counts = np.zeros(100_000)
   ...:     for l in range(10_000):
   ...:         counts += np.bincount(choice(100_000, k, replace=False), minlength=100_000)
   ...:     expected_var = nsim * k/sample_size * (1-k/sample_size)
   ...:     print(counts.var(), expected_var)
   ...:
9.9686 9.99
20.0336 19.96
29.98722 29.91
49.85672 49.75
99.60878 99.0
478.04698 475.0
903.26496 900.0

@thrasibule
Contributor Author

I have a couple of questions:

  • Should we keep the heuristic which switches to the tail shuffle, or just use Floyd?
  • I've disabled wraparound (cython.wraparound(False)) in the Cython block that I wrote, which makes sense since I build a power-of-2 sized hash table specifically to have fast wraparound, so there is no need for Cython to do its slow wraparound on top. I noticed the choice method is marked cython.wraparound(True), even though it doesn't look like the code in that method relies on wraparound. Should I just remove that line then?
  • One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order (see the small demo below). Maybe this should be mentioned in the docstring?
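Small demo of the n=k case, using the floyd_sample sketch from the first comment (illustrative, not output of this PR's Cython code, but the behavior is the same):

# At step j the set already contains exactly {0, ..., j-1}, so whether the draw
# collides or not, the value appended is always j: the output comes out sorted.
print(floyd_sample(5, 5))   # always [0, 1, 2, 3, 4]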

@charris
Member
charris commented Jun 28, 2019

One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order. Maybe this should be mentioned in the docstring?

Is this a concern? I vaguely recall that application was mentioned at some point with regard to bootstrapping(?) Or maybe that was for selection with replacement.

@bashtage
Contributor
  • One difference between the tail shuffle algorithm and Floyd is that with the tail shuffle, the sample comes out in random order. In particular, any slice of size k' < k is a valid sample without replacement. Whereas with Floyd, the order is arbitrary but not uniform among the permutations of size k. This is especially clear when n=k, where Floyd will always return 0..n-1 in that order. Maybe this should be mentioned in the docstring?

Is this specific to this implementation? I don't think that is true for the Floyd implementation in master (just checked, it is not, e.g., g.choice(np.arange(100),100)). This is undesirable IMO, and feels a bit like an API change since one really needs to shuffle the choice to get the same API as the current one when n=k.

Can this be worked around if a different hash implementation was used (i.e. faster than set() but perhaps slower than your table)?

This makes switching between Floyd and tail seem stranger as well.

I've disabled wraparound (cython.wraparound(False)) in the Cython block that I wrote, which makes sense since I build a power-of-2 sized hash table specifically to have fast wraparound, so there is no need for Cython to do its slow wraparound on top. I noticed the choice method is marked cython.wraparound(True), even though it doesn't look like the code in that method relies on wraparound. Should I just remove that line then?

This is probably vestigial, but one has to be certain.

@bashtage
Contributor

Just a thought -- could you store a second table that has the order information for each value selected, and then do a quick reorder at the end?

So that when n=k, the order that each value is selected would be preserved in the output. For example, if 0 is selected 10th, then out[9] = 0, while if 0 is selected last, then out[-1] = 0.

@thrasibule
Contributor Author

Is this a concern? I vaguely recall that application was mentioned at some point with regard to bootstrapping(?) Or maybe that was for selection with replacement.

It depends what the use case is. A specific use case could be to have the sample in sorted order actually, so having it shuffled doesn't bring any benefit.

Is this specific to this implementation? I don't think that is true for the Floyd implementation in master (just checked, it is not, e.g., g.choice(np.arange(100),100)). This is undesirable IMO, and feels a bit like an API change since one really needs to shuffle the choice to get the same API as the current one when n=k.

No, the implementation in master already had the same issue. You couldn't see it because the heuristic does a tail shuffle when k=n. When k < n, with Floyd's algorithm, the indices will tend to be lower at the beginning of the sample than at the end.

I agree that the change in behavior is strange if we keep both implementations. So we can either keep just Floyd's and specify in the docstring that the sample is not a permutation, or keep both and do a shuffle at the end of Floyd's algorithm (this will still be O(k)).
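In terms of the sketch from the first comment, the shuffle-at-the-end variant would look roughly like this (still O(k) time and space; illustrative, not the PR's Cython):

import random

def floyd_sample_shuffled(n, k, rng=random):
    sample = floyd_sample(n, k, rng)        # O(k) selection (sketch above)
    # Fisher-Yates over the k selected values restores a uniformly random order
    # among all permutations of the sample, at O(k) extra cost.
    for i in range(k - 1, 0, -1):
        j = rng.randrange(i + 1)
        sample[i], sample[j] = sample[j], sample[i]
    return sample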

@bashtage
Contributor

Yup - you are completely correct. I see it now. (I also had the wrong command above).

IMO we need a shuffle at the end, since any users who expected similar behavior w.r.t. order from RandomState could be very surprised. This must have an impact on the break-even point, I suspect. Around 50%?

@bashtage
Contributor

If you want to consider a plain Floyd with no shuffle then I think this needs to go to the mailing list since it seems to be more than a refactor.

@charris
Member
charris commented Jun 30, 2019

@bashtage What should we do with this for 1.17?

@wkschwartz

I am just one user, but I am in favor of including the new algorithm without a shuffle at the end in v1.17. I could really use the fastest choice I can get my hands on. If I need a random permutation, I’ll call choice followed by permutation.

This is the one chance to simplify the API. Generator already is not a drop-in replacement for RandomState because they're not stream compatible.

If the change in API is too much to stomach, at least expose the fast path in a separate function so I can call it without the shuffle.

@wkschwartz

You’ve already done all the experiments to figure out the heuristic, which is just an extra line when you add in the branch for the keyword-dependent shuffle. Is it really that much more complex of code that you’d let performance suffer?

@bashtage
Contributor
bashtage commented Jul 7, 2019

You’ve already done all the experiments to figure out the heuristic, which is just an extra line when you add in the branch for the keyword-dependent shuffle. Is it really that much more complex of code that you’d let performance suffer?

I thought it was more involved, but in pseudo-code I think it would be:

if pop_size is small or size/pop_size is not too big or not shuffle:
    floyd
    if shuffle:
        do_shuffle
else:
    tail_shuffle

This said, I still think there is a case to be made for limiting the allocation of a large array when it isn't actually required, but then the tuning requires a lot of guesswork. A better solution would probably be to add something to the notes that explains that using shuffle=False minimizes memory use, and that if you need a shuffle you can always do it later. But this is also difficult to fully explain, since it would only be completely accurate in some circumstances.

@thrasibule
Contributor Author

I was thinking more something along these lines:

if shuffle:
    cutoff = 50
else:
    cutoff = 20

if pop_size is small or pop_size // size > cutoff:
    floyd
    if shuffle:
        do_shuffle
else:
    tail_shuffle

Seems like no solution is perfect anyway unless we expose all the knobs. I'll try to send an email to the mailing list summarizing the discussion so far. In any case, whatever gets decided will be strictly faster than before, so it's still a net win, even if not optimal in all situations.

@mattip
Member
mattip commented Jul 10, 2019

Sent the shuffle=True kwarg change to the mailing list

hash_set[loc] = j
idx_data[j - pop_size_i + size_i] = j
if shuffle:
    self._shuffle_int(size_i, 1, idx_data)
Member

Is there a reason not to reuse the existing shuffle or _shuffle_raw ?

Contributor Author

I've added _shuffle_int as a specialized version of _shuffle_raw because it's ~30% faster for integers.

@thrasibule
Contributor Author

Thanks @mattip for the post on the mailing list. I've tried to address all the suggestions so far.

I've also experimented with using a bitset instead of a hash table. It makes sense to use it from a memory perspective when k >= N/64. It's also slightly faster because testing for an element is just a bit shift and an AND operation. I've written the code in a different branch, see here. I've refactored the logic of the Floyd algorithm into 2 functions which I've added as static methods to the Generator class. I'm not sure it's the best place to put them; maybe generator.pxd would be better. Anyway, here are some timings: it's faster in the shuffle=False, k >= N/50 range.

| N, k | HashSet, shuffle=False | BitSet, shuffle=False | HashSet, shuffle=True | BitSet, shuffle=True |
| --- | --- | --- | --- | --- |
| N=1e8, k=1e5 | 1.97 ms ± 12.4 µs | 2.07 ms ± 13.3 µs | 2.79 ms ± 19.6 µs | 3.02 ms ± 35.9 µs |
| N=1e8, k=1e6 | 54.3 ms ± 1.06 ms | 52.8 ms ± 1.71 ms | 67.6 ms ± 639 µs | 65.1 ms ± 1.7 ms |
| N=1e8, k=2e6 | 125 ms ± 1.09 ms | 104 ms ± 1.36 ms | 175 ms ± 2.09 ms | 147 ms ± 1.44 ms |
| N=1e8, k=5e6 | 354 ms ± 6.78 ms | 265 ms ± 4.75 ms | 372 ms ± 2.38 ms | 378 ms ± 9.21 ms |
| N=1e8, k=1e7 | 610 ms ± 8.72 ms | 507 ms ± 6.06 ms | 625 ms ± 9.38 ms | 629 ms ± 6.28 ms |
| N=1e8, k=5e7 | 2.54 s ± 35.6 ms | 2.09 s ± 44.1 ms | 2.53 s ± 11.5 ms | 2.62 s ± 13.3 ms |
| N=1e8, k=1e8 | 4.81 s ± 77 ms | 2.93 s ± 21.2 ms | 4.74 s ± 65.6 ms | 4.85 s ± 49.5 ms |
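For reference, the bitset variant boils down to something like this pure-Python model (the branch linked above does it in Cython over uint64 words; the class and names here are only illustrative):

class BitSet:
    # One bit per candidate value in [0, n): about n/8 bytes of storage, so it
    # uses less memory than the hash table once k >= n/64 or so.
    def __init__(self, n):
        self.words = [0] * ((n + 63) // 64)   # 64-bit words

    def add(self, v):
        self.words[v >> 6] |= 1 << (v & 63)

    def __contains__(self, v):
        # Membership is just a shift, an AND and a bit test.
        return (self.words[v >> 6] >> (v & 63)) & 1 == 1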

@charris charris merged commit 66c6793 into numpy:master Jul 13, 2019
@charris
Member
charris commented Jul 13, 2019

Thanks @thrasibule .
