np.random.choice without replacement or weights is less efficient than random.sample? · Issue #2764 · numpy/numpy · GitHub


Closed
yoavram opened this issue Nov 23, 2012 · 34 comments

Comments

@yoavram commented Nov 23, 2012

I was interested in choosing k elements out of n, without replacement and without weighted probabilities, with k << n.

I've tested using the following ipython notebook: source / view.

The outcome was that using random.sample(xrange(n),k) was ~50-fold faster than using np.random.permutation(n)[:k], which is the way it is currently coded.
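
For reference, a minimal benchmark along these lines on a modern Python (range instead of xrange; timings vary by machine and numpy version):

import random
import timeit

import numpy as np

n, k = 10**6, 10

# Stdlib sampler: touches only about k values.
t_sample = timeit.timeit(lambda: random.sample(range(n), k), number=100)

# The permutation-based path used by np.random.choice(..., replace=False):
# shuffles all n values and keeps the first k.
t_perm = timeit.timeit(lambda: np.random.permutation(n)[:k], number=100)

print(f"random.sample:         {t_sample:.4f} s")
print(f"np.random.permutation: {t_perm:.4f} s")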

@seberg (Member) commented Nov 23, 2012

It seems one could basically copy-paste the code from np.random.permutation. Should I do that in PR gh-2727? The thing is, this seems like an obvious and simple improvement, but it would leave the random number generator in a different state, so it is probably better to do it now rather than later, since someone might want to be able to rely on reproducible results across versions.

@seberg (Member) commented Nov 23, 2012

Actually, sorry, forget that; it's not quite that simple of course... you would probably need a point at which to switch methods, or optimize only for the case of few rather than many elements taken.

@yoavram (Author) commented Nov 26, 2012

Or at least have a flag argument that allows the user to indicate that k << n.

@seberg (Member) commented Nov 27, 2012

@yoavram, honestly I think an extra flag is likely confusing (and if you want to change it, the k << n special case might be more important, since there is an existing function to handle the other case). If it bothers you, maybe you can implement this? Maybe even with a simple speed test for some simple switching rule (likely e.g. the R function already has good code to take inspiration from)... I am not quite sure how it's handled, but if it happens soon, maybe you can convince people to get it into 1.7, to avoid changing the RNG state later, which might need warnings, etc.

@yoavram (Author) commented Oct 2, 2015

Going back to this, scipy/scipy#3650 pretty much ironed out the details of how to approach this (see the sketch below):

  • if n < 3k, use np.random.permutation
  • if n > 3k, use random.sample

@seberg, would you like me to make a PR? If so, should I write a test, and where?
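
A rough sketch of that switching rule (choice_without_replacement is a hypothetical helper, not the eventual numpy code, and it mixes the stdlib RNG with numpy's purely for illustration):

import random

import numpy as np

def choice_without_replacement(n, k):
    # k is a sizeable fraction of n: a full permutation is cheap enough.
    if n < 3 * k:
        return np.random.permutation(n)[:k]
    # k << n: rejection-style sampling (as in random.sample) touches only ~k values.
    return np.array(random.sample(range(n), k))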

@seberg (Member) commented Oct 2, 2015

@yoavram I am happy to see improvements, but take care about the random number stream. There were some thoughts on how to handle versioning, etc., but I am not quite sure about the state of that right now.
The thing is, we must not change the stream of random numbers that someone who uses seeding relies on, etc. I think there was some other discussion about the efficiency of choice as well, if you search the issues.

Anyway, tests go just in the tests/test_random.py file in the random subfolder.

@njsmith (Member) commented Oct 2, 2015

NB: the current implementation using permutation is also horribly memory-inefficient for k << n (see #5299).

And right, historically the main reason that we haven't fixed such things is that we haven't wanted to change the implementation of RNG methods because we want results to be deterministically reproducible across versions when someone uses a seed.

The solution to that problem has been discussed extensively and agreement reached on how to handle it, via adding a "version" concept to RandomState -- there's an implementation in #5911.

That PR is good on the actual versioning logic, but is stalled out because it doesn't make sense to merge the logic for handling improved RNG methods until we actually have at least one improved RNG method implementation :-). #5911 actually attempted to do this by improving random.choice, but it turns out that handling the k << n case is somewhat tricky because it requires a set data structure. (If you look at the implementation of random.sample, it requires a set in this case. Its algorithm is also quite inefficient when k ~ n, because it's pure rejection sampling. Floyd's algorithm is beautiful and efficient for arbitrary k, but again requires a set data structure, and it also needs a permutation as a post-processing step.)
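
For concreteness, a minimal Python sketch of Floyd's algorithm over range(n), using a plain Python set and the stdlib RNG (illustrative only; a numpy implementation would use its own generator and, ideally, a C-level set):

import random

def floyd_sample(n, k):
    """Return k distinct integers from range(n) in uniformly random order."""
    chosen = set()
    for j in range(n - k, n):
        t = random.randint(0, j)  # uniform on 0..j inclusive
        # If t was already picked, take j instead; j is new by construction.
        chosen.add(j if t in chosen else t)
    sample = list(chosen)
    random.shuffle(sample)  # the post-processing permutation mentioned above
    return sample

Memory usage is O(k), and only k draws are made before the final shuffle.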

It's certainly possible to include an efficient set data structure for ints into np.random, which is what we'd need here -- pandas uses a Cython wrapper around khash for such things. Importing khash may be a bit more of a job than you were hoping for, but it is certainly doable :-)

@yoavram (Author) commented Oct 2, 2015

OK, I'm a bit lost. Can we make a list of action bullets of what needs to be done? And also, what's wrong with Python's native set?

@njsmith (Member) commented Oct 3, 2015

Todo list:

  • Implement a better version of choice inside np.random. Ideally one whose memory usage is O(k) and with some benchmarks and tuning to make sure it's efficient across a variety of k and n values.
  • Implement a framework for managing the "version" of a RandomState object. The basic problem here is that if someone has existing code that does r = np.random.RandomState(seed=0); r.choice(...), then this should behave the same in all versions of numpy -- which means that we'll have to keep the old, bad implementation of choice around and fall back to it in some cases. We have a good design for how to handle this, and most of the necessary implementation can be found in #5911 ("Allow multiple versions of RandomState: init' work"). Reading the discussion on that PR should give you a sense of what's going on and what still needs to be done (if anything).

The problem with Python's native set is just that it works with Python objects instead of raw ints, so it's relatively slow and memory hungry compared to a C-level set implementation. On further thought, I guess this might be fine for a first version (though it would be good to see the benchmarks), because it's only a constant-factor slowdown and -- crucially -- we can always swap out the set implementation without affecting the random stream.

@ghost commented Oct 22, 2015

The algorithm in (Vitter, 1987) may also be worth considering, because it doesn't rely on a set data structure. I ported the code in the article to Python, see this gist, but I didn't attempt to understand precisely how it works... I made a very literal transcription (multiple statements on a single line, etc.) to try to minimize errors, so the style can be improved.

The algorithm consists of two sub-algorithms, a simple one with O(N) complexity and a more complex one, to reach an overall complexity of O(n) (choosing a sample of length n out of a population of length N). The complicated sub-algorithm is based on the "acceptance-rejection method of Von Neumann", according to (Nair, 1990). This second article also describes efficiency improvements, but these are not included in the gist.

Another implementation of the same algorithm in C can be found here: https://github.com/WebDrake/GrSL

Given the amount and complexity of the code, I don't advocate including it in NumPy. In addition, a subsequent shuffle is required, as the algorithm returns an ordered sample. But maybe it's nice to include in some benchmarks. For comparison, there's also scikit-learn's sample_without_replacement, which implements "tracking selection", "pool" and "reservoir sampling" algorithms in Cython.
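
For anyone who wants to try the scikit-learn routine directly for comparison, the call looks roughly like this (assuming scikit-learn is installed; the method argument can also be forced to one of the specific algorithms):

from sklearn.utils.random import sample_without_replacement

# 10 distinct indices out of a population of one million.
idx = sample_without_replacement(n_population=10**6, n_samples=10,
                                 method="auto", random_state=0)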

References

Vitter JS (1987) 'An efficient algorithm for sequential random sampling.' ACM Trans. Math. Softw. 13(1): 58–67.
Nair KA (1990) 'An improved algorithm for ordered sequential random sampling.' ACM Trans. Math. Softw. 16(3): 269–274.

(A subscription is needed to read the second article, Google Scholar knows how to find the first one)

@njsmith (Member) commented Oct 22, 2015

Wow, you weren't kidding about that being complicated :-). Having stared at the code for a bit, I still have no insight into how it works or what the complexity is, but I am highly suspicious of all the use of floating point math. It seems incredibly unlikely that the Vitter algorithm is correct or consistent on processors that cannot do exact real arithmetic (i.e. all of them). One of the first steps in the first function is to cast N to floating point, but one of our motivating use cases (#5299) is N = 2**64, which is in the range where not all integers are even representable as floating point.

The scikit-learn code doesn't seem to have any big surprises, but agreed that it's a good thing to look at for comparison.

@argriffing (Contributor) commented:

OK, I'm a bit lost. Can we make a list of action bullets of what needs to be done?

Several np.random improvements (e.g. ziggurat for normal, stick breaking for dirichlet, Floyd's for random choice) are blocked by

#2727 (comment):

It's critical that these methods be deterministic (given the seed) in order to allow people to, e.g., reproduce scientific results across numpy releases.

and are waiting for a change like #5911 to be merged. Although I don't need this determinism across releases myself, there is no consensus to break it in numpy. I'd like "We have a good design for how to handle this, and most of the necessary implementation can be found in #5911" to be true, but I don't know what its actual status is.

@njsmith (Member) commented Oct 22, 2015

I'd like "We have a good design for how to handle this, and most of the necessary implementation can be found in #5911" to be true but I don't know what is its actual status.

AFAIK that is its actual status :-). Are you concerned about something in the design? The mailing list thread discussing this (and apparently reaching consensus that it's a good way forward) is here: https://mail.scipy.org/pipermail/numpy-discussion/2015-May/072952.html

The above comment is a little confusing, because AFAIK the main blocker on #5911 getting merged is actually the lack of a good use case (we don't like merging code that isn't actually used). If choice turns out to be complex and we want to get the changes in #5911 in, then maybe someone should implement a stick-breaking dirichlet to push this through? That's a relatively trivial algorithm. (Disclaimer: based on my vague memories of implementing it some time ago, YMMV :-).)

@ghost commented Oct 23, 2015

one of our motivating use cases (#5299) is N = 2**64

But I think currently the allowed range is up to np.iinfo(np.intp).max, and going beyond that would complicate the code for choice too much, I would guess? (I'm not at all familiar with Numpy development)

@rkern (Member) commented Oct 23, 2015

Sure, N=2**63 then, for typical 64-bit platforms. The point is that it is greater than 2**53, the point at which a double stops being able to represent every integer exactly.
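
A quick way to see that cutoff:

print(float(2**53) == float(2**53 + 1))  # True: 2**53 + 1 has no exact float64 representation
print(float(2**53 - 1) == float(2**53))  # False: every integer up to 2**53 is exact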

@ghost commented Dec 4, 2015

I've made a couple of Cython implementations of Floyd's algorithm. They differ only in their set data structures:

  • Python set.
  • Linear probing.
  • Quadratic probing with array size a power of 2. This is the same design used by khash.

I ran benchmarks against 3 other functions:

  • The current implementation of choice.
  • Scikit-learn (see their code here).
  • An algorithm for ordered selection due to Ahrens & Dieter (1985).

I no longer considered the algorithm I mentioned earlier (Vitter, 1987), because of the precision problems pointed out by njsmith (the article mentions this issue as well).

Code and benchmark results are in this gist.
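
For illustration, the power-of-two quadratic-probing design mentioned above might look roughly like this in pure Python (the benchmarked versions in the gist are Cython; all names here are made up):

class IntSet:
    """Toy open-addressing set for non-negative ints, sized up front."""

    EMPTY = -1

    def __init__(self, expected, load_factor=0.5):
        size = 1
        while size * load_factor < expected:
            size *= 2  # keep the table size a power of two
        self.mask = size - 1
        self.table = [self.EMPTY] * size

    def add(self, value):
        """Insert value; return True if it was not already present."""
        i = value & self.mask
        step = 0
        # Triangular-number probing visits every slot of a power-of-two table.
        while self.table[i] not in (self.EMPTY, value):
            step += 1
            i = (i + step) & self.mask
        is_new = self.table[i] == self.EMPTY
        self.table[i] = value
        return is_new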

I hope this is useful in some way :) What would be the next step?

@seberg (Member) commented Dec 4, 2015

My guess is that you create a PR based on the work of @anntzer in gh-5911, since it sounds like you are probably happy with at least one of the implementations, and PRs are a pretty nice way to comment on the code. Of course, if we still have to discuss the algorithm choice, maybe that should happen first.

Either we put gh-5911 in first and then finish your PR, or you can clone anntzer's work and include it / help with the finishing touches.

@anntzer (Contributor) commented Dec 5, 2015

Please let me know if you need any help, I'd be happy to see my PR in.

@ghost commented Dec 5, 2015

I'll try to make a PR... But I've actually never used git, so I will read up on that and need some time to try stuff out. I will ping you, @anntzer, when I get lost.

As for my implementation, I've found there are still some specific conditions where addition to the set has quadratic run time, so I'm also looking into that. (Or rather, the whole sampling process has quadratic run time, but it goes wrong at the set addition.)

@anntzer (Contributor) commented Dec 5, 2015

On a side note, I see you mentioned that the shuffle dominates the runtime of the fastest algorithm. In fact, numpy's shuffle is not very good; see #5514 and #6776.

@ghost commented Dec 9, 2015

The PR is in. I figured it would be best to keep the discussion separate from the versioning logic and worry about merging later. Maybe GitHub can merge both #5911 and mine without conflicts? Still very new to this...

I accidentally referenced this issue from my commit; I will try to avoid that in the future :)

@ryanpeach commented:

Any progress on this? I implemented my suggestion and showed some math to back it up.

@ryanpeach commented Oct 12, 2017

So this is my code snippet:

# `self` is assumed to be a RandomState-like object providing randint.
idx = set()
while len(idx) < size:
    idx.add(self.randint(0, pop_size))
idx = np.array(list(idx))
np.random.shuffle(idx)

And here is my suggestion:

NOTE: This math is wrong. It should be $P(O(k)) = \frac{n!}{n^k (n-k-1)!}$ and $\log_n(n!) > k + \log_n((n-k-1)!)$.

The way this is currently written is O(n), where n is your array size. When the sample is small but the array is large, however, this is very slow. The following algorithm's best-case performance is O(k), where k is your sample size. The worst case is unfortunately unbounded: there is always some probability that the random sampling keeps selecting values it has already selected.

However, the probability that it finishes in O(k) time is the following:

$P(O(k)) = \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot ... \cdot \frac{n-k}{n} = \frac{n!}{n^k}$

Let's say we want to choose this approach if the probability of O(k) time is > 0.99, and call that value z.

$P(O(k)) > 0.99 = z$

That probability will be such that $n! < z \cdot n^k$, where $z = 0.99$.

$n! < z \cdot n^k$

Let's cancel z out and say $z \approx 1$. Thus, this algorithm should run whenever:

$n! < n^k$

Further:

$\log_n(n!) < \log_n(n^k)$

$\log_n(n!) < k$

$\log(\Gamma(1 + n)) / \log(n) < k$

Stirling's approximation says:

$\log(\Gamma(1 + n)) = ((1 + n) - 1/2) \cdot \log(1 + n) - (1 + n) + (1/2) \cdot \log(2\pi)$

$= (n + 1/2) \cdot \log(1 + n) - n + (1/2) \cdot \log(2\pi) - 1$

So to wrap it all up, our new code should run whenever:

$\frac{(n + 1/2) \cdot \log(1 + n) - n + (1/2) \cdot \log(2\pi) - 1}{\log(n)} < k$

Hope that helps!

@ryanpeach commented:

I would just use the Python set implementation for cases that meet this threshold, and switch back to the old code when they don't. Then you can implement full-on Floyd's algorithm if you ever get a native set :)

@ryanpeach commented:

Note that I need to flip the inequality after taking the log, but the images are hard to go back and update, so I'll just note it here.

@ryanpeach commented Oct 12, 2017

For anyone who wants a method like this in the meantime, here you go:

import numpy as np
import warnings
from scipy.special import gammaln

def choice(a, size=None, replace=True, p=None):
    # n is the population size, k the requested sample size.
    n, k = int(a) if isinstance(a, (int, float)) else len(a), size
    # Take the set-based path only for unweighted sampling without replacement,
    # and only when the threshold derived above says repeated draws are unlikely.
    if n > k >= 3 and replace is False and p is None and \
       (gammaln(n+1)-gammaln(n-k))/np.log(n) > k:
        idx = set()
        while len(idx) < size:
            # Draw the missing amount in one block; the set keeps only new values.
            idx.update(np.random.randint(0, n, size=k - len(idx)))
        idx = np.array(list(idx))
        np.random.shuffle(idx)  # set order is not uniformly random, so shuffle
        return idx if isinstance(a, (int, float)) else a[idx]
    else:
        warnings.warn("Slow choice.")
        return np.random.choice(a, size, replace, p)

Added @bashtage 's suggestion and fixed my math.
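
A quick usage check (arbitrary numbers; the second call fails the threshold test and falls back with the warning):

print(choice(10**9, size=5, replace=False))  # set-based branch: 5 distinct values below 10**9
print(choice(10, size=8, replace=False))     # k is close to n, so it delegates to np.random.choice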

@bashtage (Contributor) commented:

In pure Python one could do better for moderately large sizes by drawing randoms in blocks and using:

# idx is the set being filled; k (== size) is the requested sample size.
m = 0
while m < size:
    idx.update(np.random.randint(0, n, size=k - m))
    m = len(idx)

In fact, I think it should never be worse (depending slightly on add vs. update).

@ryanpeach commented:

I think that is a better implementation! However, it's mathematically equivalent, so I think it's still only O(k) under my constraint equation.

@ryanpeach commented:

Also, the constraint equation gives us a good threshold: it weighs the slowness of pure Python (with the better complexity of this algorithm) against the speed of numpy (with the worse complexity of the current algorithm). Now we have a definition of k << n, and we can use it to choose one over the other.

bashtage added a commit to bashtage/randomgen that referenced this issue Apr 13, 2019
Add tail shuffle to choice and rule to switch between tail and Floyd

xref numpy/numpy#5299
xref numpy/numpy#2764
xref numpy/numpy#9855
bashtage added a commit to bashtage/numpy that referenced this issue Apr 13, 2019
Improve performance in all cases
Large improvement with size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
mattip pushed a commit to mattip/numpy that referenced this issue May 20, 2019
Improve performance in all cases
Large improvement with size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
@bashtage (Contributor) commented Jun 1, 2019

@mattip Fixed in Generator; it can't be fixed in RandomState.

In [18]: g = np.random.Generator()

In [19]: n=np.iinfo(np.int).max

In [20]: n
Out[20]: 2147483647

In [21]: k = 100

In [22]: %timeit g.choice(n,k,replace=False)
31.8 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [23]: %timeit random.sample(range(n),k)
191 µs ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@mattip (Member) commented Jun 1, 2019

OK, closing. For performance, use the new APIs.
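
For reference, the Generator-based API (numpy >= 1.17) looks like this; per the benchmark above, its choice avoids materialising the full range when replace=False:

import numpy as np

rng = np.random.default_rng(12345)                # seeded Generator
idx = rng.choice(10**9, size=100, replace=False)  # 100 distinct values, no full permutation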

@ryanpeach commented:

Where is the new API documented?

@ryanpeach commented:

This is good news, thanks all.

@bashtage (Contributor) commented Jun 5, 2019

@mattip closed this as completed Jun 6, 2019