np.random.choice without replacement or weights is less efficient than random.sample? #2764
It seems one could basically copy-paste the code from np.random.permutation. Should I do that in PR gh-2727? The thing is, this seems like an obvious and simple improvement, but it would leave the random number generator in a different state, so it is probably better to do it now rather than later, since someone might want to rely on reproducible results across versions. |
Actually, sorry, forget that, it's not quite that simple of course... you would probably need a cutoff point at which to switch methods, or only optimize for few rather than many elements taken. |
Or at least have a flag argument that allows the user to indicate that k << n. |
@yoavram, honestly I think an extra flag is likely confusing (and if you want to change it, the k << n special case might be more important, since there is an existing function to handle the other case). If it bothers you, maybe you can implement this? Maybe even with a simple speed test to derive a switching rule (likely, e.g., the R function already has good code to take inspiration from)... I am not quite sure how it's handled, but if it's soon, maybe you can convince people to get it into 1.7 to avoid changing the RNG state later, which might need warnings, etc. |
Going back to this, scipy/scipy#3650 pretty much ironed out the details of how to approach this. |
@yoavram I am happy to see improvements, but take care with the random number stream. There have been some thoughts on how to handle versioning, etc., but I am not quite sure about the state of it right now. Anyway, tests go in the tests/test_random.py file in the random subfolder. |
NB: the current implementation using permutation is also horribly memory-inefficient for k << n (see #5299). And right, historically the main reason we haven't fixed such things is that we haven't wanted to change the implementation of RNG methods, because we want results to be deterministically reproducible across versions when someone uses a seed. The solution to that problem has been discussed extensively, and agreement was reached on how to handle it: adding a "version" concept to RandomState -- there's an implementation in #5911. That PR is good on the actual versioning logic, but is stalled out because it doesn't make sense to merge the logic for handling improved RNG methods until we actually have at least one improved RNG method implementation :-). #5911 actually attempted to do this by improving random.choice, but it turns out that handling the k << n case is somewhat tricky because it requires a set data structure. (If you look at the implementation of …) It's certainly possible to include an efficient set data structure for ints in np.random, which is what we'd need here -- pandas uses a Cython wrapper around khash for such things. Importing khash may be a bit more of a job than you were hoping for, but it is certainly doable :-) |
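To make the memory point concrete, here is a minimal illustration (my own sketch, not code from any of the PRs mentioned; the values of n and k are just for demonstration):

```python
import random

n, k = 10**9, 10

# What np.random.permutation(n)[:k] does conceptually: materialize all n
# indices just to keep k of them (~8 GB of int64 for n = 1e9).
# A set-based draw only ever stores the k chosen indices:
chosen = set()
while len(chosen) < k:
    chosen.add(random.randrange(n))   # rejection sampling; cheap while k << n
print(sorted(chosen))
```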
OK, I'm a bit lost. Can we make a list of action bullets of what needs to be done? And also, what's wrong with python's native |
Todo list:
The problem with Python's native |
The algorithm in (Vitter, 1987) may also be worth considering, because it doesn't rely on a … The algorithm consists of two sub-algorithms, a simple one with … Another implementation of the same algorithm in C can be found here: https://github.com/WebDrake/GrSL

Given the amount and complexity of the code, I don't advocate including it in NumPy. In addition, a subsequent shuffle is required, as the algorithm returns an ordered sample. But maybe it's nice to include in some benchmarks. For comparison, there's also scikit-learn's …

References
(A subscription is needed to read the second article, Google Scholar knows how to find the first one) |
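For orientation, the simpler of Vitter's two sub-algorithms (sequential selection sampling, which runs in O(n) time) can be sketched in a few lines; the O(k)-expected-time Algorithm D that the linked C code implements is far more involved. A rough pure-Python sketch of the simple variant, not taken from that repository:

```python
import random

def selection_sample(n, k):
    """Scan 0..n-1 and keep index i with probability (still needed)/(still remaining).

    This is the simple sequential method (Vitter's "Method A" / Knuth's Algorithm S):
    O(n) time, O(1) extra space, and the sample comes out in sorted order,
    so a subsequent shuffle is needed if a random ordering is required.
    """
    sample = []
    needed = k
    for i in range(n):
        # P(keep i) = needed / (n - i); this yields a uniformly chosen k-subset.
        if random.random() * (n - i) < needed:
            sample.append(i)
            needed -= 1
            if needed == 0:
                break
    return sample
```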
Wow, you weren't kidding about that being complicated :-). Having stared at the code for a bit, I still have no insight into how it works or what the complexity is, but I am highly suspicious of all the use of floating point math. It seems incredibly unlikely that the Vitter algorithm is correct or consistent on processors that cannot do exact real arithmetic (i.e. all of them). One of the first steps in the first function is to cast N to floating point, but one of our motivating use cases (#5299) is … The scikit-learn code doesn't seem to have any big surprises, but agreed that it's a good thing to look at for comparison. |
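A quick illustration of that floating-point concern (my own example, not from the thread): a C double has a 53-bit mantissa, so distinct population sizes above 2**53 collapse to the same value when N is cast to float.

```python
n = 2**62                             # fits in int64, but not in a double's mantissa
print(float(n) == float(n + 1))       # True: both cast to exactly the same double
print(int(float(n + 1)) - (n + 1))    # -1: the cast silently changed the value
```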
Several … and are waiting for a change like #5911 to be merged. Although I don't need this determinism across releases, there is no consensus to break it in numpy. I'd like "We have a good design for how to handle this, and most of the necessary implementation can be found in #5911" to be true, but I don't know what its actual status is. |
AFAIK that's its actual status :-). Is there something about the design that concerns you? The mailing list thread discussing this (and apparently reaching consensus that it's a good way forward) is here: https://mail.scipy.org/pipermail/numpy-discussion/2015-May/072952.html The above comment is a little confusing, because AFAIK the main blocker on #5911 getting merged is actually having a good use case (because we don't like merging code that isn't actually used). If |
But I think currently the allowed range is up to |
Sure, |
I've made a couple Cython implementations of Floyd's algorithm. They only differ by their set data-structures:
I ran benchmarks against 3 other functions:
The algorithm I mentioned earlier (Vitter 1987) I didn't consider any further, because of the precision problems pointed out by njsmith (the article mentions this issue as well). Code and benchmark results are in this gist. I hope this is somewhat useful :) What would be the next step? |
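For readers who haven't seen it, Floyd's algorithm is short enough to sketch in pure Python (an illustration of the idea, not the Cython code from the gist):

```python
import random

def floyd_sample(n, k):
    """Draw k distinct integers from range(n) with Robert Floyd's algorithm.

    Exactly k random draws are made and the working set never holds more than
    k elements, so it is well suited to k << n. Membership tests dominate the
    cost, which is why the choice of set data structure matters so much.
    """
    selected = set()
    for j in range(n - k, n):
        t = random.randint(0, j)          # uniform on {0, ..., j}, inclusive
        selected.add(j if t in selected else t)
    out = list(selected)
    random.shuffle(out)                   # Floyd's method does not randomize the order
    return out
```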
My guess is, you create a PR based on the work of @anntzer in gh-5911, since it sounds like you are probably happy with at least one of the implementations, and PRs are a pretty nice way to comment on the code. Of course, if we still have to discuss the algorithm choice, maybe that should happen first. Either we put in gh-5911 first and then finish your PR, or you can clone anntzer's work and include it / help with the finishing touches. |
Please let me know if you need any help; I'd be happy to see my PR go in. |
I'll try to make a PR... But I've never actually used git, so I will read up on that and need some time to try stuff out. I will ping you @anntzer when I get lost. As for my implementation, I've found there are still some specific conditions where addition to the set has quadratic run time, so I'm also looking into that. (Or rather, the whole sampling process has quadratic run time, but it goes wrong at the set addition...) |
The PR is in. I figured it would be best to keep the discussion separate from the versioning logic and worry about merging later. Maybe GitHub can merge both #5911 and mine without conflicts? Still very new to this... I accidentally referenced this issue from my commit, will try to avoid that in the future :) |
Any progress on this? I implemented my suggestion and showed some math to back it up. |
So this is my code snippet:

    idx = set()
    while len(idx) < size:
        idx.add(self.randint(0, pop_size))
    idx = np.array(list(idx))
    self.shuffle(idx)  # shuffle so the result is not in set-iteration order

And here is my suggestion. NOTE: this math is wrong; the corrected version is P(O(k)) = n! / (n^k (n-k-1)!), which gives the condition log_n(n!) > k + log_n((n-k-1)!).

The way choice is currently written is O(n), where n is your array size. For a small sample from a large array, however, this is very slow. The snippet above has best-case performance O(k), where k is your sample size. The worst case is unfortunately unbounded: there is always some probability that the random draws keep selecting values that are already in the set. However, the probability that it finishes in O(k) time can be computed. Let's say we want to choose this approach when the probability of O(k) time is greater than 0.99, and call that z. That probability will be such that n! < z*n^k, where z = 0.99. Let's cancel z out instead and say z ≈ 1. Thus, this algorithm should run whenever: … Further, Stirling's approximation says: … So to wrap it all up, our new code should run whenever: … Hope that helps! |
I would just keep the python set implementation for things that meet this threshold, and switch back to the old code if they don't. Then you can implement full-on Floyd's algorithm if you ever get a native set :) |
Note that I need to flip the inequality after taking the log, but the images are hard to update, so I'll just say so here, lol. |
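For reference, the standard collision-probability argument behind that threshold can be written out as follows; this is a reconstruction consistent with the gammaln condition in the snippet below, not the original equation images:

```latex
% Probability that k independent draws from {0, ..., n-1} are all distinct,
% i.e. that the rejection loop finishes after exactly k draws:
P(\text{no collision}) = \prod_{i=0}^{k-1} \frac{n-i}{n} = \frac{n!}{(n-k)!\, n^{k}}

% Require this to exceed a threshold z (say z = 0.99) and take logs base n;
% for z close to 1, \log_n z \approx 0:
\log_n n! - \log_n (n-k)! > k + \log_n z \approx k
\quad\Longleftrightarrow\quad
\frac{\ln n! - \ln (n-k)!}{\ln n} > k

% The code below evaluates the left-hand side with scipy.special.gammaln,
% using gammaln(n+1) = ln(n!) and gammaln(n-k) = ln((n-k-1)!).
```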
For anyone who wants a method like this temporarily themselves, here you go:

    import numpy as np
    import warnings
    from scipy.special import gammaln

    def choice(a, size=None, replace=True, p=None):
        # a may be an int-like population size or an array-like to sample from.
        n, k = int(a) if isinstance(a, (int, float)) else len(a), size
        # Fast path: small distinct sample, uniform weights, and the
        # collision-probability condition (via log-factorials) is satisfied.
        if n > k >= 3 and replace is False and p is None and \
                (gammaln(n+1) - gammaln(n-k)) / np.log(n) > k:
            idx = set()
            while len(idx) < size:
                # Draw the remaining indices in one block and keep the distinct ones.
                idx.update(np.random.randint(0, n, size=k - len(idx)))
            idx = np.array(list(idx))
            np.random.shuffle(idx)
            return idx if isinstance(a, (int, float)) else a[idx]
        else:
            warnings.warn("Slow choice.")
            return np.random.choice(a, size, replace, p)

Added @bashtage's suggestion and fixed my math. |
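A quick illustration of calling that wrapper (my own example values, not from the thread); the first call hits the fast set-based path, the second falls back to np.random.choice and emits the warning:

```python
# Tiny sample from a huge population: the log-factorial condition holds.
print(choice(10**8, size=5, replace=False))

# Sample most of a small population: the condition fails, so the slow path runs.
print(choice(10, size=8, replace=False))
```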
In pure Python one could do better for moderately large size by block drawing randoms and using
I think it should never be worse in fact (slightly dependent on |
I think that is a better implementation! However, it's mathematically equivalent, so I think it's still only O(k) under my constraint equation. |
Also, the constraint equation gives us a good threshold: it weighs the slowness of "pure Python" (which has the better complexity) against the speed of numpy (which has the worse complexity of the current algorithm). Now we have a concrete definition of k << n, and we can use it to choose one over the other. |
Add tail shuffle to choice and rule to switch between tail and Floyd xref numpy/numpy#5299 xref numpy/numpy#2764 xref numpy/numpy#9855
Improve performance in all cases Large improvement with size is small xref numpy#5299 xref numpy#2764 xref numpy#9855 xref numpy#7810
@mattip Fixed in Generator, can't be fixed in RandomState:

    In [18]: g = np.random.Generator()
    In [19]: n = np.iinfo(np.int).max
    In [20]: n
    Out[20]: 2147483647
    In [21]: k = 100
    In [22]: %timeit g.choice(n, k, replace=False)
    31.8 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [23]: %timeit random.sample(range(n), k)
    191 µs ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) |
OK, closing. For performance, use the new APIs |
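For anyone landing here later, the Generator-based replacement looks roughly like this (assuming a NumPy version that provides numpy.random.default_rng, i.e. 1.17 or newer):

```python
import numpy as np

rng = np.random.default_rng(12345)                        # PCG64-backed Generator
sample = rng.choice(2**31 - 1, size=100, replace=False)   # fast even though k << n
print(sample[:5])
```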
Where is the new API documented? |
This is good news, thanks all. |
I was interested in choosing k elements out of n without replacement or weighted probabilities, and with k<<n.
I've tested using the following IPython notebook: source / view.
The outcome was that using random.sample(xrange(n), k) was ~50-fold faster than using np.random.permutation(n)[:k], which is the way it is currently coded.
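A minimal way to reproduce that comparison on a modern setup (my own sketch; Python 3's range replaces xrange, and absolute timings will vary by machine and NumPy version):

```python
import timeit

setup = "import random; import numpy as np; n, k = 10**6, 10"
t_sample = timeit.timeit("random.sample(range(n), k)", setup=setup, number=200)
t_perm = timeit.timeit("np.random.permutation(n)[:k]", setup=setup, number=200)
print(f"random.sample:             {t_sample:.4f} s")
print(f"np.random.permutation[:k]: {t_perm:.4f} s")
```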