8000 ENH: Add Floyd's algorithm for RandomState.choice, see #2764 · Pull Request #6801 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Add Floyd's algorithm for RandomState.choice, see #2764 #6801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

ENH: Add Floyd's algorithm for RandomState.choice, see #2764 #6801

wants to merge 3 commits into from

Conversation

ghost
Copy link
@ghost ghost commented Dec 9, 2015

Implement Floyd's algorithm for random sampling without replacement.

This change allows for a faster np.random.choice without replacement or weights, see issue #2764. It requires versioning logic to keep the reproducibility of seeded results, see e.g. discussion in issue #5299 and PR #5911. But I figured it would be best to keep the discussions separate and worry about merging later.. Any comment is appreciated.

Reference:
Bentley, J. Floyd, B., Programming Pearls: A Sample Of Brilliance.
Communications of the ACM, Vol.11, No.9, 1987.

Implement Floyd's algorithm for random sampling without replacement.
Reference:
Bentley, J. Floyd, B., Programming Pearls: A Sample Of Brilliance.
Communications of the ACM, Vol.11, No.9, 1987.

The crux is the implementation of the hash set. After some benchmarks
I settled on the same basic design as used by "khash", see
github.com/attractivechaos/klib/blob/master/khash.h

This means:
* Open-addressing scheme, with quadratic probing.
* The size of the 
8000
underlying array is a power of 2.
* The maximum load factor is chosen to be 0.77.

Other implementation details:
* Negative values are used as flags for empty slots.
* A simple randomizing hash function is used to handle the insertion
  of consecutive integers.
cdef int _floyd_add(self, long key, long *set, npy_intp size) nogil:
cdef long mask, step, i
mask = size - 1
i = 1103515245 * key + 12345
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need some non-trivial comments here :-)

@njsmith
Copy link
Member
njsmith commented Dec 10, 2015

My biggest concern about this is how the ad hoc set implementation might restrict the range of supported values. Ideal would be if we can support any pop_size up to 2**64; it should in any case be documented. IIUC the current implementation only supports up to 2**31 (on 32-bit systems or 64-bit Windows) or 2**63 (on 64-bit non-Windows)?

One way to move the limit up to 2**64 uniformly would be to use uint64 for the hash set, reserve UINT64_MAX for the sentinel value, and store a separate out-of-line flag for whether UINT64_MAX is in the set or not (since we can't store it in the actual hash table).

Any reason we can't just use khash itself btw? It's not a ton of code, but still open-coding a hash set implementation isn't terribly hygenic -- it's easy to get subtle details wrong, end up with a dozen subtly different implementations in different parts of the code, etc.

@ghost
Copy link
Author
ghost commented Dec 10, 2015

It's true, some comments here and there wouldn't hurt! But I wasn't really sure about the appropriate amount and style here. To be clear, conceptually my set implementation is almost exactly the same as khash. But without the need for type generality, removing, resizing, and rehashing, it can be implemented in only a few lines of code. I'm positive we can have a bug-free "ad hoc" implementation.

The only thing I'm not completely sure about is my choice of hash function. The 1103515245 * key + 12345 is a linear congruential RNG, which takes care of spreading out consecutive integers that, in Floyd's algorithm, are added to the set when pop_size is not much bigger than the sample size. I found that without it, under some specific conditions the performance of the set degraded.

But just using khash itself is also an option.. Because it's so macro-heavy I don't think it can be used directly from Cython, but with a separate *.c file it's not too bad. (I threw together something that seems to work.) IMO it's overkill for only this use case, but if there are other places in Numpy where a hash set/map is useful then maybe it's worthwhile?

...the range of supported values.

I made a deliberate choice to only support up to LONG_MAX, to have a consistent return dtype and to keep the code as simple as possible. I modeled this behavior after randint and geometric, which always return something of np.int_ dtype. But I'll gladly change it if necessary.

ldoddema added 2 commits December 12, 2015 22:54
Benchmarks indicate that a load factor of 0.77 is really on the edge 
of exhibiting quadratic complexity for the algo as a whole, without a 
hash function. So for robustness it may be desirable to either reduce 
the load factor or indeed use a hash function.
@ghost
Copy link
Author
ghost commented Dec 12, 2015

I added some comments to the code and removed the unconventional hash function.

Without the hash function, the overall complexity could degrade to O(k2) depending on the choice for maximum load factor (1 / α). For a particular worst-case scenario that I tested against, the currently hard-coded load factor of 1 / 1.3 seems to be acceptable, at least for n/k >= 2:
image

This leaves the concerns about limited range and ad-hoc implementation intact though, but I don't have any new thoughts about those...

@charris
Copy link
Member
charris commented Dec 20, 2015

Closing and opening for testing.

@charris charris closed this Dec 20, 2015
@charris charris reopened this Dec 20, 2015
@ghost
Copy link
Author
ghost commented Jun 21, 2016

I stopped working on this a while ago, closing for the sake of cleanup.

@ghost ghost closed this Jun 21, 2016
@charris
Copy link
Member
charris commented Jun 21, 2016

Thanks for attending to that.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0