ENH: tracking issue for merging randomgen into numpy · Issue #13164 · numpy/numpy
Closed · 15 of 16 tasks
mattip opened this issue Mar 20, 2019 · 32 comments

@mattip
Member
mattip commented Mar 20, 2019

Merge bashtage/randomgen into numpy, as part of NEP 19.

  • get cythonize to work properly. Currently we call python -mcython infile outfile with no way to add extra command-line args
  • get the code to compile and import
  • get the tests to run and pass
  • replace mtrand with randomgen
  • work the documentation cleanly into our documentation
  • fix any TODOs
    • decrease module depth by moving numpy.random.randomgen.* into numpy.random
    • rewrite parts of the documentation, since randomgen now lives inside numpy.random rather than outside it
  • benchmark
  • check that the NEP and the implementation are consistent

Post merge checklist

  • work through the other issues and PRs related to random
  • make documentation flow more smoothly with indices and hierarchies
    • integrate examples to Numba, Cython, CFFI, ctypes more tightly
    • refactor out the common BRNG attribute/method documentation
  • discuss deprecating np.random.seed and access to functions via np.random.xxx (see the sketch at the end of this comment)
  • fix vendored code for NumPy style (two-space tabbing and more)

Edit: add more tasks (replace mtrand, fix TODOs)
Edit: add NEP check
Edit: add post-merge tasks
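
As an illustration of the np.random.seed deprecation item above: the discussion is about steering users away from hidden module-level state toward an explicit generator object. A minimal sketch, using the names as they eventually shipped (default_rng and Generator), not anything fixed by this checklist:

import numpy as np

# Legacy pattern under discussion for deprecation: hidden global state shared
# by all module-level np.random.* functions.
np.random.seed(12345)
legacy_sample = np.random.random(3)

# Explicit alternative: a locally owned Generator instance.
rng = np.random.default_rng(12345)
new_sample = rng.random(3)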

@mattip
Member Author
mattip commented Mar 20, 2019

xref #13163

@mattip
Member Author
mattip commented Mar 24, 2019

The first three tasks are completed.

@mattip
Member Author
mattip commented Apr 10, 2019

marked more tasks as completed as #13163 has progressed. Added two more tasks to refactor the namespaces and more tightly integrate documentation

@bashtage
Contributor

Should remove random_integers. This has been deprecated since 1.11.

@mattip
Member Author
mattip commented Apr 10, 2019

Should remove random_integers. This has been deprecated since 1.11.

Edit: reverted this change. We need to discuss this more widely; there still seem to be some use cases for random_integers, since with an np.int-typed max value, high + 1 will overflow.

@bashtage
Contributor
bashtage commented Apr 14, 2019

A summary of issues/PRs that I think are fixed in randomgen:

@bashtage
Contributor

@mattip This isn't quite right. You can generate every possible value of a dtype using randint, including the max value. random_integers is redundant.

@bashtage
Contributor

For example

import numpy as np
from randomgen import RandomGenerator, MT19937
rg = RandomGenerator(MT19937(0))
rg.randint(0, 2**64, dtype=np.uint64)
rg.randint(0, 2**32, dtype=np.int32)

@charris
Member
charris commented Apr 15, 2019

I think the problem comes when the range is a NumPy integer type, which is unavoidable for broadcasting if we go that way. Python ints are fine. In the old generator the C-level routine worked with closed intervals, so all that was needed was 2**64 - 1, which is a safe 64-bit value. That said, I don't know how randomgen does things.
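
A small sketch of the overflow described here, using only plain NumPy scalar arithmetic; int64 is used because its wrap-around matches the output quoted later in this thread:

import numpy as np

high = np.int64(2**63 - 1)  # maximum representable int64
wrapped = high + 1          # wraps to -2**63 and emits a RuntimeWarning (overflow)
exact = int(high) + 1       # Python ints are unbounded, so this stays exact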

@eric-wieser
Member

The problem @charris mentions has come up on GitHub at least once before. Some options I think I remember seeing:

  • randint(0, 2**64-1, closed=True), randint(0, 2**64-1, bounds='closed') or similar spellings to use a closed interval instead of a half-open interval
  • randint(0, 0, allow_empty=False, dtype=np.uint64) or similar spellings to treat the upper bound as if it had overflowed (eg, interpret x, x as the interval [x, x+2**64) instead of the empty interval). In theory allow_empty could vectorize, allowing the user to pass in an out-of-band bit for each item

@bashtage
Contributor

The underlying C generation code is the same and works on closed intervals so that there are no issues with types in C. The value checking and downcasting are handled using object dtype for u/int64, and the next largest type for all other types.

Broadcasting also works in extreme cases,

import numpy as np
from randomgen import RandomGenerator, MT19937

g = RandomGenerator(MT19937())
g.randint(np.zeros((3, 1)), np.array([2**64] * 3), dtype='uint64')
Out[6]: 
array([[10573188468909976714,   520699726749771223, 16676690052973521388],
       [ 8784019325716569234, 17221063269744199451, 14319133892835349315],
       [ 4530430698426414801,  4248163222813391406,  8521872816997754320]],
      dtype=uint64)

@eric-wieser
Member

The danger comes with code like

arr = np.array(..., dtype=np.int64)  # including int64 max
range_min = arr.min()
range_max = arr.max()
rand_arr = randint(range_min, range_max+1)

How does this behave?

@bashtage
Contributor

You get ValueError: low >= high, since the addition overflows:

range_min, range_max+1
C:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: overflow encountered in longlong_scalars
  """Entry point for launching an IPython kernel.
Out[9]: (-9223372036854775808, -9223372036854775808)

On the other hand, converting the bound to a Python int first works:

rand_arr = g.randint(range_min, int(range_max)+1)

Now if we want to get rid of randint and only have a bounded interval generator, then that would allow the code to be simplified.

@bashtage
Contributor

I still think random_integers should be removed, at least for now. It could always be brought back later with a more reasonable set of options and choices. There are three reasons why I think that removal is the right choice:

  1. It is completely redundant. It is just a specific, wrapped call to randint with dtype='l'. It doesn't even help in the one case where it theoretically could, since it doesn't cast to a Python int internally, so adding 1 to an np.int64 wraps here and you get a high < low error.
In [6]: np.random.random_integers(0,np.int64(2**63-1))

/home/kevin/miniconda3/bin/ipython:1: DeprecationWarning: This function is deprecated. Please call randint(0, 9223372036854775807 + 1) instead
  #!/home/kevin/miniconda3/bin/python
/home/kevin/miniconda3/bin/ipython:1: RuntimeWarning: overflow encountered in long_scalars
  #!/home/kevin/miniconda3/bin/python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a7b97ffba7f7> in <module>
----> 1 np.random.random_integers(0,np.int64(2**63-1))

mtrand.pyx in mtrand.RandomState.random_integers()

mtrand.pyx in mtrand.RandomState.randint()

ValueError: Range cannot be empty (low >= high) unless no samples are taken
  2. Its type is platform dependent.
  3. The error messages returned by random_integers are wrong.

@bashtage
Contributor

The closed=True option seems like the easiest fix. It would require virtually no changes except for the u/int64 paths.

@bashtage
Contributor
bashtage commented Apr 15, 2019

Should versionadded tags be removed since this module is fresh?

I mean only in RandomGenerator, not in RandomState

@bashtage
Contributor

mattip#15 contains a patch that adds the closed kwarg. It isn't going to be merged yet, so that the merits of this approach can be discussed after #13163 is merged.

@mattip
Member Author
mattip commented May 14, 2019

randint(..., closed=False) in mattip#15 was merged, so the new randomgen will not have random_integers.
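
For reference, in the API that eventually shipped the closed-interval behaviour is exposed as the endpoint keyword of Generator.integers (a different spelling than the closed kwarg discussed here), which makes the full uint64 range reachable without ever forming high + 1:

import numpy as np

rng = np.random.default_rng(12345)
# endpoint=True makes the upper bound inclusive, so the whole uint64 range
# [0, 2**64 - 1] can be sampled without overflowing a 64-bit bound.
rng.integers(0, 2**64 - 1, size=3, dtype=np.uint64, endpoint=True)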

@mattip
Member Author
mattip commented May 29, 2019

I went quickly through the issues and PRs mentioned above and closed the ones I could do without too much thinking. I will make another pass through soonish. There are still a few open, as well as all those labelled with numpy.random

@bashtage
Contributor

Do you think some of these need additional tests?

@bashtage
Contributor
bashtage commented Jun 1, 2019

I think #7861 can be closed now since one can use the new random API in low-level Cython code.

@mattip
Member Author
mattip commented Jun 6, 2019

Summary of the present status:
The remaining items in the top comment's checklist are enhancements that can be done moving forward, for instance documentation improvements and creating a base class for the BitGenerators.

Looking at the issues and PRs marked with numpy.random, the following seem to be possible blockers for final acceptance of the randomgen branch:

A number of other issues propose enhancements and clarifications on best practices, like #6132, or #9650 that can be done on an ongoing basis.

@charris
Member
charris commented Jun 6, 2019

I think zipf (and related) needs a range keyword for when a is near 1. The idea is to rescale the sampling so that most draws are in a valid, representable range rather than inf. Doing so speeds things up enormously. Might need a new function name (szipf?, zipfm1?).

@charris
Member
charris commented Jun 6, 2019

To clarify a bit, the sample scaling is one problem, the other is that with values of a near 1, one wants the small valued tail, and the range sets the limit on the tail. Previously the tail limit was set by the size of long integers.
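
A quick illustration of that heavy tail, using the Generator.zipf that shipped (the proposed range keyword / szipf variant does not exist; exact numbers will vary from run to run):

import numpy as np

rng = np.random.default_rng(0)
# With a close to 1 the tail is extremely heavy: typical draws are dwarfed by
# occasional draws many orders of magnitude larger, historically capped only
# by the size of a C long.
draws = rng.zipf(a=1.1, size=100_000)
int(np.median(draws)), int(draws.max())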

@rkern
Member
rkern commented Jun 12, 2019

A couple more things:

  • There is now a better algorithm, thanks to @imneme, for deriving starting states from seeds than the ones we have currently. I'd like to see us use it consistently. It ensures that each bit in the provided seed data has a chance of affecting each bit in the requested output, which means that even sequential seeds, changing only in the LSBs of a very long sequence of bits, will give a good spread in the state space (see the sketch after this list). Which is just a nice property on its own, but it also:

  • Allows for reproducible PRNG spawning without needing to plan ahead, like .jumped() requires. The guarantees are "merely" probabilistic, but given the strength of the avalanche properties of the seed algorithm and the state sizes of the PRNGs under consideration, I am willing to stake my reputation on recommending this. I've tested PCG32 (the one most likely to fail on size-of-state grounds and what little architectural similarity to the hashing algorithm it has), PCG64, and Xoroshiro256 out to a few TiB each on PractRand with 4096 streams derived by this spawning algorithm (DSFMT fails earlier, but DSFMT just fails with 1 stream). I consider the likelihood of user error with .jumped() much higher than the chance of a collision in the streams. We have a good chance to advance best practices here for a modern set of dynamic parallel programming tools.

  • I'd like to remove the settable-increments from PCG32 and PCG64 and just treat them as things that get seeded. Users won't pick good ones. We could get around it by hashing the user-provided increment, but I think seeding it is better (and using either .spawn() or .jumped() to get new streams). This would assist in aligning the feature sets of all of the PRNGs, making them more easily replaced with each other. That alleviates some of the worry of adding PCG64 as the default (insofar as having a default with a more full-featured API might cause people to rely on that API).
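
A small sketch of the seeding behaviour described in the first bullet, spelled with the SeedSequence API that later shipped in NumPy (released names; nothing here was fixed at the time of this comment):

import numpy as np

# Sequential seeds differ only in their lowest bits, but SeedSequence's hashing
# spreads them across the whole space of the derived state words.
for seed in (0, 1, 2):
    ss = np.random.SeedSequence(seed)
    print(seed, ss.generate_state(2))  # two well-mixed 32-bit words per seed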

@rkern
Member
rkern commented Jun 12, 2019

Oops, a slightly old version of that got copy-pasted. I also tried Philox and ThreeFry successfully as well, but I was really unconcerned about them.

This is how I envisioned SeedSequence to be used, to make it explicit:

The BitGenerator constructors can all take an explicit ISeedSequence instance (which could be SeedSequence or anything that implements that interface) or any one of the inputs that SeedSequence takes (which will be converted to SeedSequence) OR a properly-formatted state dict for bitgen.state['state'] if someone wants complete control over the state. This would make the constructors more like each other instead of per-BitGenerator ad hoc arguments like ThreeFry's and Philox's key= argument and PCG32's and PCG64's inc= argument.

# Each of these would be valid ways to instantiate a Philox BitGenerator.
np.random.Philox()  # -> Philox(SeedSequence())
np.random.Philox(None)  # -> Philox(SeedSequence(None))
np.random.Philox(12345)  # -> Philox(SeedSequence(12345))
np.random.Philox([314159, 271828])  # -> Philox(SeedSequence([314159, 271828]))
np.random.Philox(SeedSequence(12345))
np.random.Philox(MyOwnSeedSequenceImplementation(12345))
np.random.Philox({'counter': np.array([0x01, 0, 0, 0], dtype=np.uint64),
                  'key': np.array([ 7639784375349706623, 18058138874403901375], dtype=np.uint64)})

Why the ISeedSequence ABC instead of just the one implementation? At the very least, we need an implementation of the old MT19937 seeding algorithm to reproduce RandomState's pre-1.17 behavior. So RandomState.__init__(bit_generator=None) would have something like:

elif not hasattr(bit_generator, 'capsule'):
    bit_generator = _MT19937(LegacyMTSeedSequence(bit_generator))

Other generator authors also provide their own recommended ways of expanding integer seeds out to internal states. I'm not concerned about replacing those with SeedSequence for general use (I think this algorithm is at least as good as any of them), but people may want to implement those for comparison purposes with other systems.

We discussed the spawning in the most recent Development Meeting. I think the general consensus was not to expose the API at the level of the Generator, yet. We should probably get a little bit of experience with this out in the wild first before including it in the most prominent API. For 1.17, people experimenting with this facility would instantiate their root SeedSequence, explicitly pass its spawn out to the processes that need it, and use np.random.default_gen(seed_seq) to get the Generator for each. It's possible that we will need to have BitGenerators own and expose the SeedSequence that they create or are passed in order to enable this future API. We won't be able to add it later because third-party implementations won't do that. On the other hand, it's possible that we can implement the handling of SeedSequence in BitGenerator.__cinit__() or BitGenerator.__init__() and make sure all implementations call those. I think that would let third-party implementations not have to handle it explicitly and let us add the ownership at a later time. There might be some Cython details that preclude that; I'm not sure.
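
A sketch of that 1.17-era workflow, using the names as they were eventually released (default_rng rather than the default_gen mentioned above):

import numpy as np

# Instantiate the root SeedSequence once, spawn children explicitly, and hand
# each child to the process or task that needs its own independent stream.
root_ss = np.random.SeedSequence(20190612)
child_seeds = root_ss.spawn(4)  # one child per worker
generators = [np.random.default_rng(ss) for ss in child_seeds]
samples = [g.standard_normal(2) for g in generators]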

I'm also happy to drop the program_entropy= argument for this release. It's something we can add in later when I am able to formulate a better elevator pitch for the concept.

@mattip
Member Author
mattip commented Jun 13, 2019

In addition to changing seed and jumped, we need to decide about:

  • which BitGenerators to include in the upcoming numpy release. The general feeling in the last community meeting was to go for a minimum rather than a maximum. That means including two: the legacy MT19937 and the Generator default.
  • which BitGenerator should be the Generator default. Despite the performance concerns, it was felt PCG64 is the best option, even though it is slow on some platforms.

@bashtage: any thoughts? Especially around the idea to continue to maintain a repo of alternative BitGenerators that would not be directly part of this repo?

@seberg
Member
seberg commented Jun 13, 2019

As much as I like the SeedSequence idea, I am a bit unsure whether we need to expose it. I understand the argument that it hides spawn from the Generator or BitGenerator at the moment. But in the long term, there seems little usage for SeedSequence directly? Except maybe that if we carry around a SeedSequence object implicitly that could mean carrying around a largish array if someone uses a large array for seeding.

I suppose the argument of being able to replace the actual seeding procedure is valid, but I am unsure if it is an actual likely use case.

Or is there an API reason for wanting to use a SeedSequence for anything more than spawn? (I could imagine spawning a different BitGenerator from an existing one, thus wanting a/its SeedSequence as a start.)

On the other hand, if we use it internally anyway, we could just deprecate it at some point and keep it around indefinitely without much harm....

@rkern
Member
rkern commented Jun 13, 2019

To me, the costs entailed by exposing it are not much. Since the BitGenerators do provide stream compatibility guarantees, the bulk of the usual indefinite maintenance cost is already priced in; we have to maintain the algorithm indefinitely. It's just a matter of the precise public API, which can be managed with deprecations, if we need to.

Since exposing SeedSequence as part of the documented API lets us delay promoting a Generator.spawn() API until we have built up some confidence in the concept, I think that cost is worthwhile.

I think the ISeedSequence interface also fits in with the basic concept of this redesign, like the Generator/BitGenerator split. We're giving advanced developers the tools to plug in different implementations and experiment with new options. For example, someone wanting to use the counter-mode crypto PRNGs with strictly incrementing keys to get independent streams. At the same time, no one has to touch any of that, or even think about it; they will just use np.random.default_gen(seed) most of the time.
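
To make the two levels concrete, a minimal sketch using the released names (default_rng, Generator, Philox; default_gen above became default_rng):

import numpy as np

# Most users: pass a seed and never think about SeedSequence at all.
rng = np.random.default_rng(12345)

# Advanced users: build the SeedSequence explicitly and feed it to a specific
# BitGenerator, here Philox, wrapped in a Generator.
ss = np.random.SeedSequence(12345)
advanced_rng = np.random.Generator(np.random.Philox(ss))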

@mattip
Member Author
mattip commented Aug 18, 2019

Restating the remaining open random issues from this comment

A number of other issues propose enhancements and clarifications on best practices, like #6132, or #9650 that can be done on an ongoing basis.

@rkern, @bashtage any opinions?

@bashtage
Contributor

I am:

I think #6132 is . #9650 is solved using SeedSequence although it isn't automatic I think.

@mattip
Member Author
mattip commented Dec 8, 2019

I would like to close this issue, since it has served its purpose while we were intensively working on random. Not all the issues referenced are closed, but I don't think keeping this open raises the chance of closing those. Please reopen with more details if you see a purpose in keeping this open.

mattip closed this as completed Dec 8, 2019